-rw-r--r-- | docs/Simple/main.pdf | bin | 91045 -> 91896 bytes | |||
-rw-r--r-- | docs/Simple/main.tex | 73 |
2 files changed, 39 insertions, 34 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
Binary files differ
index c6311c8..45cb024 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index cbe4bc0..1796902 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -40,7 +40,7 @@ bnewbold@archive.org \\ \begin{abstract}
-As part of its scholarly data efforts, the Internet Archive releases a citation
+As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
 graph dataset, named \emph{refcat}, derived from scholarly publications and additional data sources. It is composed of data gathered by the fatcat cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
@@ -100,7 +100,7 @@ Microsoft Academic Graph\citep{sinha2015overview} is comprised of a number of entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}} with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at \url{https://archive.org/details/mag-2021-06-07}} the
-\emph{PaperReferences} relation contains 1,832,226,781 edges across 123,923,466
+\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
 bibliographic entities. Numerous other projects have been or are concerned with various aspects of
@@ -113,16 +113,14 @@ citations is not expected to shrink in the future. \section{Dataset}
-We release the first version of the Fatcat Reference dataset (refcat)
+We release the first version of the \emph{refcat} dataset
 in an format used internally for storage and to serve queries (and which we call \emph{biblioref} or \emph{bref} for short). The dataset includes metadata
-from fatcat and the Open Library Project, links to archived pages in
-the Wayback Machine as well as inbound links from the English Wikipedia.
+from fatcat and the Open Library Project and inbound links from the English Wikipedia.
 The format contains source and target (fatcat release and work) identifiers, a few attributes from the metadata (such as year or release stage) as well as
-information about the match provenance (like match status or reason). For ease
-of use, we include DOI as well, if available.
+information about the match status and provanance.
 The dataset currently contains 1,323,423,672 citations across 76,327,662 entities (55,123,635 unique source and 60,244,206 unique target work identifiers).
@@ -130,25 +128,28 @@ The majority of matches - 1,250,523,321 - are established through identifier based matching (DOI, PMIC, PMCID, ARXIV, ISBN). 72,900,351 citations are established through fuzzy matching.
-The majority of DOI based matches between ASREF and COCI overlap, as can be
-seen in~\ref{table:cocicmp}.
+The majority of DOI based matches between \emph{refcat} and COCI overlap, as can be
+seen in~Table~\ref{table:cocicmp}.
 \begin{table}[] \begin{center} \begin{tabular}{ll} \toprule
-\bf{Dataset} & \bf{Count} \\
+\bf{Set} & \bf{Count} \\
 \midrule COCI (C) & 1,094,394,688 \\
- ASREF-DOI (A) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
- C $\cap$ A & \\
- C $\cup$ A & \\
- C $\setminus$ A & \\
- A $\setminus$ C &
+ \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
+ C $\cap$ R & 1,007,539,966 \\
+ C $\cup$ R & 1,390,278,521 \\
+ C $\setminus$ R & 86,854,309 \\
+ R $\setminus$ C & 295,884,246
 \end{tabular} \vspace*{2mm}
- \caption{Comparison between COCI and REFCAT-DOI, a subset of REFCAT where entities have a known DOI.}
+ \caption{Comparison between COCI and \emph{refcat-doi}, a subset of
+\emph{refcat} where entities have a known DOI. At least 50\% of the 295,884,246
+references only in \emph{refcat-doi} come from links between datasets (GBIF,
+DOI prefix: 10.15468).}
 \label{table:cocicmp} \end{center} \end{table}
@@ -164,8 +165,8 @@ TODO: how matches are established and a short note on overlap with COCI DOI. \section{System Design} The constraints for the systems design are informed by the volume and the
-variety of the data. The capability to run the graph whole derivation on a
-single machine (commodity hardware) was a minor goal as well. In total, the raw inputs amount to a few
+variety of the data. The capability to run the whole graph derivation on a
+single machine was a minor goal as well. In total, the raw inputs amount to a few
 TB of textual content, mostly newline delimited JSON.
 More importantly, while the number of data fields is low, certain schemas are very partial with hundreds of different combinations of available field values found in the raw
@@ -177,7 +178,7 @@ based structured data extraction tools. Each combination of fields may require a slightly different processing path. For example, references with an Arxiv identifier can be processed differently from references with only a title. Over 50\% of the raw reference data comes
-from a set of eight field manifestations, as listed in
+from a set of eight field set manifestations, as listed in
 Table~\ref{table:fields}. \begin{table}[]
@@ -206,27 +207,31 @@ issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any va Overall, a map-reduce style approach is followed, which allows for some uniformity in the overall processing. We extract (key, document) tuples (as
-TSV) from the raw JSON data and sort by key. Then we group documents with the
-same key into groups and apply a function on each group in order to generate
-our target schema (currently named biblioref, or bref for short) or perform
-addition operations (such as deduplication).
+TSV) from the raw JSON data and sort by key. We then group documents with the
+same key and apply a function on each group in order to generate
+our target schema or perform
+additional operations such as deduplication or fusion of matched and unmatched references.
 The key derivation can be exact (like an identifier like DOI, PMID, etc) or
-based on a normalization procedure, like a slugified title string. For
-identifier based matches we can generate the target biblioref schema directly.
-For fuzzy matching candidates, we pass possible match pairs through a
-verification procedure, which is implemented for release entity schema pairs.
-The current verification procedure is a domain dependent rule based
-verification, able to identify different versions of a publication,
-preprint-published pairs or or other kind of similar documents by calculating
-similarity metrics across title and authors. The fuzzy matching approach is
-applied on all reference documents, which only have a title, but no identifier.
+based on a value normalization, like slugifying a title string. For identifier
+based matches we can generate the target schema directly. For fuzzy matching
+candidates, we pass possible match pairs through a verification procedure,
+which is implemented for \emph{release entity} pairs. This procedure is a
+domain dependent rule based verification, able to identify different versions
+of a publication, preprint-published pairs and documents, which are
+are similar by various metrics calculated over title and authors. The fuzzy matching
+approach is applied on all reference documents without identifier (a title is
+currently required).
 With a few schema conversions, fuzzy matching can be applied to Wikipedia articles and Open Library (edition) records as well. The aspect of precision and recall are represented by the two stages: we are generous in the match candidate generation phase in order to improve recall, but we are strict during
-verification, in order to control precision.
+verification, in order to control precision. Quality assurance for verification is
+implemented through a growing list of test cases of real examples from the catalog and
+their expected or desired match status\footnote{The list can be found under:
+\url{https://gitlab.com/internetarchive/cgraph/-/blob/master/skate/testdata/verify.csv}.
+It is helpful to keep this test suite independent of any specific programming language.}.
 \section{Future Work}
@@ -264,7 +269,7 @@ Foundation}. We like to thanks various teams at the Internet Archive for providing necessary infrastructure, and also data processing expertise.
 We are also indebted to various open source software tools and their maintainers as well as open scholarly data projects - without those this work would be much
-harder or not possible at all.
+harder if possible at all.