author    Martin Czygan <martin.czygan@gmail.com>  2021-08-07 04:59:04 +0200
committer Martin Czygan <martin.czygan@gmail.com>  2021-08-07 04:59:04 +0200
commit    95f21aea779b304f4f43aed12ef27ee8c954b0c8 (patch)
tree      ff3be0c214c6fffbc5475b7c991e1e92107f1960
parent    6b6c80be450f3f8eeca201a15c4c2b83386e2a4c (diff)
wip: complete table
-rw-r--r--  docs/Simple/main.pdf  bin  91045 -> 91896 bytes
-rw-r--r--  docs/Simple/main.tex  73
2 files changed, 39 insertions, 34 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index c6311c8..45cb024 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
Binary files differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index cbe4bc0..1796902 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -40,7 +40,7 @@ bnewbold@archive.org \\
\begin{abstract}
-As part of its scholarly data efforts, the Internet Archive releases a citation
+As part of its scholarly data efforts, the Internet Archive releases the first version of a citation
graph dataset, named \emph{refcat}, derived from scholarly publications and
additional data sources. It is composed of data gathered by the fatcat
cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
@@ -100,7 +100,7 @@ Microsoft Academic Graph\citep{sinha2015overview} is comprised of a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
\url{https://archive.org/details/mag-2021-06-07}} the
-\emph{PaperReferences} relation contains 1,832,226,781 edges across 123,923,466
+\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
bibliographic entities.
Numerous other projects have been or are concerned with various aspects of
@@ -113,16 +113,14 @@ citations is not expected to shrink in the future.
\section{Dataset}
-We release the first version of the Fatcat Reference dataset (refcat)
+We release the first version of the \emph{refcat} dataset
in a format used internally for storage and to serve queries (and which we
call \emph{biblioref} or \emph{bref} for short). The dataset includes metadata
-from fatcat and the Open Library Project, links to archived pages in
-the Wayback Machine as well as inbound links from the English Wikipedia.
+from fatcat and the Open Library Project and inbound links from the English Wikipedia.
The format contains source and target (fatcat release and work) identifiers, a
few attributes from the metadata (such as year or release stage) as well as
-information about the match provenance (like match status or reason). For ease
-of use, we include DOI as well, if available.
+information about the match status and provenance.
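To make the record shape concrete, here is a minimal sketch of one such citation record; the field names are illustrative assumptions for exposition only, not the actual bref schema.

    # Illustrative sketch of a single citation record in the described format.
    # All field names are assumptions for exposition, not the actual bref schema.
    example_bref_record = {
        "source_release_ident": "<fatcat release ident>",
        "target_work_ident": "<fatcat work ident>",
        "source_year": 2018,
        "source_release_stage": "published",
        "match_status": "exact",       # or "fuzzy", etc.
        "match_provenance": "doi",     # how the link was established
    }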
The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work identifiers).
@@ -130,25 +128,28 @@ The majority of matches - 1,250,523,321 - are established through identifier
based matching (DOI, PMID, PMCID, ARXIV, ISBN). 72,900,351 citations are
established through fuzzy matching.
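These two match types partition the reported total:

    1,250,523,321 (identifier based) + 72,900,351 (fuzzy) = 1,323,423,672 citations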
-The majority of DOI based matches between ASREF and COCI overlap, as can be
-seen in~\ref{table:cocicmp}.
+The majority of DOI based matches between \emph{refcat} and COCI overlap, as can be
+seen in~Table~\ref{table:cocicmp}.
\begin{table}[]
\begin{center}
\begin{tabular}{ll}
\toprule
-\bf{Dataset} & \bf{Count} \\
+\textbf{Set} & \textbf{Count} \\
\midrule
COCI (C) & 1,094,394,688 \\
- ASREF-DOI (A) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
- C $\cap$ A & \\
- C $\cup$ A & \\
- C $\setminus$ A & \\
- A $\setminus$ C &
+ \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
+ C $\cap$ R & 1,007,539,966 \\
+ C $\cup$ R & 1,390,278,521 \\
+ C $\setminus$ R & 86,854,309 \\
+ R $\setminus$ C & 295,884,246
\end{tabular}
\vspace*{2mm}
- \caption{Comparison between COCI and REFCAT-DOI, a subset of REFCAT where entities have a known DOI.}
+ \caption{Comparison between COCI and \emph{refcat-doi}, a subset of
+\emph{refcat} where entities have a known DOI. At least 50\% of the 295,884,246
+references found only in \emph{refcat-doi} come from links between datasets (GBIF,
+DOI prefix: 10.15468).}
\label{table:cocicmp}
\end{center}
\end{table}
@@ -164,8 +165,8 @@ TODO: how matches are established and a short note on overlap with COCI DOI.
\section{System Design}
The constraints for the system design are informed by the volume and the
-variety of the data. The capability to run the graph whole derivation on a
-single machine (commodity hardware) was a minor goal as well. In total, the raw inputs amount to a few
+variety of the data. The capability to run the whole graph derivation on a
+single machine was a minor goal as well. In total, the raw inputs amount to a few
TB of textual content, mostly newline delimited JSON. More importantly, while
the number of data fields is low, certain schemas are very partial with
hundreds of different combinations of available field values found in the raw
@@ -177,7 +178,7 @@ based structured data extraction tools.
Each combination of fields may require a slightly different processing path.
For example, references with an Arxiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data comes
-from a set of eight field manifestations, as listed in
+from eight distinct field combinations, as listed in
Table~\ref{table:fields}.
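A minimal sketch of such field-dependent routing follows; both the field names and the path labels are illustrative assumptions, not the actual refcat pipeline.

    # Sketch: pick a processing path for a raw reference based on which fields
    # are present. Field names and path labels are illustrative assumptions.
    def processing_path(ref: dict) -> str:
        if ref.get("doi"):
            return "identifier-doi"
        if ref.get("arxiv_id"):
            return "identifier-arxiv"
        if ref.get("pmid") or ref.get("pmcid"):
            return "identifier-pubmed"
        if ref.get("title"):
            return "fuzzy-title"   # only a title: fall back to fuzzy matching
        return "unmatched"         # too little metadata to attempt a match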
\begin{table}[]
@@ -206,27 +207,31 @@ issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any va
Overall, a map-reduce style approach is followed, which allows for some
uniformity in the overall processing. We extract (key, document) tuples (as
-TSV) from the raw JSON data and sort by key. Then we group documents with the
-same key into groups and apply a function on each group in order to generate
-our target schema (currently named biblioref, or bref for short) or perform
-addition operations (such as deduplication).
+TSV) from the raw JSON data and sort by key. We then group documents with the
+same key and apply a function to each group in order to generate
+our target schema or perform
+additional operations such as deduplication or fusion of matched and unmatched references.
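A minimal sketch of this sort-and-group step, assuming the (key, document) pairs arrive as a two-column TSV already sorted by key (for example with GNU sort); the function and column layout are illustrative, not the actual implementation.

    # Sketch: stream pre-sorted (key, JSON document) TSV rows from stdin,
    # group rows that share a key, and apply a function to each group.
    import itertools
    import json
    import sys

    def process_group(key, docs):
        # Placeholder: emit target schema records, deduplicate, etc.
        print(key, len(docs))

    rows = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for key, group in itertools.groupby(rows, key=lambda row: row[0]):
        docs = [json.loads(doc) for _, doc in group]
        process_group(key, docs)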
The key derivation can be exact (an identifier such as DOI, PMID, etc.) or
-based on a normalization procedure, like a slugified title string. For
-identifier based matches we can generate the target biblioref schema directly.
-For fuzzy matching candidates, we pass possible match pairs through a
-verification procedure, which is implemented for release entity schema pairs.
-The current verification procedure is a domain dependent rule based
-verification, able to identify different versions of a publication,
-preprint-published pairs or or other kind of similar documents by calculating
-similarity metrics across title and authors. The fuzzy matching approach is
-applied on all reference documents, which only have a title, but no identifier.
+based on a value normalization, like slugifying a title string. For identifier
+based matches we can generate the target schema directly. For fuzzy matching
+candidates, we pass possible match pairs through a verification procedure,
+which is implemented for \emph{release entity} pairs. This procedure is a
+domain-dependent, rule-based verification, able to identify different versions
+of a publication, preprint-published pairs, and documents that are
+similar by various metrics calculated over title and authors. The fuzzy matching
+approach is applied to all reference documents without an identifier (a title is
+currently required).
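A minimal sketch of a slug-style key derivation and a rule-based verification check; the normalization, the author-overlap rule and the status labels are simplified assumptions, not the rules used in refcat.

    # Sketch: derive a slug key from a title and verify a candidate pair.
    # Normalization, overlap rule and status labels are simplified assumptions.
    import re

    def slug_key(title):
        # Lowercase and drop everything except letters and digits.
        return re.sub(r"[^a-z0-9]", "", title.lower())

    def verify(a, b):
        if slug_key(a["title"]) != slug_key(b["title"]):
            return "different"
        surnames_a = {n.split()[-1].lower() for n in a.get("authors", []) if n.strip()}
        surnames_b = {n.split()[-1].lower() for n in b.get("authors", []) if n.strip()}
        if surnames_a & surnames_b:
            return "strong"      # same work, possibly different versions
        return "ambiguous"       # same title but no shared author surname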
With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. The aspects of precision
and recall are addressed by the two stages: we are generous in the match
candidate generation phase in order to improve recall, but we are strict during
-verification, in order to control precision.
+verification, in order to control precision. Quality assurance for verification is
+implemented through a growing list of test cases: real examples from the catalog together with
+their expected or desired match status\footnote{The list can be found under:
+\url{https://gitlab.com/internetarchive/cgraph/-/blob/master/skate/testdata/verify.csv}.
+It is helpful to keep this test suite independent of any specific programming language.}.
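A minimal sketch of replaying such a test suite against a verification function, assuming each row carries two record identifiers and an expected status; the actual column layout of verify.csv may differ.

    # Sketch: replay recorded verification cases against a verify() function.
    # The assumed columns (ident_a, ident_b, expected) may not match verify.csv.
    import csv

    def run_cases(path, lookup, verify):
        failures = 0
        with open(path, newline="") as handle:
            for ident_a, ident_b, expected in csv.reader(handle):
                if verify(lookup(ident_a), lookup(ident_b)) != expected:
                    failures += 1
        return failures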
\section{Future Work}
@@ -264,7 +269,7 @@ Foundation}. We would like to thank various teams at the Internet Archive for
providing necessary infrastructure and data processing expertise. We are
also indebted to various open source software tools and their maintainers as
well as open scholarly data projects - without those this work would be much
-harder or not possible at all.
+harder, if possible at all.