-rw-r--r-- | docs/Simple/main.pdf | bin | 97848 -> 99828 bytes | |||
-rw-r--r-- | docs/Simple/main.tex | 107 |
2 files changed, 70 insertions, 37 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
Binary files differ
index 8fe89a9..469fa95 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index e88e6fd..36f2074 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -42,13 +42,14 @@ bnewbold@archive.org \\
 As part of its scholarly data efforts, the Internet Archive releases a citation
 graph dataset (ASREF) derived from scholarly publications and additional data
 sources. It is composed of data gathered by the fatcat cataloging
-project\footnote{\url{https://fatcat.wiki}} and related web-scale crawls
-targeting primary and secondary scholarly outputs. In addition, relations are
-worked out between scholarly publications, web pages and their archived copies,
-books from the Open Library project as well as Wikipedia articles. This first
-version of the graph consists of over X nodes and over Y edges. We release this
-dataset under a Z open license under the collection as an archive
-item\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
+project\footnote{\url{https://fatcat.wiki}}, related web-scale crawls targeting
+primary and secondary scholarly outputs, as well as metadata from the Open
+Library\footnote{\url{https://openlibrary.org}} project, information about
+archived web pages found in the Wayback
+Machine\footnote{\url{https://web.archive.org}} and
+Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
+graph consists of 1,323,423,672 citations. We release this dataset under a CC0 Public Domain Dedication, accessible through an
+archive collection\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
 used in the derivation process is releases under an MIT
 license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
 \end{abstract}
@@ -60,9 +61,11 @@ license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
 The Internet Archive releases a first version of a citation graph dataset
 derived from a raw corpus of about 2.5B references gathered from metadata and
-from data obtained by PDF extraction tools such as GROBID\cite{lopez2009grobid}.
+from data obtained by PDF extraction tools such as
+GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
+metadata from Open Library, the Wayback Machine and Wikipedia.
 The goal of this report is to describe briefly the current contents and the
-derivation of the Archive Scholar Citations Dataset (ASC). We expect
+derivation of the Archive Scholar Reference Dataset (ASREF). We expect
 this dataset to be iterated upon, with changes both in content and processing.
 Modern citation indexes can be traced back to the early computing age, when
@@ -94,9 +97,9 @@ publications\footnote{\url{http://wikicite.org/statistics.html}}.
 Microsoft Academic Graph\footnote{A recent copy has been preserved at
 \url{https://archive.org/details/mag-2021-06-07}} is comprised of a number of
 entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
-with PaperReferences being one relation among many others. As of 2021-06-07 the
-PaperReferences relation contains 1,832,226,781 edges across YYY bibliographic
-entities.
+with \emph{PaperReferences} being one relation among many others. As of 2021-06-07 the
+\emph{PaperReferences} relation contains 1,832,226,781 edges across 123,923,466
+bibliographic entities.
 Numerous other projects have been or are concerned with various aspects of
 citation discovery and curation, among them Semantic Scholar, CiteSeerX or
@@ -108,15 +111,45 @@ citations is not expected to shrink in the future.
 \section{Citation Dataset}
-We release the first version of the ASREF dataset in an format used internally
-for storage and display (and which we call \emph{biblioref}). The format
-contains source and target fatcat release and work identifiers, as well as few
-attributes from the metadata (such as year or release stage) as well as
+We release the first version of the Archive Scholar Reference (ASREF) dataset
+in an format used internally for storage and to serve queries (and which we
+call \emph{biblioref} or \emph{bref} for short). The dataset includes metadata
+from fatcat and the Open Library Project, links to archived pages in
+the Wayback Machine as well as inbound links from the English Wikipedia.
+
+The format contains source and target (fatcat release and work) identifiers, a
+few attributes from the metadata (such as year or release stage) as well as
 information about the match provenance (like match status or reason). For ease
 of use, we include DOI as well, if available.
-The dataset currently contains X unique bibliographic entities and Y citations.
+The dataset currently contains 1,323,423,672 citations across 76,327,662
+entities (55,123,635 unique source and 60,244,206 unique target work identifiers).
+The majority of matches - 1,250,523,321 - are established through identifier
+based matching (DOI, PMIC, PMCID, ARXIV, ISBN). 72,900,351 citations are
+established through fuzzy matching.
+
+The majority of DOI based matches between ASREF and COCI overlap, as can be
+seen in~\ref{table:cocicmp}.
+
+\begin{table}[]
+ \begin{center}
+ \begin{tabular}{ll}
+\toprule
+\bf{Dataset} & \bf{Count} \\
+\midrule
+ COCI (C) & 1,094,394,688 \\
+ ASREF-DOI (A) & 1,303,589,144 \\
+ C $\cap$ A & \\
+ C $\cup$ A & \\
+ C $\setminus$ A & \\
+ A $\setminus$ C &
+ \end{tabular}
+ \vspace*{2mm}
+ \caption{Comparison between COCI and ASREF-DOI, a subset of ASREF with DOI.}
+ \label{table:cocicmp}
+ \end{center}
+\end{table}
 TODO: how matches are established and a short note on overlap with COCI DOI.
@@ -144,22 +177,22 @@ Table~\ref{table:fields}.
 \begin{center}
 \begin{tabular}{ll}
 \toprule
- \bf{Fields} & \bf{Share} \\
+ \bf{Fields} & \bf{Percentage} \\
 \midrule
 \multicolumn{1}{l}{CN $\cdot$ RN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 14\% \\
- \multicolumn{1}{l}{DOI} & 14\% \\
+ \multicolumn{1}{l}{\textbf{DOI}} & 14\% \\
 \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\% \\
- \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ U $\cdot$ V $\cdot$ Y} & 4\% \\
- \multicolumn{1}{l}{PMID $\cdot$ U} & 4\% \\
- \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ T $\cdot$ V $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{\textbf{PMID} $\cdot$ U} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y} & 4\% \\
 \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y} & 4\% \\
- \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ V $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y} & 4\% \\
 \end{tabular}
 \vspace*{2mm}
 \caption{Top 8 combinations of available fields in raw reference data accounting
 for about 53\% of the total data (CN = container name, CRN = contrib raw name,
 P = pages, T = title, U = unstructured, V = volume, IS =
-issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value.}
+issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value. Identifiers emphasized.}
 \label{table:fields}
 \end{center}
 \end{table}
@@ -189,10 +222,6 @@ candidate generation phase in order to improve recall, but we are strict during
 verification, in order to control precision.
-\section{Fuzzy Matching Approach}
-\section{Quality Assurance}
-
-
 \section{Future Work}
 As other dataset in this field we expect this dataset to be iterated upon.
@@ -200,23 +229,27 @@ As other dataset in this field we expect this dataset to be iterated upon.
 \begin{itemize}
 \item The fatcat catalog updates its metadata
 continously\footnote{A changelog can currenly be followed here:
- \url{fatcat.wiki/changelog}} and web crawls are conducted regularly.
- Current processing pipelines cover raw reference snapshot creation and
- derivation the graph structure.
+ \url{https://fatcat.wiki/changelog}} and web crawls are conducted
+ regularly. Current processing pipelines cover raw reference snapshot
+ creation and derivation the graph structure, which allows to rerun
+ processing based on updated data as it becomes available.
- \item Metadata extraction from PDFs depends on machine learning
- models, which in turn depend training sets. With additional crawls and
+ \item Metadata extraction from PDFs depends on supervised machine learning
+ models, which in turn depends training sets. With additional crawls and
 metadata available we hope to improve models used for metadata
- extraction, reducing data extraction artifacts in the process.
+ extraction, improving yield and reducing data extraction artifacts in
+ the process.
- \item As of this version, a significant number of raw reference
- docs remain unmatched, which means that neither exact or fuzzy matching
- can recover a link to a known entity. On the one
+ \item As of this version, a number of raw reference
+ docs remain unmatched, which means that neither exact nor fuzzy matching
+ can detect a link to a known entity. On the one
 hand, this can hint at missing metadata. However, parts of the data
 will contain a reference to a catalogued entity, but in a specific,
 dense and harder to recover form.
+ This also include improvements to fuzzy matching code.
 \end{itemize}
+
 \section{Acknowledgements}
 This work is partially supported by a grant from the \emph{Andrew W. Mellon