author | Martin Czygan <martin.czygan@gmail.com> | 2021-08-08 15:18:29 +0200
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-08-08 15:18:29 +0200
commit | bd66b58cded2c2c7e7b7e5d374434d6531dd70de (patch)
tree | 00417812b9787ab4492e2c590fcf1bf6f4b576e7 /docs/TR-20210730212057-IA-WDS-CG/main.tex
parent | bb64b3aa62267676302e75f0ca44157b514beec4 (diff)
docs: cleanup and naming
Diffstat (limited to 'docs/TR-20210730212057-IA-WDS-CG/main.tex')
-rw-r--r-- | docs/TR-20210730212057-IA-WDS-CG/main.tex | 442
1 file changed, 0 insertions, 442 deletions
diff --git a/docs/TR-20210730212057-IA-WDS-CG/main.tex b/docs/TR-20210730212057-IA-WDS-CG/main.tex
deleted file mode 100644
index a7edac3..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/main.tex
+++ /dev/null
@@ -1,442 +0,0 @@
\documentclass{article}

\usepackage{arxiv}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}

\title{Internet Archive Scholar Citation Graph Dataset}

\date{August 10, 2021} % Here you can change the date presented in the paper title
%\date{} % Or remove it

\author{ Martin Czygan \\
	Internet Archive\\
	San Francisco, CA 94118 \\
	\texttt{martin@archive.org} \\
	\And
	Bryan Newbold \\
	Internet Archive\\
	San Francisco, CA 94118 \\
	\texttt{bnewbold@archive.org} \\
	% \And
	% Helge Holzmann \\
	% Internet Archive\\
	% San Francisco, CA 94118 \\
	% \texttt{helge@archive.org} \\
	% \And
	% Jefferson Bailey \\
	% Internet Archive\\
	% San Francisco, CA 94118 \\
	% \texttt{jefferson@archive.org} \\
}

% Uncomment to remove the date
%\date{}

% Uncomment to override the `A preprint' in the header
\renewcommand{\headeright}{Technical Report}
\renewcommand{\undertitle}{Technical Report}

%%% Add PDF metadata to help others organize their library
%%% Once the PDF is generated, you can check the metadata with
%%% $ pdfinfo template.pdf
\hypersetup{
pdftitle={Internet Archive Scholar Citation Graph Dataset},
pdfsubject={cs.DL, cs.IR},
pdfauthor={Martin Czygan, Bryan Newbold, Helge Holzmann, Jefferson Bailey},
pdfkeywords={Web Archiving, Citation Graph},
}

\begin{document}
\maketitle

\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a citation
graph dataset derived from scholarly publications and additional data sources.
It is composed of data gathered by the \href{https://fatcat.wiki}{fatcat
cataloging project} and related web-scale crawls targeting primary and
secondary scholarly outputs. In addition, relations are worked out between
scholarly publications, web pages and their archived copies, books from the
Open Library project, as well as Wikipedia articles. This first version of the
graph consists of over X nodes and over Y edges. We release this dataset under
a Z open license in the collection at
\href{https://archive.org/details/TODO-citation\_graph}{https://archive.org/details/TODO-citation\_graph},
as well as all code used for derivation under an MIT license.
\end{abstract}

% keywords can be removed
\keywords{Citation Graph \and Scholarly Communications \and Web Archiving}

\section{Introduction}

The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
from data obtained by PDF extraction tools such as GROBID~\citep{lopez2009grobid}.
The goal of this report is to briefly describe the current contents and the
derivation of the Archive Scholar Citations Dataset (ASC). We expect
this dataset to be iterated upon, with changes both in content and processing.

Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
were first devised; they live on in commercial knowledge bases today.
Open alternatives were started, such as the Open Citations Corpus (OCC) in
2010, the first version of which contained 6,325,178 individual
references~\citep{shotton2013publishing}. Other notable sources from that time
include CiteSeerX~\citep{wu2019citeseerx} and CitEc~\citep{CitEc}. The last
decade has seen an increase in openly available reference datasets and
citation projects, like Microsoft Academic~\citep{sinha2015overview} and the
Initiative for Open Citations~\citep{i4oc, shotton2018funders}. According to
\citet{hutchins2021tipping}, over 1B citations were publicly available in
2021, marking a tipping point for open citations.

\section{Citation Graph Contents}

% * edges
% * edges exact
% * edges fuzzy
% * edges fuzzy reason (table)
% * number of source docs
% * number of target docs
% * refs to papers
% * refs to books
% * refs to web pages
% * refs to web pages that have been archived
% * refs to web pages that have been archived but not on liveweb any more
%
% Overlaps
%
% * how many edges can be found in COCI as well
% * how many edges can be found in MAG as well
% * how many unique to us edges
%
% Additional numbers
%
% * number of unparsed refs
% * "biblio" field distribution of unparsed refs
%
% Potential routes
%
% * journal abbreviation parsing with suffix arrays
% * lookup by name, year and journal

\section{System Design}

The constraints for the system design are informed by the volume and the
variety of the data. In total, the raw inputs amount to a few TB of textual
content, mostly newline-delimited JSON. More importantly, while the number of
data fields is low, many records are only partially filled, with hundreds of
different combinations of available field values found in the raw reference
data. This is most likely caused by aggregators passing on reference data
coming from hundreds of sources, which do not necessarily agree on a common
granularity for citation data, and by artifacts of machine-learning-based
structured data extraction tools.

Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data comes
from a set of eight field manifestations, as listed in
Table~\ref{table:fields}; a sketch of deriving such a field signature follows
below.
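To make the field manifestations concrete, the following minimal Python
sketch derives a signature from a raw reference record. The field names and
the field-to-code mapping are illustrative assumptions mirroring
Table~\ref{table:fields}, not the exact refcat implementation.

\begin{verbatim}
import json

# Short codes for fields, mirroring Table 1 (illustrative mapping).
CODES = {
    "container_name": "CN", "contrib_raw_name": "CRN", "pages": "P",
    "title": "T", "unstructured": "U", "volume": "V", "issue": "IS",
    "year": "Y", "doi": "DOI", "pmid": "PMID",
}

def field_signature(ref):
    """Pipe-delimited, sorted codes of all non-empty fields."""
    return "|".join(sorted(c for f, c in CODES.items() if ref.get(f)))

line = '{"title": "On Citations", "doi": "10.1234/x", "year": "2001"}'
print(field_signature(json.loads(line)))  # -> DOI|T|Y
\end{verbatim}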
\begin{table}[]
	\begin{center}
		\begin{tabular}{ll}
			\toprule
			\textbf{Fields} & \textbf{Share} \\
			\midrule
			CN|CRN|P|T|U|V|Y & 14\% \\
			DOI & 14\% \\
			CN|CRN|IS|P|T|U|V|Y & 5\% \\
			CN|CRN|DOI|U|V|Y & 4\% \\
			PMID|U & 4\% \\
			CN|CRN|DOI|T|V|Y & 4\% \\
			CN|CRN|Y & 4\% \\
			CN|CRN|DOI|V|Y & 4\% \\
			\bottomrule
		\end{tabular}
		\vspace*{2mm}
		\caption{Top 8 combinations of available fields in raw reference data,
accounting for about 53\% of the total data (CN = container name, CRN =
contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
issue, Y = year, DOI = DOI, PMID = PubMed ID). Unstructured fields may
contain any value.}
		\label{table:fields}
	\end{center}
\end{table}

Overall, a map-reduce style approach is followed, which allows for some
uniformity in the overall processing. We extract (key, document) tuples (as
TSV) from the raw JSON data and sort by key. We then group documents sharing a
key and apply a function to each group in order to generate our target schema
(currently named biblioref, or bref for short) or to perform additional
operations (such as deduplication).

The key derivation can be exact (based on an identifier such as DOI or PMID)
or based on a normalization procedure, like a slugified title string. For
identifier-based matches we can generate the target biblioref schema directly.
For fuzzy matching candidates, we pass possible match pairs through a
verification procedure, which is implemented for release entity schema pairs.
The current verification procedure is a domain-dependent, rule-based
verification, able to identify different versions of a publication,
preprint-published pairs, or other kinds of similar documents by calculating
similarity metrics across title and authors. The fuzzy matching approach is
applied to all reference documents which have a title, but no identifier.

With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. Precision and recall
concerns are addressed by the two stages: we are generous in the match
candidate generation phase in order to improve recall, but we are strict
during verification, in order to control precision.

\section{Fuzzy Matching Approach}

% Take sample of 100 docs, report some precision, recall, F1 on a hand curated
% small subset.

The fuzzy matching approach currently implemented works in two phases: match
candidate generation and verification. For candidate generation, we map each
document to a key. We implemented a number of algorithms to form these
clusters, e.g. title normalizations (including lowercasing, whitespace
removal, unicode normalization and other measures) or transformations like
NYSIIS~\citep{silbert1970world}.

The verification approach is based on a set of rules, which are tested
sequentially, yielding a match signal from weak to exact. We use a suite of
over 300 manually curated match examples\footnote{The table can be found here:
\href{https://gitlab.com/internetarchive/fuzzycat/-/blob/master/tests/data/verify.csv}{https://gitlab.com/internetarchive/fuzzycat/-/blob/master/tests/data/verify.csv}}
as part of a unit test suite to allow for a controlled, continuous adjustment
of the verification procedure. If the verification yields either an exact or
strong signal, we consider it a match; a sketch of both phases follows below.
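The following self-contained Python sketch illustrates the two phases at a
toy scale. The clustering key and the two rules shown are simplified
stand-ins for the actual fuzzycat implementation and its much larger rule
set.

\begin{verbatim}
import itertools
import unicodedata
from collections import defaultdict

def title_key(title):
    """Clustering key: lowercase, unicode-normalize, keep alphanumerics."""
    s = unicodedata.normalize("NFKD", title.lower())
    return "".join(c for c in s if c.isalnum())

def verify(a, b):
    """Test rules from strongest to weakest; return a match signal."""
    if a.get("doi") and a.get("doi") == b.get("doi"):
        return "exact"
    if set(a.get("authors", ())) & set(b.get("authors", ())):
        return "strong"
    return "weak"

def matches(docs):
    clusters = defaultdict(list)
    for doc in docs:                   # candidate generation: favor recall
        clusters[title_key(doc["title"])].append(doc)
    for group in clusters.values():    # verification: favor precision
        for a, b in itertools.combinations(group, 2):
            if verify(a, b) in ("exact", "strong"):
                yield a, b
\end{verbatim}

Keeping the two phases separate makes the recall/precision trade-off
explicit: a looser key only grows the candidate groups, while the verifier
alone decides which pairs become edges.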
We try to keep the processing steps performant in order to limit the overall
derivation time. Map and reduce operations are parallelized, and certain
processing steps can handle 100K documents per second or more on commodity
hardware with spinning disks.

\section{Quality Assurance}

Understanding data quality is important, as the data comes from a myriad of
sources, each with possible idiosyncratic features or missing values. We
employ a few QA measures during the process. First, we try to pass each data
item through only one processing pipeline (e.g. items matched by any
identifier should not even be considered for fuzzy matching). If duplicate
links appear in the final dataset nonetheless, we remove them, preferring
exact over fuzzy matches.

We employ a couple of data cleaning techniques, e.g. to find and verify
identifiers like ISBN or to sanitize URLs found in the data. Many of these
artifacts stem from the fact that large chunks of the raw data come from
heuristic data extraction from PDF documents; one such check is sketched
below.
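As an example of such a cleaning step, an ISBN-13 candidate extracted from a
reference can be validated via its standard checksum before it is used for
matching. The helper below is a hypothetical sketch, not necessarily the
exact check used in the pipeline.

\begin{verbatim}
import re

def valid_isbn13(candidate):
    """Checksum test for ISBN-13: digit weights alternate 1 and 3."""
    digits = re.sub(r"[^0-9]", "", candidate)
    if len(digits) != 13:
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(digits))
    return total % 10 == 0

assert valid_isbn13("978-0-306-40615-7")      # valid checksum
assert not valid_isbn13("978-0-306-40615-8")  # corrupted last digit
\end{verbatim}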
\section{Discussion}

% need to iterate


\bibliographystyle{unsrtnat}
\bibliography{references}


\section{Appendix}

\begin{table}[]
	\begin{center}
\begin{tabular}{@{}rlll@{}}
\toprule
\textbf{Number of matches} & \textbf{Citation Provenance} & \textbf{Match Status} & \textbf{Match Reason} \\ \midrule
934932865 & crossref & exact & doi \\
151366108 & fatcat-datacite & exact & doi \\
65345275 & fatcat-pubmed & exact & pmid \\
48778607 & fuzzy & strong & jaccardauthors \\
42465250 & grobid & exact & doi \\
29197902 & fatcat-pubmed & exact & doi \\
19996327 & fatcat-crossref & exact & doi \\
11996694 & fuzzy & strong & slugtitleauthormatch \\
9157498 & fuzzy & strong & tokenizedauthors \\
3547594 & grobid & exact & arxiv \\
2310025 & fuzzy & exact & titleauthormatch \\
1496515 & grobid & exact & pmid \\
680722 & crossref & strong & jaccardauthors \\
476331 & fuzzy & strong & versioneddoi \\
449271 & grobid & exact & isbn \\
230645 & fatcat-crossref & strong & jaccardauthors \\
190578 & grobid & strong & jaccardauthors \\
156657 & crossref & exact & isbn \\
123681 & fatcat-pubmed & strong & jaccardauthors \\
79328 & crossref & exact & arxiv \\
57414 & crossref & strong & tokenizedauthors \\
53480 & fuzzy & strong & pmiddoipair \\
52453 & fuzzy & strong & dataciterelatedid \\
47119 & grobid & strong & slugtitleauthormatch \\
36774 & fuzzy & strong & arxivversion \\
35311 & fuzzy & strong & customieeearxiv \\
33863 & grobid & exact & pmcid \\
23504 & crossref & strong & slugtitleauthormatch \\
22753 & fatcat-crossref & strong & tokenizedauthors \\
17720 & grobid & exact & titleauthormatch \\
14656 & crossref & exact & titleauthormatch \\
14438 & grobid & strong & tokenizedauthors \\
7682 & fatcat-crossref & exact & arxiv \\
5972 & fatcat-crossref & exact & isbn \\
5525 & fatcat-pubmed & exact & arxiv \\
4290 & fatcat-pubmed & strong & tokenizedauthors \\
2745 & fatcat-pubmed & exact & isbn \\
2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
2273 & fatcat-crossref & strong & slugtitleauthormatch \\
1960 & fuzzy & exact & workid \\
1150 & fatcat-crossref & exact & titleauthormatch \\
1041 & fatcat-pubmed & exact & titleauthormatch \\
895 & fuzzy & strong & figshareversion \\
317 & fuzzy & strong & titleartifact \\
82 & grobid & strong & titleartifact \\
33 &
crossref & strong & titleartifact \\
5 & fuzzy & strong & custombsiundated \\
1 & fuzzy & strong & custombsisubdoc \\
1 & fatcat & exact & doi \\ \bottomrule
\end{tabular}
	\vspace*{2mm}
	\caption{Match counts by citation provenance, match status and match
reason. Each match reason identifier encodes a specific rule in the
domain-dependent verification process; the identifiers are included for
completeness, although we do not include the details of each rule in this
report.}
	\label{table:matches}
	\end{center}
\end{table}


\end{document}