Diffstat (limited to 'docs')
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin | 97899 -> 104607 bytes
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 78
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib | 52
3 files changed, 115 insertions, 15 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 6ff65d0..608cf96 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 2a60a77..e2f59a0 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -79,8 +79,8 @@ projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
 were first devised, living on in existing commercial knowledge bases today.
 Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
 - the first version of which contained 6,325,178 individual
-references~\citep{shotton2013publishing}. Other notable early projects
-include CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+references~\citep{shotton2013publishing}. Other notable projects
+include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
 decade has seen the emergence of more openly available, large scale citation
 projects like Microsoft Academic~\citep{sinha2015overview} or the Initiative
 for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
@@ -94,22 +94,22 @@ manually or through more automated methods, such as metadata access and
 structured data extraction from full text documents; the latter offering the
 benefits of scalability. The completeness of bibliographic metadata ranges
 from documents with one or more persistent identifiers to raw, potentially unclean
-strings partially describing a publication.
+strings partially describing a scholarly artifact.
 
 \section{Related Work}
 
-Typical problems arising in the process of compiling a citation graph dataset
-are data acquisition and citation matching. Data acquisition itself can take
-different forms: bibliographic metadata can contain explicit reference data as
-provided by publishers and aggregators; this data can be relatively consistent
-when looked at per source, but may vary in style and comprehensiveness when
-looked at as a whole. Another way of acquiring bibliographic metadata is to
-analyze a source document, such as a PDF (or its text), directly. Tools in this
-category are often based on conditional random
+Two typical problems that arise in the process of compiling a citation graph
+dataset concern data acquisition and citation matching. Data acquisition
+itself can take different forms: bibliographic metadata can contain explicit
+reference data as provided by publishers and aggregators; this data can be
+relatively consistent when looked at per source, but may vary in style and
+comprehensiveness when looked at as a whole. Another way of acquiring
+bibliographic metadata is to analyze a source document, such as a PDF (or its
+text), directly. Tools in this category are often based on conditional random
 fields~\citep{lafferty2001conditional} and have been implemented in projects
 such as ParsCit~\citep{councill2008parscit},
-Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite}
-or GROBID~\citep{lopez2009grobid}.
+Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite} or
+GROBID~\citep{lopez2009grobid}.
 
 The problem of citation matching is relatively simple when common, persistent
 identifiers are present in the data. Complications mount when there is
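Matching via persistent identifiers, as the hunk above describes, essentially reduces to key normalization followed by an equality join. A minimal Python sketch: the record fields and the normalization rules (lowercasing, stripping resolver prefixes) are illustrative assumptions, not the exact procedure used in the paper.

    # Sketch of exact citation matching on a persistent identifier (DOI).
    # Field names ("doi") and prefix list are assumptions for illustration.
    def normalize_doi(raw):
        """Lowercase and strip common resolver prefixes from a DOI string."""
        doi = raw.strip().lower()
        for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
            if doi.startswith(prefix):
                doi = doi[len(prefix):]
        return doi

    def match_by_doi(refs, catalog):
        """Join reference entries to catalog records on normalized DOI."""
        index = {normalize_doi(rec["doi"]): rec
                 for rec in catalog if rec.get("doi")}
        for ref in refs:
            if ref.get("doi") and normalize_doi(ref["doi"]) in index:
                yield ref, index[normalize_doi(ref["doi"])]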
@@ -123,7 +123,20 @@ citation matching process is done at scale~\citep{fedoryszak2013large}.
 The problem of heterogeneity has been discussed in the context of datasets
 by~\citep{mathiak2015challenges}.
 
+Projects centered around citations or containing citation data as a core
+component include COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
+citations'', which was first released
+2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
+regularly updated~\citep{peroni2020opencitations}.
+The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+``a Wikimedia initiative to develop open citations and linked bibliographic
+data to serve free knowledge'', continuously adds citations to its
+database\footnote{\url{http://wikicite.org/statistics.html}}.
+
+Microsoft Academic Graph~\citep{sinha2015overview} comprises a number of
+entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+with \emph{PaperReferences} being one relation among many others.
 
 % There are a few large scale citation dataset available today. COCI, the
@@ -215,6 +228,8 @@ seen in~Table~\ref{table:cocicmp}.
 
 \section{System Design}
 
+\subsection{Constraints}
+
 The constraints for the system design are informed by the volume and the
 variety of the data. The capability to run the whole graph derivation on a
 single machine was a minor goal as well. In total, the raw inputs amount to a
@@ -256,6 +271,26 @@ Table~\ref{table:fields}.
 \end{center}
 \end{table}
 
+\subsection{Data Sources}
+
+Reference data comes from two main sources: explicit bibliographic metadata and
+PDF extraction. The bibliographic metadata is taken from fatcat, which itself
+harvests and imports web accessible sources such as Crossref, Pubmed, Arxiv,
+Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
+is processed continuously or in batches). Reference data from PDF documents has
+been extracted with GROBID\footnote{GROBID v0.5.5}, with the TEI-XML results
+being cached locally in a key-value store accessible with an S3 API. Archived
+PDF documents result from dedicated web-scale crawls of scholarly domains
+conducted with
+Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} and a
+variety of seed lists targeting journal homepages, repositories, dataset
+providers, aggregators, web archives and other venues. A processing pipeline
+merges catalog data from the primary database and cached values in key-value
+stores and generates the set of about 2.5B reference documents, which
+currently serves as an input for the citation graph derivation pipeline.
+
+\subsection{Methodology}
+
 Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
 followed\footnote{While the operations are similar, the processing is not
 distributed but runs on a single machine. For space efficiency,
 zstd~\citep{collet2018zstandard} is used to compress raw data and
 derivations.}, which allows
@@ -264,7 +299,7 @@ uniformity in the overall processing. We extract
 (key, document) tuples (as TSV) from the raw JSON data and sort by key. We
 then group documents with the same key and apply a function on each group in
 order to generate our target schema or perform
-additional operations such as deduplication or fusion of matched and unmatched references.
+additional operations such as deduplication or fusion of matched and unmatched references for indexing.
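The extract/sort/group steps added in this hunk follow a classic pattern that can be sketched compactly. A hypothetical Python version: the key rule (identifier if present, otherwise a slugified title, as the next hunk describes) and all field names are assumptions, not the project's actual schema or tooling.

    # Sketch of the map-reduce style derivation: a map stage emits
    # (key, json_document) TSV tuples, an external sort orders them by key,
    # and a reduce stage groups runs of equal keys and applies a function
    # per group. Field names and the slug rule are illustrative assumptions.
    import itertools
    import json
    import re
    import sys

    def derive_key(doc):
        """Exact key via identifier when present, else a slugified title."""
        if doc.get("doi"):
            return doc["doi"].strip().lower()
        title = doc.get("title", "")
        return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

    def map_stage():
        """Emit (key, document) TSV tuples; sorted externally by key."""
        for line in sys.stdin:
            doc = json.loads(line)
            print(derive_key(doc), line.rstrip("\n"), sep="\t")

    def reduce_stage():
        """Group sorted tuples by key and apply a per-group function."""
        rows = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for key, group in itertools.groupby(rows, key=lambda kv: kv[0]):
            docs = [json.loads(doc) for _, doc in group]
            # Placeholder per-group function: emit key and group size.
            print(key, len(docs), sep="\t")

    if __name__ == "__main__":
        {"map": map_stage, "reduce": reduce_stage}[sys.argv[1]]()

A run would then be a shell pipeline in the spirit of the text, e.g. `python pipeline.py map < refs.json | LC_ALL=C sort -k1,1 | python pipeline.py reduce`, with zstd applied to the intermediate files as the footnote notes.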
 The key derivation can be exact (via an identifier like DOI, PMID, etc.) or
 based on a value normalization, like ``slugifying'' a title string. For identifier
@@ -277,6 +312,18 @@ are similar by various metrics calculated over title and author fields. The
 fuzzy approach is applied on all reference documents without identifier (a
 title is currently required).
 
+We currently implement performance sensitive parts in
+Go\footnote{\url{https://golang.org/}}, with various processing stages (e.g.
+conversion, map, reduce, ...) represented by separate command line tools. A
+thin task orchestration layer using the luigi
+framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
+  which has been used in various scientific pipeline
+  applications, like~\citep{schulz2016use},~\citep{erdmann2017design},~\citep{lampa2019scipipe},~\citep{czygan2014design}
+  and others.} allows for experimentation in the pipeline and for single command
+derivations, as data dependencies are encoded with the help of the
+orchestrator. Within the tasks, we also utilize classic platform tools such as
+sort~\citep{mcilroy1971research}.
+
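The division of labor in the added paragraph, Go tools for the heavy stages and luigi for the wiring, can be pictured as a toy task graph. A hypothetical sketch: task names, file names and the shell command are invented for illustration; only the pattern (dependencies encoded in requires(), classic sort(1) plus zstd doing the work) comes from the text.

    # Hypothetical luigi wiring: data dependencies live in requires(), so a
    # single command rebuilds any missing upstream derivation. Task names,
    # file names and the shell command are illustrative assumptions.
    import subprocess

    import luigi

    class ExtractRefs(luigi.ExternalTask):
        """Stands in for an upstream extraction stage (e.g. a Go CLI tool)."""
        def output(self):
            return luigi.LocalTarget("refs-keyed.tsv.zst")

    class SortByKey(luigi.Task):
        """Sort (key, document) tuples using classic platform tools."""
        def requires(self):
            return ExtractRefs()

        def output(self):
            return luigi.LocalTarget("refs-sorted.tsv.zst")

        def run(self):
            # zstd for space efficiency, sort(1) for the heavy lifting.
            cmd = "zstd -cd {} | LC_ALL=C sort -k1,1 | zstd -c > {}".format(
                self.input().path, self.output().path)
            subprocess.run(cmd, shell=True, check=True)

    if __name__ == "__main__":
        luigi.build([SortByKey()], local_scheduler=True)

Because completeness is judged by output() targets, rerunning after an upstream update rebuilds only the missing derivations, which fits the single-command rerun workflow the paragraph describes.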
 With a few schema conversions, fuzzy matching can be applied to Wikipedia
 articles and Open Library (edition) records as well. The aspects of precision
 and recall are represented by the two stages: we are generous in the match
@@ -288,6 +335,7 @@ their expected or desired match status\footnote{The list can be found under:
 It is helpful to keep this test suite independent of any specific programming
 language.}.
 
+
 \section{Limitations and Future Work}
 
 As with other datasets in this field, we expect this dataset to be iterated upon.
@@ -295,7 +343,7 @@
 \begin{itemize}
     \item The fatcat catalog updates its metadata continuously\footnote{A
         changelog can currently be followed here:
-        \url{https://fatcat.wiki/changelog}} and web crawls are conducted
+        \url{https://fatcat.wiki/changelog}.} and web crawls are conducted
         regularly. Current processing pipelines cover raw reference snapshot
         creation and derivation of the graph structure, which allows rerunning
         processing based on updated data as it becomes available.
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
index 9cfb32b..e679974 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
@@ -347,3 +347,55 @@
 year={1998}
 }
 
+@article{schulz2016use,
+  title={Use of application containers and workflows for genomic data analysis},
+  author={Schulz, Wade L and Durant, Thomas JS and Siddon, Alexa J and Torres, Richard},
+  journal={Journal of Pathology Informatics},
+  volume={7},
+  year={2016},
+  publisher={Wolters Kluwer--Medknow Publications}
+}
+
+@inproceedings{erdmann2017design,
+  title={Design and Execution of make-like, distributed Analyses based on Spotify's Pipelining Package Luigi},
+  author={Erdmann, M and Fischer, B and Fischer, R and Rieger, M},
+  booktitle={Journal of Physics: Conference Series},
+  volume={898},
+  number={7},
+  pages={072047},
+  year={2017},
+  organization={IOP Publishing}
+}
+
+@misc{bernhardsson2018rouhani,
+  title={spotify/luigi - GitHub},
+  author={Bernhardsson, E and Freider, E},
+  year={2018}
+}
+
+@article{lampa2019scipipe,
+  title={SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines},
+  author={Lampa, Samuel and Dahl{\"o}, Martin and Alvarsson, Jonathan and Spjuth, Ola},
+  journal={GigaScience},
+  volume={8},
+  number={5},
+  pages={giz044},
+  year={2019},
+  publisher={Oxford University Press}
+}
+
+@article{czygan2014design,
+  title={Design and implementation of a library metadata management framework and its application in fuzzy data deduplication and data reconciliation with authority data},
+  author={Czygan, Martin},
+  journal={Informatik 2014},
+  year={2014},
+  publisher={Gesellschaft f{\"u}r Informatik eV}
+}
+
+@techreport{mcilroy1971research,
+  title={A Research Unix reader: annotated excerpts from the Programmer's Manual, 1971--1986},
+  author={McIlroy, M Douglas},
+  institution={AT\&T Bell Laboratories},
+  year={1987}
+}