author Martin Czygan <martin.czygan@gmail.com> 2021-09-07 20:45:35 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-09-07 20:45:35 +0200
commit 9a75e6d549d36b68e7f58c9c1494a6d89071bf90
tree 3040b1f8be3fe3ed33ed2134ecc44b086a1d5ce4
parent 022dd0cbeff3b80492556713c855df90b5384bf0
doc: tweaks
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf bin 104442 -> 104537 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex 18
2 files changed, 8 insertions, 10 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 933338a..e4c1361 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 682a3bc..e99ddc3 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -98,7 +98,7 @@ strings partially describing a scholarly artifact.
\section{Related Work}
-Two typical problems that arise in the process of compiling a citation graph
+Two typical problems which arise in the process of compiling a citation graph
dataset are related to data acquisition and citation matching. Data acquisition
itself can take different forms: bibliographic metadata can contain explicit
reference data as provided by publishers and aggregators; this data can be
@@ -127,14 +127,10 @@ Projects centered around citations or containing citation data as a core
component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
citations'', which was first released
2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
-regularly updated~\citep{peroni2020opencitations}.
-
-The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+regularly updated~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'' continuously adds citations to its
-database\footnote{\url{http://wikicite.org/statistics.html}}.
-
-Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+database\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others.
@@ -225,6 +221,8 @@ seen in~Table~\ref{table:cocicmp}.
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst
+% TODO: some more numbers on the structure
+
\section{System Design}
@@ -278,7 +276,7 @@ PDF extraction. The bibliographic metadata is taken from fatcat, which itself
harvests and imports web accessible sources such as Crossref, Pubmed, Arxiv,
Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
is processed continuously or in batches). Reference data from PDF documents has
-been extracted with GROBID\footnote{GROBID v0.5.5}, with the TEI-XML results
+been extracted with GROBID\footnote{GROBID \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the TEI-XML results
being cached locally in a key-value store accessible with an S3 API. Archived
PDF documents result from dedicated web-scale crawls of scholarly domains
conducted with
@@ -321,8 +319,8 @@ framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson201
application, like~\citep{schulz2016use},~\citep{erdmann2017design},~\citep{lampa2019scipipe},~\citep{czygan2014design}
and others.} allows for experimentation in the pipeline and for single command
derivations, as data dependencies are encoded with the help of the
-orchestrator. Within the tasks, we also utilize classic platfrom tools such as
-sort~\citep{mcilroy1971research}.
+orchestrator. Within the tasks, we also utilize classic platform tools such as
+\emph{sort}~\citep{mcilroy1971research}.
With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. The aspect of precision