doc: tweaks

author: Martin Czygan <martin.czygan@gmail.com> 2021-09-07 20:45:35 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-09-07 20:45:35 +0200
commit: 9a75e6d549d36b68e7f58c9c1494a6d89071bf90 (patch)
tree: 3040b1f8be3fe3ed33ed2134ecc44b086a1d5ce4 /docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
parent: 022dd0cbeff3b80492556713c855df90b5384bf0 (diff)
1 files changed, 8 insertions, 10 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 682a3bc..e99ddc3 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -98,7 +98,7 @@ strings partially describing a scholarly artifact.
 
 \section{Related Work}
 
-Two typical problems that arise in the process of compiling a citation graph
+Two typical problems which arise in the process of compiling a citation graph
 dataset are related to data aquisition and citation matching. Data acquisition
 itself can take different forms: bibliographic metadata can contain explicit
 reference data as provided by publishers and aggregators; this data can be
@@ -127,14 +127,10 @@ Projects centered around citations or containing citation data as a core
 component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
 citations'', which was first released
 2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
-regularly updated~\citep{peroni2020opencitations}.
-
-The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+regularly updated~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
 ``a Wikimedia initiative to develop open citations and linked bibliographic
 data to serve free knowledge'' continously adds citations to its
-database\footnote{\url{http://wikicite.org/statistics.html}}.
-
-Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+database\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
 entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
 with \emph{PaperReferences} being one relation among many others.
 
@@ -225,6 +221,8 @@ seen in~Table~\ref{table:cocicmp}.
 % zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
 % find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst
 
+% TODO: some more numbers on the structure
+
 
 \section{System Design}
 
@@ -278,7 +276,7 @@ PDF extraction. The bibliographic metadata is taken from fatcat, which itself
 harvests and imports web accessible sources such as Crossref, Pubmed, Arxiv,
 Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
 is processed continously or in batches). Reference data from PDF documents has
-been extracted with GROBID\footnote{GROBID v0.5.5}, with the TEI-XML results
+been extracted with GROBID\footnote{GROBID \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the TEI-XML results
 being cached locally in a key-value store accessible with an S3 API. Archived
 PDF documents result from dedicated web-scale crawls of scholarly domains
 conducted with
@@ -321,8 +319,8 @@ framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson201
 	application, like~\citep{schulz2016use},~\citep{erdmann2017design},~\citep{lampa2019scipipe},~\citep{czygan2014design}
 	and others.} allows for experimentation in the pipeline and for single command
 derivations, as data dependencies are encoded with the help of the
-orchestrator. Within the tasks, we also utilize classic platfrom tools such as
-sort~\citep{mcilroy1971research}.
+orchestrator. Within the tasks, we also utilize classic platform tools such as
+\emph{sort}~\citep{mcilroy1971research}.
 
 With a few schema conversions, fuzzy matching can be applied to Wikipedia
 articles and Open Library (edition) records as well. The aspect of precision