From 425a737c4635c96dad7a8fb296b4c1394874c54e Mon Sep 17 00:00:00 2001
From: Martin Czygan <martin.czygan@gmail.com>
Date: Sat, 7 Aug 2021 05:15:15 +0200
Subject: wip: paper

---
 docs/Simple/main.pdf | Bin 91896 -> 91566 bytes
 docs/Simple/main.tex |  15 +++++----------
 2 files changed, 5 insertions(+), 10 deletions(-)

(limited to 'docs/Simple')

diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 45cb024..3aac27e 100644
Binary files a/docs/Simple/main.pdf and b/docs/Simple/main.pdf differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 1796902..9efe9c2 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -65,7 +65,7 @@ data obtained by PDF extraction tools such as
 GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
 metadata from Open Library and Wikipedia.
 The goal of this report is to describe briefly the current contents and the
-derivation of the dataset (refcat). We expect
+derivation of the dataset. We expect
 this dataset to be iterated upon, with changes both in content and processing.
 
 Modern citation indexes can be traced back to the early computing age, when
@@ -75,8 +75,8 @@ Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
 - the first version of which contained 6,325,178 individual
 references\citep{shotton2013publishing}. Other notable early projects
 include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
-decade has seen the emergance of more openly available, large scale
-citation projects, like Microsoft Academic\citep{sinha2015overview} or the
+decade has seen the emergence of more openly available, large-scale
+citation projects like Microsoft Academic\citep{sinha2015overview} or the
 Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
 according to \citep{hutchins2021tipping} over 1B citations are publicly
 available, marking a tipping point for this category of data.
@@ -141,7 +141,6 @@ seen in~Table~\ref{table:cocicmp}.
         COCI (C)        &   1,094,394,688    \\
         \emph{refcat-doi} (R)   &   1,303,424,212    \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
         C $\cap$ R      &   1,007,539,966    \\
-        C $\cup$ R      &    1,390,278,521  \\
         C $\setminus$ R &      86,854,309  \\
         R $\setminus$ C & 295,884,246
     \end{tabular}
@@ -158,9 +157,6 @@ DOI prefix: 10.15468).}
 % zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
 % find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst
 
-TODO: how matches are established and a short note on overlap with COCI DOI.
-
-
 
 \section{System Design}
 
@@ -234,7 +230,7 @@ their expected or desired match status\footnote{The list can be found under:
 It is helpful to keep this test suite independent of any specific programming language.}.
 
 
-\section{Future Work}
+\section{Limitations and Future Work}
 
 As other dataset in this field we expect this dataset to be iterated upon.
 
@@ -258,10 +254,9 @@ As other dataset in this field we expect this dataset to be iterated upon.
         hand, this can hint at missing metadata. However, parts of the data
         will contain a reference to a catalogued entity, but in a specific,
         dense and harder to recover form.
-        This also include improvements to fuzzy matching code.
+        This also includes improvements to the fuzzy matching approach.
     \end{itemize}
 
-
 \section{Acknowledgements}
 
 This work is partially supported by a grant from the \emph{Andrew W. Mellon
-- 
cgit v1.2.3