wip: paper

author: Martin Czygan <martin.czygan@gmail.com> 2021-08-07 05:15:15 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-08-07 05:15:15 +0200
commit: 425a737c4635c96dad7a8fb296b4c1394874c54e (patch)
tree: b8c3949dc56342e37bacc24d32f6e4b6230e8dcd /docs
parent: 95f21aea779b304f4f43aed12ef27ee8c954b0c8 (diff)
download: refcat-425a737c4635c96dad7a8fb296b4c1394874c54e.tar.gz
refcat-425a737c4635c96dad7a8fb296b4c1394874c54e.zip
2 files changed, 5 insertions, 10 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 45cb024..3aac27e 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 1796902..9efe9c2 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -65,7 +65,7 @@ data obtained by PDF extraction tools such as
 GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
 metadata from Open Library and Wikipedia.
 The goal of this report is to describe briefly the current contents and the
-derivation of the dataset (refcat). We expect
+derivation of the dataset. We expect
 this dataset to be iterated upon, with changes both in content and processing.
 
 Modern citation indexes can be traced back to the early computing age, when
@@ -75,8 +75,8 @@ Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
 - the first version of which contained 6,325,178 individual
 references\citep{shotton2013publishing}. Other notable early projects
 include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
-decade has seen the emergance of more openly available, large scale
-citation projects, like Microsoft Academic\citep{sinha2015overview} or the
+decade has seen the emergence of more openly available, large scale
+citation projects like Microsoft Academic\citep{sinha2015overview} or the
 Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
 according to \citep{hutchins2021tipping} over 1B citations are publicly
 available, marking a tipping point for this category of data.
@@ -141,7 +141,6 @@ seen in~Table~\ref{table:cocicmp}.
         COCI (C)        &   1,094,394,688    \\
         \emph{refcat-doi} (R)   &   1,303,424,212    \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
         C $\cap$ R      &   1,007,539,966    \\
-        C $\cup$ R      &    1,390,278,521  \\
         C $\setminus$ R &      86,854,309  \\
         R $\setminus$ C & 295,884,246
     \end{tabular}
@@ -158,9 +157,6 @@ DOI prefix: 10.15468).}
 % zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
 % find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst
 
-TODO: how matches are established and a short note on overlap with COCI DOI.
-
-
 
 \section{System Design}
 
@@ -234,7 +230,7 @@ their expected or desired match status\footnote{The list can be found under:
 It is helpful to keep this test suite independent of any specific programming language.}.
 
 
-\section{Future Work}
+\section{Limitations and Future Work}
 
 As other dataset in this field we expect this dataset to be iterated upon.
 
@@ -258,10 +254,9 @@ As other dataset in this field we expect this dataset to be iterated upon.
         hand, this can hint at missing metadata. However, parts of the data
         will contain a reference to a catalogued entity, but in a specific,
         dense and harder to recover form.
-        This also include improvements to fuzzy matching code.
+        This also include improvements to the fuzzy matching approach.
     \end{itemize}
 
-
 \section{Acknowledgements}
 
 This work is partially supported by a grant from the \emph{Andrew W. Mellon