author Martin Czygan <martin.czygan@gmail.com> 2021-08-07 05:15:15 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-08-07 05:15:15 +0200
commit 425a737c4635c96dad7a8fb296b4c1394874c54e (patch)
tree b8c3949dc56342e37bacc24d32f6e4b6230e8dcd
parent 95f21aea779b304f4f43aed12ef27ee8c954b0c8 (diff)
wip: paper
-rw-r--r-- docs/Simple/main.pdf bin 91896 -> 91566 bytes
-rw-r--r-- docs/Simple/main.tex 15
2 files changed, 5 insertions, 10 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 45cb024..3aac27e 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
Binary files differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 1796902..9efe9c2 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -65,7 +65,7 @@ data obtained by PDF extraction tools such as
GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia.
The goal of this report is to describe briefly the current contents and the
-derivation of the dataset (refcat). We expect
+derivation of the dataset. We expect
this dataset to be iterated upon, with changes both in content and processing.
Modern citation indexes can be traced back to the early computing age, when
@@ -75,8 +75,8 @@ Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
- the first version of which contained 6,325,178 individual
references\citep{shotton2013publishing}. Other notable early projects
include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
-decade has seen the emergance of more openly available, large scale
-citation projects, like Microsoft Academic\citep{sinha2015overview} or the
+decade has seen the emergence of more openly available, large scale
+citation projects like Microsoft Academic\citep{sinha2015overview} or the
Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
according to \citep{hutchins2021tipping} over 1B citations are publicly
available, marking a tipping point for this category of data.
@@ -141,7 +141,6 @@ seen in~Table~\ref{table:cocicmp}.
COCI (C) & 1,094,394,688 \\
\emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
C $\cap$ R & 1,007,539,966 \\
- C $\cup$ R & 1,390,278,521 \\
C $\setminus$ R & 86,854,309 \\
R $\setminus$ C & 295,884,246
\end{tabular}
@@ -158,9 +157,6 @@ DOI prefix: 10.15468).}
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst
-TODO: how matches are established and a short note on overlap with COCI DOI.
-
-
\section{System Design}
@@ -234,7 +230,7 @@ their expected or desired match status\footnote{The list can be found under:
It is helpful to keep this test suite independent of any specific programming language.}.
-\section{Future Work}
+\section{Limitations and Future Work}
As other dataset in this field we expect this dataset to be iterated upon.
@@ -258,10 +254,9 @@ As other dataset in this field we expect this dataset to be iterated upon.
hand, this can hint at missing metadata. However, parts of the data
will contain a reference to a catalogued entity, but in a specific,
dense and harder to recover form.
- This also include improvements to fuzzy matching code.
+ This also include improvements to the fuzzy matching approach.
\end{itemize}
-
\section{Acknowledgements}
This work is partially supported by a grant from the \emph{Andrew W. Mellon