From 425a737c4635c96dad7a8fb296b4c1394874c54e Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Sat, 7 Aug 2021 05:15:15 +0200
Subject: wip: paper

---
 docs/Simple/main.pdf | Bin 91896 -> 91566 bytes
 docs/Simple/main.tex | 15 +++++----------
 2 files changed, 5 insertions(+), 10 deletions(-)

(limited to 'docs/Simple')

diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 45cb024..3aac27e 100644
Binary files a/docs/Simple/main.pdf and b/docs/Simple/main.pdf differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 1796902..9efe9c2 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -65,7 +65,7 @@ data obtained by PDF extraction tools such as GROBID\cite{lopez2009grobid}.
 Additionally, we consider integration with metadata from Open Library and
 Wikipedia.
 The goal of this report is to describe briefly the current contents and the
-derivation of the dataset (refcat). We expect
+derivation of the dataset. We expect
 this dataset to be iterated upon, with changes both in content and processing.
 
 Modern citation indexes can be traced back to the early computing age, when
@@ -75,8 +75,8 @@ Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
 - the first version of which contained 6,325,178 individual
 references\citep{shotton2013publishing}. Other notable early projects include
 CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
-decade has seen the emergance of more openly available, large scale
-citation projects, like Microsoft Academic\citep{sinha2015overview} or the
+decade has seen the emergence of more openly available, large scale
+citation projects like Microsoft Academic\citep{sinha2015overview} or the
 Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
 according to \citep{hutchins2021tipping} over 1B citations are publicly
 available, marking a tipping point for this category of data.
@@ -141,7 +141,6 @@ seen in~Table~\ref{table:cocicmp}.
 COCI (C) & 1,094,394,688 \\
 \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
 C $\cap$ R & 1,007,539,966 \\
- C $\cup$ R & 1,390,278,521 \\
 C $\setminus$ R & 86,854,309 \\
 R $\setminus$ C & 295,884,246
 \end{tabular}
@@ -158,9 +157,6 @@ DOI prefix: 10.15468).}
 % zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
 % find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst
 
-TODO: how matches are established and a short note on overlap with COCI DOI.
-
-
 \section{System Design}
 
 
@@ -234,7 +230,7 @@ their expected or desired match status\footnote{The list can be found under:
 It is helpful to keep this test suite independent of any specific programming
 language.}.
 
-\section{Future Work}
+\section{Limitations and Future Work}
 
 As other dataset in this field we expect this dataset to be iterated upon.
 
@@ -258,10 +254,9 @@ As other dataset in this field we expect this dataset to be iterated upon.
 hand, this can hint at missing metadata. However, parts of the data will
 contain a reference to a catalogued entity, but in a specific, dense and
 harder to recover form.
- This also include improvements to fuzzy matching code.
+ This also include improvements to the fuzzy matching approach.
 \end{itemize}
 
-
 \section{Acknowledgements}
 
 This work is partially supported by a grant from the \emph{Andrew W. Mellon
-- 
cgit v1.2.3