-rw-r--r-- | docs/TR-20210730212057-IA-WDS-CG/main.pdf | bin | 85870 -> 93706 bytes
-rw-r--r-- | docs/TR-20210730212057-IA-WDS-CG/main.tex | 90
-rw-r--r-- | docs/TR-20210730212057-IA-WDS-CG/references.bib | 34
3 files changed, 115 insertions, 9 deletions
diff --git a/docs/TR-20210730212057-IA-WDS-CG/main.pdf b/docs/TR-20210730212057-IA-WDS-CG/main.pdf
Binary files differ
index f02852a..5a4201b 100644
--- a/docs/TR-20210730212057-IA-WDS-CG/main.pdf
+++ b/docs/TR-20210730212057-IA-WDS-CG/main.pdf
diff --git a/docs/TR-20210730212057-IA-WDS-CG/main.tex b/docs/TR-20210730212057-IA-WDS-CG/main.tex
index a6b5cd7..4b62d6c 100644
--- a/docs/TR-20210730212057-IA-WDS-CG/main.tex
+++ b/docs/TR-20210730212057-IA-WDS-CG/main.tex
@@ -89,11 +89,9 @@ composed of data gathered by the \href{https://fatcat.wiki}{fatcat cataloging pr
 web-scale crawls targeting primary and secondary scholarly outputs. In addition,
 relations are worked out between scholarly publications, web pages and their
 archived copies, books from the Open Library project as well as
-Wikipedia articles.
-
-As of version "20210810", the graph consists of over X nodes
+Wikipedia articles. This first version of the graph consists of over X nodes
 and over Y edges. We release this dataset under a Z open license under the
-collection at \href{https://archive.org/details/citation\_graph}{https://archive.org/details/citation\_graph}, as well as all code
+collection at \href{https://archive.org/details/TODO-citation\_graph}{https://archive.org/details/TODO-citation\_graph}, as well as all code
 used for derivation under an MIT license.
 \end{abstract}
 
@@ -117,16 +115,90 @@ were first devised, living on in existing commercial knowledge bases today.
 Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
 - the first version of which contained 6,325,178 individual
 references\citep{shotton2013publishing}. Other notable sources from that time
-include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}.
+include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
+decade has seen an increase in openly available reference datasets and
+citation projects, like Microsoft Academic\citep{sinha2015overview} and
+Initiative for Open Citations\citep{i4oc,shotton2018funders}. In 2021,
+according to \citep{hutchins2021tipping}, over 1B citations are publicly
+available, marking a tipping point for open citations.
+
+\section{Citation Graph Contents}
+
+% * edges
+% * edges exact
+% * edges fuzzy
+% * edges fuzzy reason (table)
+% * number of source docs
+% * number of target docs
+% * refs to papers
+% * refs to books
+% * refs to web pages
+% * refs to web pages that have been archived
+% * refs to web pages that have been archived but not on liveweb any more
+%
+% Overlaps
+%
+% * how many edges can be found in COCI as well
+% * how many edges can be found in MAG as well
+% * how many unique to us edges
+%
+% Additional numbers
+%
+% * number of unparsed refs
+% * "biblio" field distribution of unparsed refs
+%
+% Potential routes
+%
+% * journal abbreviation parsing with suffix arrays
+% * lookup by name, year and journal
+
+
+\section{System Design}
+
+TODO: describe limitations, single machine, prohibitive external data store
+lookups, and performance advantages of stream processing; “miniature
+map-reduce”, id based matching; fuzzy matching; funnel approach; data quality
+issues; live system design (es, pg, …)
+
+The constraints for the system design are informed by the volume and the
+variety of the data. In total, the raw inputs amount to about X TB uncompressed
+textual data. More importantly, while the number of data fields is low, over Y
+different combinations of fields are found in the raw reference data. Each
+combination of fields may require a slightly different processing path. For
+example, references with an arXiv identifier can be processed differently from
+references with only a title. We identify about X types of manifestations which
+in total account for Y\% of the reference documents.
+
+Overall, a map-reduce style approach is followed, which allows for some
+uniformity in the overall processing. We extract key value tuples (as TSV) from
+the raw JSON data and sort by key. Finally, we collect pairs with the same key
+into groups and apply a function to the elements of each group in order to
+generate our target schema (biblioref, called bref, for short).
+
+The key derivation can be exact (e.g. an id like doi or pmid) or based on a
+normalization procedure, like a slugified title string. For id based matches we
+can generate the bref schema directly. For fuzzy matching candidates, we pass
+possible match pairs through a verification procedure, which is implemented for
+documents of one specific catalog record schema.
+
+With a few schema conversions, fuzzy matching can be applied to Wikipedia
+articles and Open Library editions as well. The aspects of precision and recall
+are represented by the two stages: we are generous in the match candidate
+generation phase in order to improve recall, but we are strict during
+verification, in order to ensure precision.
+
+\section{Fuzzy Matching Approach}
+
+% Take sample of 100 docs, report some precision, recall, F1 on a hand curated small subset.
 
+\section{Discussion}
 
+% need to iterate
 
-%\lipsum[2]
-%\lipsum[3]
+%\lipsum[2] %\lipsum[3]
 
 
-% \section{Headings: first level}
-% \label{sec:headings}
+% \section{Headings: first level} % \label{sec:headings}
 %
 % \lipsum[4] See Section \ref{sec:headings}.
 %
diff --git a/docs/TR-20210730212057-IA-WDS-CG/references.bib b/docs/TR-20210730212057-IA-WDS-CG/references.bib
index 33ef997..cf61980 100644
--- a/docs/TR-20210730212057-IA-WDS-CG/references.bib
+++ b/docs/TR-20210730212057-IA-WDS-CG/references.bib
@@ -78,3 +78,37 @@
   year={2019}
 }
 
+@inproceedings{sinha2015overview,
+  title={An overview of microsoft academic service (mas) and applications},
+  author={Sinha, Arnab and Shen, Zhihong and Song, Yang and Ma, Hao and Eide, Darrin and Hsu, Bo-June and Wang, Kuansan},
+  booktitle={Proceedings of the 24th international conference on world wide web},
+  pages={243--246},
+  year={2015}
+}
+
+@misc{i4oc,
+  title = {Initiative for Open Citations},
+howpublished = {\url{https://i4oc.org/}},
+note = {Accessed: 2021-07-30}
+}
+
+@article{shotton2018funders,
+  title={Funders should mandate open citations.},
+  author={Shotton, David},
+  journal={Nature},
+  volume={553},
+  number={7686},
+  pages={129--130},
+  year={2018},
+  publisher={Nature Publishing Group}
+}
+
+@article{hutchins2021tipping,
+  title={A tipping point for open citation data},
+  author={Hutchins, B Ian},
+  journal={Quantitative Science Studies},
+  pages={1--5},
+  year={2021}
+}
+
+
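
The map-reduce style matching described in the System Design section of main.tex above (derive a key per document, sort by key, group, then verify fuzzy candidates) can be illustrated with a small sketch. This is not the project's actual code. It assumes in-memory Python lists rather than TSV files sorted on disk, and the helper names (`slugify_title`, `derive_key`, `verify`, `match`), the year-equality check, and the output fields `source`, `target` and `match` are hypothetical stand-ins, since the text does not spell out the bref schema or the verification rules.

```python
import itertools
import re
import unicodedata


def slugify_title(title):
    """Normalize a title into a fuzzy key: lowercase ASCII letters and digits only."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "", ascii_title.lower())


def derive_key(doc):
    """Return (key, kind): an exact id-based key if a DOI exists, else a title slug."""
    doi = doc.get("doi")
    if doi:
        return "doi:" + doi.lower(), "exact"
    title = doc.get("title")
    if title:
        return "title:" + slugify_title(title), "fuzzy"
    return None, None


def verify(ref, record):
    """Hypothetical strict check for fuzzy candidates: publication years must agree."""
    return ref.get("year") is not None and ref.get("year") == record.get("year")


def match(refs, records):
    """Map docs to keyed rows, sort by key, then pair refs and records per key group."""
    rows = []
    for side, docs in (("ref", refs), ("rec", records)):
        for doc in docs:
            key, kind = derive_key(doc)
            if key is not None:
                rows.append((key, kind, side, doc))
    rows.sort(key=lambda row: row[0])  # stands in for an external sort of TSV rows

    edges = []
    for _, group in itertools.groupby(rows, key=lambda row: row[0]):
        group = list(group)
        group_refs = [row for row in group if row[2] == "ref"]
        group_recs = [row for row in group if row[2] == "rec"]
        for _, kind, _, ref in group_refs:
            for _, _, _, rec in group_recs:
                # Exact id matches are accepted directly; fuzzy ones need verification.
                if kind == "exact" or verify(ref, rec):
                    edges.append(
                        {"source": ref["source"], "target": rec["id"], "match": kind}
                    )
    return edges


# Made-up example data: one DOI match, one accepted and one rejected title match.
refs = [
    {"source": "a", "doi": "10.1/X"},
    {"source": "b", "title": "Open Citations!", "year": 2013},
    {"source": "c", "title": "Open citations", "year": 1999},
]
records = [
    {"id": "r1", "doi": "10.1/x", "title": "Something"},
    {"id": "r2", "title": "Open Citations", "year": 2013},
]
edges = match(refs, records)
print(edges)
```

The generous key (a heavily normalized title) admits both `b` and `c` as candidates for `r2`, which favors recall. The strict `verify` step then drops `c`, which favors precision, mirroring the two stages named in the text.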