author: Martin Czygan <martin.czygan@gmail.com>, 2021-08-08 15:18:29 +0200
committer: Martin Czygan <martin.czygan@gmail.com>, 2021-08-08 15:18:29 +0200
commit: bd66b58cded2c2c7e7b7e5d374434d6531dd70de
tree: 00417812b9787ab4492e2c590fcf1bf6f4b576e7 /docs
parent: bb64b3aa62267676302e75f0ca44157b514beec4
docs: cleanup and naming
Diffstat (limited to 'docs')
-rw-r--r-- docs/TR-20210730212057-IA-WDS-CG/.gitignore | 5
-rw-r--r-- docs/TR-20210730212057-IA-WDS-CG/Makefile | 9
-rw-r--r-- docs/TR-20210730212057-IA-WDS-CG/README.md | 49
-rw-r--r-- docs/TR-20210730212057-IA-WDS-CG/arxiv.sty | 262
-rw-r--r-- docs/TR-20210730212057-IA-WDS-CG/main.pdf | bin 99346 -> 0 bytes
-rw-r--r-- docs/TR-20210730212057-IA-WDS-CG/main.tex | 442
-rw-r--r-- docs/TR-20210730212057-IA-WDS-CG/references.bib | 123
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/.gitignore (renamed from docs/Simple/.gitignore) | 0
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/LICENSE (renamed from docs/Simple/LICENSE) | 0
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/Makefile (renamed from docs/Simple/Makefile) | 0
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/README.md (renamed from docs/Simple/README.md) | 0
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/figure.pdf (renamed from docs/Simple/figure.pdf) | bin 215353 -> 215353 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf (renamed from docs/Simple/main.pdf) | bin 95636 -> 95636 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex (renamed from docs/Simple/main.tex) | 0
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib (renamed from docs/Simple/refs.bib) | 0
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/simpleConference.sty (renamed from docs/Simple/simpleConference.sty) | 0
16 files changed, 0 insertions, 890 deletions
diff --git a/docs/TR-20210730212057-IA-WDS-CG/.gitignore b/docs/TR-20210730212057-IA-WDS-CG/.gitignore
deleted file mode 100644
index 5040d53..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/.gitignore
+++ /dev/null
@@ -1,5 +0,0 @@
-*.log
-*.aux
-*.bbl
-*.blg
-*.out
diff --git a/docs/TR-20210730212057-IA-WDS-CG/Makefile b/docs/TR-20210730212057-IA-WDS-CG/Makefile
deleted file mode 100644
index 9996575..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/Makefile
+++ /dev/null
@@ -1,9 +0,0 @@
-main.pdf: main.tex
- pdflatex main.tex
- bibtex main
- pdflatex main.tex
-
-
-.PHONY: clean
-clean:
- rm -f main.pdf
diff --git a/docs/TR-20210730212057-IA-WDS-CG/README.md b/docs/TR-20210730212057-IA-WDS-CG/README.md
deleted file mode 100644
index 54de590..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/README.md
+++ /dev/null
@@ -1,49 +0,0 @@
-
-## Description:
-
-The project hosts an aesthetic and simple LaTeX style suitable for "preprint" publications such as arXiv and bioRxiv.
-It is based on the [**nips_2018.sty**](https://media.nips.cc/Conferences/NIPS2018/Styles/nips_2018.sty) style.
-
-This styling maintains the esthetic of NIPS but adds and changes features to make it (IMO) even better and more suitable for preprints.
-The result looks sufficiently different from the NIPS style that readers won't mistake the preprint for a paper published in NIPS.
-
-### Why NIPS?
-Because the NIPS styling is a comfortable single column format that is very esthetic and convenient for reading.
-
-## Usage:
-1. Use Document class **article**.
-2. Copy **arxiv.sty** to the folder containing your tex file.
-3. Add `\usepackage{arxiv}` after `\documentclass{article}`.
-4. The only packages used in the style file are **geometry** and **fancyhdr**. Do not reimport them.
-
-See **template.tex**
-
-## Project files:
-1. **arxiv.sty** - the style file.
-2. **template.tex** - a sample template that uses the **arxiv style**.
-3. **references.bib** - the bibliography source file for template.tex.
-4. **template.pdf** - a sample output of the template file that demonstrates the design provided by the arxiv style.
-
-
-## Handling References when submitting to arXiv.org
-The most convenient way to manage references is to use an external BibTeX file and point to it from the main file.
-However, this requires running the [bibtex](http://www.bibtex.org/) tool to "compile" the `.bib` file into a `.bbl` file containing "bibitems" that can be inserted directly into the main tex file.
-Unfortunately, the arXiv TeX environment ([TeX Live](https://www.tug.org/texlive/)) does not do that.
-So the easiest way when submitting to arXiv is to create a single self-contained `.tex` file that contains the references.
-This can be done by running the BibTeX command on your machine, inserting the content of the generated `.bbl` file into the `.tex` file, and commenting out the `\bibliography{references}` command that points to the external references file.
-
-Below are the commands that should be run in the project folder:
-1. Run `$ latex template`
-2. Run `$ bibtex template`
-3. A `template.bbl` file will be generated (make sure it is there)
-4. Copy the content of the `template.bbl` file into `template.tex`, at the `\begin{thebibliography}` section.
-5. Comment out the `\bibliography{references}` command in `template.tex`.
-6. You are ready to submit to arXiv.org.
-
-
-## General Notes:
-1. For help, comments, praises, bug reporting or change requests, you can contact the author at: kourgeorge/at/gmail.com.
-2. You can use, redistribute and do whatever you like with this project; however, the author takes no responsibility for any use of this project.
-3. If you start another project based on this project, it would be nice to mention/link to this project.
-4. You are very welcome to contribute to this project.
-5. A good looking 2 column template can be found in https://github.com/brenhinkeller/preprint-template.tex.
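
Note on the README steps above: steps 4 and 5 (inlining the generated `.bbl` file and commenting out the `\bibliography{...}` command) could be scripted. The following is a minimal Python sketch and not part of the deleted README; the function name `inline_bbl` is hypothetical, and the sketch assumes the `\bibliography{...}` command sits on a line of its own.

```python
import re


def inline_bbl(tex: str, bbl: str) -> str:
    """Return tex with the \\bibliography{...} line commented out and the
    content of the .bbl file inserted right after it.

    Hypothetical helper for illustration only.
    """
    out = []
    for line in tex.splitlines():
        # Requires "{" directly after "\bibliography", so lines such as
        # "\bibliographystyle{unsrtnat}" are left untouched.
        if re.match(r"\s*\\bibliography\{", line):
            out.append("% " + line)
            out.append(bbl.rstrip("\n"))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Reading `template.tex` and `template.bbl` as strings, passing both through `inline_bbl` and writing the result to a new file would give a single self-contained source file for upload.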
diff --git a/docs/TR-20210730212057-IA-WDS-CG/arxiv.sty b/docs/TR-20210730212057-IA-WDS-CG/arxiv.sty
deleted file mode 100644
index ccb7feb..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/arxiv.sty
+++ /dev/null
@@ -1,262 +0,0 @@
-\NeedsTeXFormat{LaTeX2e}
-
-\ProcessOptions\relax
-
-% fonts
-\renewcommand{\rmdefault}{ptm}
-\renewcommand{\sfdefault}{phv}
-
-% set page geometry
-\usepackage[verbose=true,letterpaper]{geometry}
-\AtBeginDocument{
- \newgeometry{
- textheight=9in,
- textwidth=6.5in,
- top=1in,
- headheight=14pt,
- headsep=25pt,
- footskip=30pt
- }
-}
-
-\widowpenalty=10000
-\clubpenalty=10000
-\flushbottom
-\sloppy
-
-
-
-\newcommand{\headeright}{A Preprint}
-\newcommand{\undertitle}{A Preprint}
-\newcommand{\shorttitle}{\@title}
-
-\usepackage{fancyhdr}
-\fancyhf{}
-\pagestyle{fancy}
-\renewcommand{\headrulewidth}{0.4pt}
-\fancyheadoffset{0pt}
-\rhead{\scshape \footnotesize \headeright}
-\chead{\shorttitle}
-\cfoot{\thepage}
-
-
-%Handling Keywords
-\def\keywordname{{\bfseries \emph{Keywords}}}%
-\def\keywords#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm
-\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$
-}\noindent\keywordname\enspace\ignorespaces#1\par}}
-
-% font sizes with reduced leading
-\renewcommand{\normalsize}{%
- \@setfontsize\normalsize\@xpt\@xipt
- \abovedisplayskip 7\p@ \@plus 2\p@ \@minus 5\p@
- \abovedisplayshortskip \z@ \@plus 3\p@
- \belowdisplayskip \abovedisplayskip
- \belowdisplayshortskip 4\p@ \@plus 3\p@ \@minus 3\p@
-}
-\normalsize
-\renewcommand{\small}{%
- \@setfontsize\small\@ixpt\@xpt
- \abovedisplayskip 6\p@ \@plus 1.5\p@ \@minus 4\p@
- \abovedisplayshortskip \z@ \@plus 2\p@
- \belowdisplayskip \abovedisplayskip
- \belowdisplayshortskip 3\p@ \@plus 2\p@ \@minus 2\p@
-}
-\renewcommand{\footnotesize}{\@setfontsize\footnotesize\@ixpt\@xpt}
-\renewcommand{\scriptsize}{\@setfontsize\scriptsize\@viipt\@viiipt}
-\renewcommand{\tiny}{\@setfontsize\tiny\@vipt\@viipt}
-\renewcommand{\large}{\@setfontsize\large\@xiipt{14}}
-\renewcommand{\Large}{\@setfontsize\Large\@xivpt{16}}
-\renewcommand{\LARGE}{\@setfontsize\LARGE\@xviipt{20}}
-\renewcommand{\huge}{\@setfontsize\huge\@xxpt{23}}
-\renewcommand{\Huge}{\@setfontsize\Huge\@xxvpt{28}}
-
-% sections with less space
-\providecommand{\section}{}
-\renewcommand{\section}{%
- \@startsection{section}{1}{\z@}%
- {-2.0ex \@plus -0.5ex \@minus -0.2ex}%
- { 1.5ex \@plus 0.3ex \@minus 0.2ex}%
- {\large\bf\raggedright}%
-}
-\providecommand{\subsection}{}
-\renewcommand{\subsection}{%
- \@startsection{subsection}{2}{\z@}%
- {-1.8ex \@plus -0.5ex \@minus -0.2ex}%
- { 0.8ex \@plus 0.2ex}%
- {\normalsize\bf\raggedright}%
-}
-\providecommand{\subsubsection}{}
-\renewcommand{\subsubsection}{%
- \@startsection{subsubsection}{3}{\z@}%
- {-1.5ex \@plus -0.5ex \@minus -0.2ex}%
- { 0.5ex \@plus 0.2ex}%
- {\normalsize\bf\raggedright}%
-}
-\providecommand{\paragraph}{}
-\renewcommand{\paragraph}{%
- \@startsection{paragraph}{4}{\z@}%
- {1.5ex \@plus 0.5ex \@minus 0.2ex}%
- {-1em}%
- {\normalsize\bf}%
-}
-\providecommand{\subparagraph}{}
-\renewcommand{\subparagraph}{%
- \@startsection{subparagraph}{5}{\z@}%
- {1.5ex \@plus 0.5ex \@minus 0.2ex}%
- {-1em}%
- {\normalsize\bf}%
-}
-\providecommand{\subsubsubsection}{}
-\renewcommand{\subsubsubsection}{%
- \vskip5pt{\noindent\normalsize\rm\raggedright}%
-}
-
-% float placement
-\renewcommand{\topfraction }{0.85}
-\renewcommand{\bottomfraction }{0.4}
-\renewcommand{\textfraction }{0.1}
-\renewcommand{\floatpagefraction}{0.7}
-
-\newlength{\@abovecaptionskip}\setlength{\@abovecaptionskip}{7\p@}
-\newlength{\@belowcaptionskip}\setlength{\@belowcaptionskip}{\z@}
-
-\setlength{\abovecaptionskip}{\@abovecaptionskip}
-\setlength{\belowcaptionskip}{\@belowcaptionskip}
-
-% swap above/belowcaptionskip lengths for tables
-\renewenvironment{table}
- {\setlength{\abovecaptionskip}{\@belowcaptionskip}%
- \setlength{\belowcaptionskip}{\@abovecaptionskip}%
- \@float{table}}
- {\end@float}
-
-% footnote formatting
-\setlength{\footnotesep }{6.65\p@}
-\setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@}
-\renewcommand{\footnoterule}{\kern-3\p@ \hrule width 12pc \kern 2.6\p@}
-\setcounter{footnote}{0}
-
-% paragraph formatting
-\setlength{\parindent}{\z@}
-\setlength{\parskip }{5.5\p@}
-
-% list formatting
-\setlength{\topsep }{4\p@ \@plus 1\p@ \@minus 2\p@}
-\setlength{\partopsep }{1\p@ \@plus 0.5\p@ \@minus 0.5\p@}
-\setlength{\itemsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
-\setlength{\parsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
-\setlength{\leftmargin }{3pc}
-\setlength{\leftmargini }{\leftmargin}
-\setlength{\leftmarginii }{2em}
-\setlength{\leftmarginiii}{1.5em}
-\setlength{\leftmarginiv }{1.0em}
-\setlength{\leftmarginv }{0.5em}
-\def\@listi {\leftmargin\leftmargini}
-\def\@listii {\leftmargin\leftmarginii
- \labelwidth\leftmarginii
- \advance\labelwidth-\labelsep
- \topsep 2\p@ \@plus 1\p@ \@minus 0.5\p@
- \parsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
- \itemsep \parsep}
-\def\@listiii{\leftmargin\leftmarginiii
- \labelwidth\leftmarginiii
- \advance\labelwidth-\labelsep
- \topsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
- \parsep \z@
- \partopsep 0.5\p@ \@plus 0\p@ \@minus 0.5\p@
- \itemsep \topsep}
-\def\@listiv {\leftmargin\leftmarginiv
- \labelwidth\leftmarginiv
- \advance\labelwidth-\labelsep}
-\def\@listv {\leftmargin\leftmarginv
- \labelwidth\leftmarginv
- \advance\labelwidth-\labelsep}
-\def\@listvi {\leftmargin\leftmarginvi
- \labelwidth\leftmarginvi
- \advance\labelwidth-\labelsep}
-
-% create title
-\providecommand{\maketitle}{}
-\renewcommand{\maketitle}{%
- \par
- \begingroup
- \renewcommand{\thefootnote}{\fnsymbol{footnote}}
- % for perfect author name centering
- \renewcommand{\@makefnmark}{\hbox to \z@{$^{\@thefnmark}$\hss}}
- % The footnote-mark was overlapping the footnote-text,
- % added the following to fix this problem (MK)
- \long\def\@makefntext##1{%
- \parindent 1em\noindent
- \hbox to 1.8em{\hss $\m@th ^{\@thefnmark}$}##1
- }
- \thispagestyle{empty}
- \@maketitle
- \@thanks
- %\@notice
- \endgroup
- \let\maketitle\relax
- \let\thanks\relax
-}
-
-% rules for title box at top of first page
-\newcommand{\@toptitlebar}{
- \hrule height 2\p@
- \vskip 0.25in
- \vskip -\parskip%
-}
-\newcommand{\@bottomtitlebar}{
- \vskip 0.29in
- \vskip -\parskip
- \hrule height 2\p@
- \vskip 0.09in%
-}
-
-% create title (includes both anonymized and non-anonymized versions)
-\providecommand{\@maketitle}{}
-\renewcommand{\@maketitle}{%
- \vbox{%
- \hsize\textwidth
- \linewidth\hsize
- \vskip 0.1in
- \@toptitlebar
- \centering
- {\LARGE\sc \@title\par}
- \@bottomtitlebar
- \textsc{\undertitle}\\
- \vskip 0.1in
- \def\And{%
- \end{tabular}\hfil\linebreak[0]\hfil%
- \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
- }
- \def\AND{%
- \end{tabular}\hfil\linebreak[4]\hfil%
- \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
- }
- \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\@author\end{tabular}%
- \vskip 0.4in \@minus 0.1in \center{\@date} \vskip 0.2in
- }
-}
-
-% add conference notice to bottom of first page
-\newcommand{\ftype@noticebox}{8}
-\newcommand{\@notice}{%
- % give a bit of extra room back to authors on first page
- \enlargethispage{2\baselineskip}%
- \@float{noticebox}[b]%
- \footnotesize\@noticestring%
- \end@float%
-}
-
-% abstract styling
-\renewenvironment{abstract}
-{
- \centerline
- {\large \bfseries \scshape Abstract}
- \begin{quote}
-}
-{
- \end{quote}
-}
-
-\endinput
diff --git a/docs/TR-20210730212057-IA-WDS-CG/main.pdf b/docs/TR-20210730212057-IA-WDS-CG/main.pdf
deleted file mode 100644
index c8bb5a3..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/main.pdf
+++ /dev/null
Binary files differ
diff --git a/docs/TR-20210730212057-IA-WDS-CG/main.tex b/docs/TR-20210730212057-IA-WDS-CG/main.tex
deleted file mode 100644
index a7edac3..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/main.tex
+++ /dev/null
@@ -1,442 +0,0 @@
-\documentclass{article}
-
-
-
-\usepackage{arxiv}
-
-\usepackage[utf8]{inputenc} % allow utf-8 input
-\usepackage[T1]{fontenc} % use 8-bit T1 fonts
-\usepackage{hyperref} % hyperlinks
-\usepackage{url} % simple URL typesetting
-\usepackage{booktabs} % professional-quality tables
-\usepackage{amsfonts} % blackboard math symbols
-\usepackage{nicefrac} % compact symbols for 1/2, etc.
-\usepackage{microtype} % microtypography
-\usepackage{lipsum} % Can be removed after putting your text content
-\usepackage{graphicx}
-\usepackage{natbib}
-\usepackage{doi}
-
-\title{Internet Archive Scholar Citation Graph Dataset}
-
-\date{August 10, 2021} % Here you can change the date presented in the paper title
-%\date{} % Or removing it
-
-\author{ Martin Czygan \\
- Internet Archive\\
- San Francisco, CA 94118 \\
- \texttt{martin@archive.org} \\
- %% examples of more authors
- \And
- Bryan Newbold \\
- Internet Archive\\
- San Francisco, CA 94118 \\
- \texttt{bnewbold@archive.org} \\
- % \And
- % Helge Holzmann \\
- % Internet Archive\\
- % San Francisco, CA 94118 \\
- % \texttt{helge@archive.org} \\
- % \And
- % Jefferson Bailey \\
- % Internet Archive\\
- % San Francisco, CA 94118 \\
- % \texttt{jefferson@archive.org} \\
- %% \AND
- %% Coauthor \\
- %% Affiliation \\
- %% Address \\
- %% \texttt{email} \\
- %% \And
- %% Coauthor \\
- %% Affiliation \\
- %% Address \\
- %% \texttt{email} \\
- %% \And
- %% Coauthor \\
- %% Affiliation \\
- %% Address \\
- %% \texttt{email} \\
-}
-
-% Uncomment to remove the date
-%\date{}
-
-% Uncomment to override the `A preprint' in the header
-\renewcommand{\headeright}{Technical Report}
-\renewcommand{\undertitle}{Technical Report}
-% \renewcommand{\shorttitle}{\textit{arXiv} Template}
-
-%%% Add PDF metadata to help others organize their library
-%%% Once the PDF is generated, you can check the metadata with
-%%% $ pdfinfo template.pdf
-\hypersetup{
-pdftitle={Internet Archive Scholar Citation Graph Dataset},
-pdfsubject={cs.DL, cs.IR},
-pdfauthor={Martin Czygan, Bryan Newbold, Helge Holzmann, Jefferson Bailey},
-pdfkeywords={Web Archiving, Citation Graph},
-}
-
-\begin{document}
-\maketitle
-
-\begin{abstract}
-As part of its scholarly data efforts, the Internet Archive releases a citation
-graph dataset derived from scholarly publications and additional data sources. It is
-composed of data gathered by the \href{https://fatcat.wiki}{fatcat cataloging project} and related
-web-scale crawls targeting primary and secondary scholarly outputs. In
-addition, relations are worked out between scholarly publications, web pages
-and their archived copies, books from the Open Library project as well as
-Wikipedia articles. This first version of the graph consists of over X nodes
-and over Y edges. We release this dataset under a Z open license under the
-collection at \href{https://archive.org/details/TODO-citation\_graph}{https://archive.org/details/TODO-citation\_graph}, as well as all code
-used for derivation under an MIT license.
-\end{abstract}
-
-
-% keywords can be removed
-\keywords{Citation Graph \and Scholarly Communications \and Web Archiving}
-
-
-\section{Introduction}
-
-The Internet Archive releases a first version of a citation graph dataset
-derived from a raw corpus of about 2.5B references gathered from metadata and
-from data obtained by PDF extraction tools such as GROBID\citep{lopez2009grobid}.
-The goal of this report is to describe briefly the current contents and the
-derivation of the Archive Scholar Citations Dataset (ASC). We expect
-this dataset to be iterated upon, with changes both in content and processing.
-
-Modern citation indexes can be traced back to the early computing age, when
-projects like the Science Citation Index (1955)\citep{garfield2007evolution}
-were first devised, living on in existing commercial knowledge bases today.
-Open alternatives followed, such as the Open Citations Corpus (OCC) in 2010,
-the first version of which contained 6,325,178 individual
-references\citep{shotton2013publishing}. Other notable sources from that time
-include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
-decade has seen an increase in openly available reference datasets and
-citation projects, like Microsoft Academic\citep{sinha2015overview} and the
-Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
-according to \citet{hutchins2021tipping}, over 1B citations are publicly
-available, marking a tipping point for open citations.
-
-
-
-\section{Citation Graph Contents}
-
-
-
-% * edges
-% * edges exact
-% * edges fuzzy
-% * edges fuzzy reason (table)
-% * number of source docs
-% * number of target docs
-% * refs to papers
-% * refs to books
-% * refs to web pages
-% * refs to web pages that have been archived
-% * refs to web pages that have been archived but not on liveweb any more
-%
-% Overlaps
-%
-% * how many edges can be found in COCI as well
-% * how many edges can be found in MAG as well
-% * how many unique to us edges
-%
-% Additional numbers
-%
-% * number of unparsed refs
-% * "biblio" field distribution of unparsed refs
-%
-% Potential routes
-%
-% * journal abbreviation parsing with suffix arrays
-% * lookup by name, year and journal
-
-
-\section{System Design}
-
-The constraints for the system design are informed by the volume and the
-variety of the data. In total, the raw inputs amount to a few TB of textual
-content, mostly newline-delimited JSON. More importantly, while the number of
-data fields is low, many records are only partially populated, with hundreds of different
-combinations of available field values found in the raw reference data. This is
-most likely caused by aggregators passing on reference data coming from
-hundreds of sources, which do not necessarily agree on a common
-granularity for citation data, and by artifacts of machine-learning-based
-structured data extraction tools.
-
-Each combination of fields may require a slightly different processing path.
-For example, references with an arXiv identifier can be processed differently
-from references with only a title. Over 50\% of the raw reference data comes
-from a set of eight field combinations, as listed in
-Table~\ref{table:fields}.
-
-\begin{table}[]
- \begin{center}
- \begin{tabular}{ll}
-\toprule
- \textbf{Fields} & \textbf{Share} \\
-\midrule
- \multicolumn{1}{l}{CN|CRN|P|T|U|V|Y} & 14\% \\
- \multicolumn{1}{l}{DOI} & 14\% \\
- \multicolumn{1}{l}{CN|CRN|IS|P|T|U|V|Y} & 5\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|U|V|Y} & 4\% \\
- \multicolumn{1}{l}{PMID|U} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|T|V|Y} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|Y} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|V|Y} & 4\% \\
-\bottomrule
- \end{tabular}
- \vspace*{2mm}
- \caption{Top 8 combinations of available fields in raw reference data
- accounting for about 53\% of the total data (CN = container name, CRN =
-contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
-issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value.}
- \label{table:fields}
-\end{center}
-\end{table}
-
-Overall, a map-reduce style approach is followed, which allows for some
-uniformity in the overall processing. We extract (key, document) tuples (as
-TSV) from the raw JSON data and sort by key. Then we group documents with the
-same key and apply a function to each group in order to generate
-our target schema (currently named biblioref, or bref for short) or perform
-additional operations (such as deduplication).
-
-The key derivation can be exact (via an identifier like DOI, PMID, etc.) or
-based on a normalization procedure, like a slugified title string. For
-identifier-based matches we can generate the target biblioref schema directly.
-For fuzzy matching candidates, we pass possible match pairs through a
-verification procedure, which is implemented for release entity schema pairs.
-The current verification procedure is a domain-dependent, rule-based
-verification, able to identify different versions of a publication,
-preprint-published pairs or other kinds of similar documents by calculating
-similarity metrics across title and authors. The fuzzy matching approach is
-applied to all reference documents that have only a title and no identifier.
-
-With a few schema conversions, fuzzy matching can be applied to Wikipedia
-articles and Open Library (edition) records as well. Precision
-and recall are addressed by the two stages: we are generous in the match
-candidate generation phase in order to improve recall, but we are strict during
-verification, in order to control precision.
-
-\section{Fuzzy Matching Approach}
-
-% Take sample of 100 docs, report some precision, recall, F1 on a hand curated
-% small subset.
-
-The fuzzy matching approach currently implemented works in two phases: match
-candidate generation and verification. For candidate generation, we map each
-document to a key; documents sharing a key form a cluster of match candidates.
-We implemented a number of algorithms to derive these keys, e.g. title
-normalizations (including lowercasing, whitespace removal, Unicode
-normalization and other measures) or transformations like
-NYSIIS\citep{silbert1970world}.
-
-The verification approach is based on a set of rules, which are tested
-sequentially, yielding a match signal from weak to exact. We use a suite of
-over 300 manually curated match examples\footnote{The table can be found here:
-\href{https://gitlab.com/internetarchive/fuzzycat/-/blob/master/tests/data/verify.csv}{https://gitlab.com/internetarchive/fuzzycat/-/blob/master/tests/data/verify.csv}}
-as part of a unit test suite to allow for a controlled, continuous adjustment
-of the verification procedure. If the verification yields either an exact or
-strong signal, we consider it a match.
-
-We try to keep the processing steps performant to keep the overall derivation
-time limited. Map and reduce operations are parallelized and certain processing
-steps can process 100K documents per second or even more on commodity hardware
-with spinning disks.
-
-\section{Quality Assurance}
-
-Understanding data quality is important, as the data comes from a myriad of
-sources, each with possibly idiosyncratic features or missing values. We employ
-a few QA measures during the process. First, we try to pass each data item
-through only one processing pipeline (e.g. items matched by any identifier
-should not even be considered for fuzzy matching). If duplicate links appear in
-the final dataset nonetheless, we remove them, preferring exact over fuzzy matches.
-
-We employ a couple of data cleaning techniques, e.g. to find and verify
-identifiers like ISBN or to sanitize URLs found in the data. Many of the
-artifacts cleaned up this way stem from the fact that large chunks of the raw data come from
-heuristic data extraction from PDF documents.
-
-
-\section{Discussion}
-
-% need to iterate
-
-%\lipsum[2] %\lipsum[3]
-
-
-% \section{Headings: first level} % \label{sec:headings}
-%
-% \lipsum[4] See Section \ref{sec:headings}.
-%
-% \subsection{Headings: second level}
-% \lipsum[5]
-% \begin{equation}
-% \xi _{ij}(t)=P(x_{t}=i,x_{t+1}=j|y,v,w;\theta)= {\frac {\alpha _{i}(t)a^{w_t}_{ij}\beta _{j}(t+1)b^{v_{t+1}}_{j}(y_{t+1})}{\sum _{i=1}^{N} \sum _{j=1}^{N} \alpha _{i}(t)a^{w_t}_{ij}\beta _{j}(t+1)b^{v_{t+1}}_{j}(y_{t+1})}}
-% \end{equation}
-%
-% \subsubsection{Headings: third level}
-% \lipsum[6]
-%
-% \paragraph{Paragraph}
-% \lipsum[7]
-%
-%
-%
-% \section{Examples of citations, figures, tables, references}
-% \label{sec:others}
-%
-% \subsection{Citations}
-% Citations use \verb+natbib+. The documentation may be found at
-% \begin{center}
-% \url{http://mirrors.ctan.org/macros/latex/contrib/natbib/natnotes.pdf}
-% \end{center}
-%
-% Here is an example usage of the two main commands (\verb+citet+ and \verb+citep+): Some people thought a thing \citep{kour2014real, hadash2018estimate} but other people thought something else \citep{kour2014fast}. Many people have speculated that if we knew exactly why \citet{kour2014fast} thought this\dots
-%
-% \subsection{Figures}
-% \lipsum[10]
-% See Figure \ref{fig:fig1}. Here is how you add footnotes. \footnote{Sample of the first footnote.}
-% \lipsum[11]
-%
-% \begin{figure}
-% \centering
-% \fbox{\rule[-.5cm]{4cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
-% \caption{Sample figure caption.}
-% \label{fig:fig1}
-% \end{figure}
-%
-% \subsection{Tables}
-% See awesome Table~\ref{tab:table}.
-%
-% The documentation for \verb+booktabs+ (`Publication quality tables in LaTeX') is available from:
-% \begin{center}
-% \url{https://www.ctan.org/pkg/booktabs}
-% \end{center}
-%
-%
-% \begin{table}
-% \caption{Sample table title}
-% \centering
-% \begin{tabular}{lll}
-% \toprule
-% \multicolumn{2}{c}{Part} \\
-% \cmidrule(r){1-2}
-% Name & Description & Size ($\mu$m) \\
-% \midrule
-% Dendrite & Input terminal & $\sim$100 \\
-% Axon & Output terminal & $\sim$10 \\
-% Soma & Cell body & up to $10^6$ \\
-% \bottomrule
-% \end{tabular}
-% \label{tab:table}
-% \end{table}
-%
-% \subsection{Lists}
-% \begin{itemize}
-% \item Lorem ipsum dolor sit amet
-% \item consectetur adipiscing elit.
-% \item Aliquam dignissim blandit est, in dictum tortor gravida eget. In ac rutrum magna.
-% \end{itemize}
-
-
-\bibliographystyle{unsrtnat}
-\bibliography{references} %%% Uncomment this line and comment out the ``thebibliography'' section below to use the external .bib file (using bibtex) .
-
-
-%%% Uncomment this section and comment out the \bibliography{references} line above to use inline references.
-% \begin{thebibliography}{1}
-
-% \bibitem{kour2014real}
-% George Kour and Raid Saabne.
-% \newblock Real-time segmentation of on-line handwritten arabic script.
-% \newblock In {\em Frontiers in Handwriting Recognition (ICFHR), 2014 14th
-% International Conference on}, pages 417--422. IEEE, 2014.
-
-% \bibitem{kour2014fast}
-% George Kour and Raid Saabne.
-% \newblock Fast classification of handwritten on-line arabic characters.
-% \newblock In {\em Soft Computing and Pattern Recognition (SoCPaR), 2014 6th
-% International Conference of}, pages 312--318. IEEE, 2014.
-
-% \bibitem{hadash2018estimate}
-% Guy Hadash, Einat Kermany, Boaz Carmeli, Ofer Lavi, George Kour, and Alon
-% Jacovi.
-% \newblock Estimate and replace: A novel approach to integrating deep neural
-% networks with existing applications.
-% \newblock {\em arXiv preprint arXiv:1804.09028}, 2018.
-
-% \end{thebibliography}
-
-\section{Appendix}
-
-% Please add the following required packages to your document preamble:
-\begin{table}[]
- \begin{center}
-\begin{tabular}{@{}rlll@{}}
-\toprule
-\textbf{Number of matches} & \textbf{Citation Provenance} & \textbf{Match Status} & \textbf{Match Reason} \\ \midrule
-934932865 & crossref & exact & doi \\
-151366108 & fatcat-datacite & exact & doi \\
-65345275 & fatcat-pubmed & exact & pmid \\
-48778607 & fuzzy & strong & jaccardauthors \\
-42465250 & grobid & exact & doi \\
-29197902 & fatcat-pubmed & exact & doi \\
-19996327 & fatcat-crossref & exact & doi \\
-11996694 & fuzzy & strong & slugtitleauthormatch \\
-9157498 & fuzzy & strong & tokenizedauthors \\
-3547594 & grobid & exact & arxiv \\
-2310025 & fuzzy & exact & titleauthormatch \\
-1496515 & grobid & exact & pmid \\
-680722 & crossref & strong & jaccardauthors \\
-476331 & fuzzy & strong & versioneddoi \\
-449271 & grobid & exact & isbn \\
-230645 & fatcat-crossref & strong & jaccardauthors \\
-190578 & grobid & strong & jaccardauthors \\
-156657 & crossref & exact & isbn \\
-123681 & fatcat-pubmed & strong & jaccardauthors \\
-79328 & crossref & exact & arxiv \\
-57414 & crossref & strong & tokenizedauthors \\
-53480 & fuzzy & strong & pmiddoipair \\
-52453 & fuzzy & strong & dataciterelatedid \\
-47119 & grobid & strong & slugtitleauthormatch \\
-36774 & fuzzy & strong & arxivversion \\
-35311 & fuzzy & strong & customieeearxiv \\
-33863 & grobid & exact & pmcid \\
-23504 & crossref & strong & slugtitleauthormatch \\
-22753 & fatcat-crossref & strong & tokenizedauthors \\
-17720 & grobid & exact & titleauthormatch \\
-14656 & crossref & exact & titleauthormatch \\
-14438 & grobid & strong & tokenizedauthors \\
-7682 & fatcat-crossref & exact & arxiv \\
-5972 & fatcat-crossref & exact & isbn \\
-5525 & fatcat-pubmed & exact & arxiv \\
-4290 & fatcat-pubmed & strong & tokenizedauthors \\
-2745 & fatcat-pubmed & exact & isbn \\
-2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
-2273 & fatcat-crossref & strong & slugtitleauthormatch \\
-1960 & fuzzy & exact & workid \\
-1150 & fatcat-crossref & exact & titleauthormatch \\
-1041 & fatcat-pubmed & exact & titleauthormatch \\
-895 & fuzzy & strong & figshareversion \\
-317 & fuzzy & strong & titleartifact \\
-82 & grobid & strong & titleartifact \\
-33 & crossref & strong & titleartifact \\
-5 & fuzzy & strong & custombsiundated \\
-1 & fuzzy & strong & custombsisubdoc \\
-1 & fatcat & exact & doi \\ \bottomrule
-\end{tabular}
- \vspace*{2mm}
- \caption{Table of match counts, reference provenance, match status and
-match reason. Each match reason identifier encodes a specific rule in the domain-dependent
-verification process and is included for completeness; we do not
-include the details of each rule in this report.}
- \label{table:matches}
-\end{center}
-\end{table}
-
-
-\end{document}
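
Note on the System Design and Fuzzy Matching sections of the deleted main.tex above: the two-phase approach (generous candidate generation by key, strict verification) can be illustrated with a small Python sketch. This is not the project's implementation. The function names, the normalization details, the record fields `title` and `authors`, and the 0.5 threshold are all assumptions made for illustration, and a single author-overlap rule stands in for the full rule set.

```python
import json
import re
import unicodedata
from itertools import groupby


def title_key(title: str) -> str:
    """Hypothetical title normalization: Unicode-fold, lowercase, and
    drop everything except ASCII letters and digits."""
    folded = unicodedata.normalize("NFKD", title)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def verify(doc_a: dict, doc_b: dict, threshold: float = 0.5) -> str:
    """Toy verification rule (assumption): compare author name sets and
    report a "strong" or "weak" signal."""
    names_a = {name.lower() for name in doc_a.get("authors", [])}
    names_b = {name.lower() for name in doc_b.get("authors", [])}
    return "strong" if jaccard(names_a, names_b) >= threshold else "weak"


def candidate_pairs(lines):
    """Map each JSON line to a (key, document) tuple, sort by key, group
    documents sharing a key and yield every pair within a group."""
    keyed = []
    for line in lines:
        doc = json.loads(line)
        key = title_key(doc.get("title", ""))
        if key:  # skip documents without a usable title
            keyed.append((key, doc))
    keyed.sort(key=lambda kv: kv[0])
    for _, group in groupby(keyed, key=lambda kv: kv[0]):
        docs = [doc for _, doc in group]
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                yield docs[i], docs[j]
```

Feeding newline-delimited JSON records through `candidate_pairs` and keeping only the pairs for which `verify` returns "strong" mirrors the split described in the text: the key is deliberately coarse to favour recall, while the verification step decides what counts as a match.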
diff --git a/docs/TR-20210730212057-IA-WDS-CG/references.bib b/docs/TR-20210730212057-IA-WDS-CG/references.bib
deleted file mode 100644
index bcb8a16..0000000
--- a/docs/TR-20210730212057-IA-WDS-CG/references.bib
+++ /dev/null
@@ -1,123 +0,0 @@
-@inproceedings{kour2014real,
- title={Real-time segmentation of on-line handwritten arabic script},
- author={Kour, George and Saabne, Raid},
- booktitle={Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on},
- pages={417--422},
- year={2014},
- organization={IEEE}
-}
-
-@inproceedings{kour2014fast,
- title={Fast classification of handwritten on-line Arabic characters},
- author={Kour, George and Saabne, Raid},
- booktitle={Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of},
- pages={312--318},
- year={2014},
- organization={IEEE},
- doi={10.1109/SOCPAR.2014.7008025}
-}
-
-@article{hadash2018estimate,
- title={Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications},
- author={Hadash, Guy and Kermany, Einat and Carmeli, Boaz and Lavi, Ofer and Kour, George and Jacovi, Alon},
- journal={arXiv preprint arXiv:1804.09028},
- year={2018}
-}
-
-@article{garfield1955citation,
- title={Citation indexes for science},
- author={Garfield, Eugene},
- journal={Science},
- volume={122},
- number={3159},
- pages={108--111},
- year={1955},
- publisher={JSTOR}
-}
-
-@inproceedings{lopez2009grobid,
- title={GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications},
- author={Lopez, Patrice},
- booktitle={International conference on theory and practice of digital libraries},
- pages={473--474},
- year={2009},
- organization={Springer}
-}
-
-@article{garfield2007evolution,
- title={The evolution of the science citation index},
- author={Garfield, Eugene},
- journal={International microbiology},
- volume={10},
- number={1},
- pages={65},
- year={2007}
-}
-
-@article{shotton2013publishing,
- title={Publishing: open citations},
- author={Shotton, David},
- journal={Nature News},
- volume={502},
- number={7471},
- pages={295},
- year={2013}
-}
-
-@misc{CitEc,
- title = {Citations in Economics},
- howpublished = {\url{https://citec.repec.org/}},
- note = {Accessed: 2021-07-30}
-}
-
-@inproceedings{wu2019citeseerx,
- title={CiteSeerX: 20 years of service to scholarly big data},
- author={Wu, Jian and Kim, Kunho and Giles, C Lee},
- booktitle={Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse},
- pages={1--4},
- year={2019}
-}
-
-@inproceedings{sinha2015overview,
- title={An overview of microsoft academic service (mas) and applications},
- author={Sinha, Arnab and Shen, Zhihong and Song, Yang and Ma, Hao and Eide, Darrin and Hsu, Bo-June and Wang, Kuansan},
- booktitle={Proceedings of the 24th international conference on world wide web},
- pages={243--246},
- year={2015}
-}
-
-@misc{i4oc,
- title = {Initiative for Open Citations},
- howpublished = {\url{https://i4oc.org/}},
- note = {Accessed: 2021-07-30}
-}
-
-@article{shotton2018funders,
- title={Funders should mandate open citations.},
- author={Shotton, David},
- journal={Nature},
- volume={553},
- number={7686},
- pages={129--130},
- year={2018},
- publisher={Nature Publishing Group}
-}
-
-@article{hutchins2021tipping,
- title={A tipping point for open citation data},
- author={Hutchins, B Ian},
- journal={Quantitative Science Studies},
- pages={1--5},
- year={2021}
-}
-
-@article{silbert1970world,
- title={The World's First Computerized Criminal-Justice Information-Sharing System-The New York State Identification and Intelligence System (NYSIIS)},
- author={Silbert, Jeffrey M},
- journal={Criminology},
- volume={8},
- pages={107},
- year={1970},
- publisher={HeinOnline}
-}
-
diff --git a/docs/Simple/.gitignore b/docs/TR-20210808100000-IA-WDS-REFCAT/.gitignore
index 5040d53..5040d53 100644
--- a/docs/Simple/.gitignore
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/.gitignore
diff --git a/docs/Simple/LICENSE b/docs/TR-20210808100000-IA-WDS-REFCAT/LICENSE
index 9f5c70f..9f5c70f 100644
--- a/docs/Simple/LICENSE
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/LICENSE
diff --git a/docs/Simple/Makefile b/docs/TR-20210808100000-IA-WDS-REFCAT/Makefile
index 11264f8..11264f8 100644
--- a/docs/Simple/Makefile
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/Makefile
diff --git a/docs/Simple/README.md b/docs/TR-20210808100000-IA-WDS-REFCAT/README.md
index 3a56517..3a56517 100644
--- a/docs/Simple/README.md
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/README.md
diff --git a/docs/Simple/figure.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/figure.pdf
index b21876a..b21876a 100644
--- a/docs/Simple/figure.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/figure.pdf
Binary files differ
diff --git a/docs/Simple/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 3b431cc..3b431cc 100644
--- a/docs/Simple/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/Simple/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index e4febd9..e4febd9 100644
--- a/docs/Simple/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
diff --git a/docs/Simple/refs.bib b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
index c61021e..c61021e 100644
--- a/docs/Simple/refs.bib
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
diff --git a/docs/Simple/simpleConference.sty b/docs/TR-20210808100000-IA-WDS-REFCAT/simpleConference.sty
index d4d4764..d4d4764 100644
--- a/docs/Simple/simpleConference.sty
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/simpleConference.sty