Diffstat (limited to 'docs')
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf bin 140069 -> 140069 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex 18
2 files changed, 9 insertions, 9 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 076b8f3..830f25f 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 35d73b1..0543612 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -98,13 +98,13 @@ reference entries, protocols or datasets. References can be extracted manually
or through more automated methods, by accessing relevant metadata or structured
data extraction from full text documents. Automated methods offer the benefits
of scalability. The completeness of bibliographic metadata in references ranges
-from documents with one or more persistant identifiers to raw, potentially
+from documents with one or more persistent identifiers to raw, potentially
unclean strings partially describing a scholarly artifact.
\section{Related Work}
Two typical problems in citation graph development are related to data
-aquisition and citation matching. Data acquisition itself can take different
+acquisition and citation matching. Data acquisition itself can take different
forms: bibliographic metadata can contain explicit reference data as provided
by publishers and aggregators; this data can be relatively consistent when
looked at per source, but may vary in style and comprehensiveness when looked
@@ -365,7 +365,7 @@ which is implemented for \emph{release entity}\footnote{\href{https://guide.fatc
domain dependent rule based verification, able to identify different versions
of a publication, preprint-published pairs and documents, which are
are similar by various metrics calculated over title and author fields. The fuzzy matching
-approach is applied on all reference documents without identifier (a title is
+approach is applied on all reference documents without any identifier (a title is
currently required).
We currently implement performance sensitive parts in the
@@ -383,7 +383,7 @@ GNU \emph{sort}~\citep{mcilroy1971research}.
During a last processing step, we fuse reference matches and unmatched items
into a single, indexable file. This step includes deduplication of different
matching methods (e.g. prefer exact matches over fuzzy matches). This file is
-indexed into an search index and serves both matched and unmatched references
+indexed into a search index and serves both matched and unmatched references
for the web application, allowing for further collection of feedback on match
quality and possible improvements.
@@ -405,11 +405,11 @@ As other dataset in this field we expect this dataset to be iterated upon.
\begin{itemize}
\item The fatcat catalog updates its metadata
- continously\footnote{A changelog can currenly be followed here:
+ continuously\footnote{A changelog can currently be followed here:
\href{https://fatcat.wiki/changelog}{https://fatcat.wiki/changelog}.} and web crawls are conducted
regularly. Current processing pipelines cover raw reference snapshot
- creation and derivation of the graph structure, which allows to rerun
- processing based on updated data as it becomes available.
+ creation and derivation of the graph structure, which allows to rerun the
+ processing pipeline based on updated data as it becomes available.
\item Metadata extraction from PDFs depends on supervised machine learning
models, which in turn depend on available training datasets. With additional crawls and
@@ -517,8 +517,8 @@ more easily (see~Table~\ref{table:matches}).
\caption{Table of match counts (top 25), reference provenance, match
status and match reason. Provenance currently can name the raw
origin (e.g. \emph{crossref}) or the method (e.g. \emph{fuzzy}). The match reason
- identifier encode a specific rule in the domain dependent
- verification process and are included for completeness - we do not
+ identifier encodes a specific rule in the domain dependent
+ verification process and is included for completeness - we do not
include the details of each rule in this report.}
\label{table:matches}
\end{center}