Diffstat (limited to 'docs')
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin | 140069 -> 140069 bytes
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 18
2 files changed, 9 insertions, 9 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
index 076b8f3..830f25f 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 35d73b1..0543612 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -98,13 +98,13 @@ reference entries, protocols or datasets. References can be extracted manually or
 through more automated methods, by accessing relevant metadata or structured
 data extraction from full text documents. Automated methods offer the benefits
 of scalability. The completeness of bibliographic metadata in references ranges
-from documents with one or more persistant identifiers to raw, potentially
+from documents with one or more persistent identifiers to raw, potentially
 unclean strings partially describing a scholarly artifact.
 
 \section{Related Work}
 
 Two typical problems in citation graph development are related to data
-aquisition and citation matching. Data acquisition itself can take different
+acquisition and citation matching. Data acquisition itself can take different
 forms: bibliographic metadata can contain explicit reference data as provided
 by publishers and aggregators; this data can be relatively consistent when
 looked at per source, but may vary in style and comprehensiveness when looked
@@ -365,7 +365,7 @@ which is implemented for \emph{release entity}\footnote{\href{https://guide.fatc
 domain dependent rule based verification, able to identify different versions
 of a publication, preprint-published pairs and documents, which are are similar
 by various metrics calculated over title and author fields. The fuzzy matching
-approach is applied on all reference documents without identifier (a title is
+approach is applied on all reference documents without any identifier (a title is
 currently required).
 
 We currently implement performance sensitive parts in the
@@ -383,7 +383,7 @@ GNU \emph{sort}~\citep{mcilroy1971research}.
 During a last processing step, we fuse reference matches and unmatched items
 into a single, indexable file. This step includes deduplication of different
 matching methods (e.g. prefer exact matches over fuzzy matches). This file is
-indexed into an search index and serves both matched and unmatched references
+indexed into a search index and serves both matched and unmatched references
 for the web application, allowing for further collection of feedback on match
 quality and possible improvements.
 
@@ -405,11 +405,11 @@ As other dataset in this field we expect this dataset to be iterated upon.
 
 \begin{itemize}
     \item The fatcat catalog updates its metadata
-    continously\footnote{A changelog can currenly be followed here:
+    continuously\footnote{A changelog can currently be followed here:
     \href{https://fatcat.wiki/changelog}{https://fatcat.wiki/changelog}.} and web crawls are conducted
     regularly. Current processing pipelines cover raw reference snapshot
-    creation and derivation of the graph structure, which allows to rerun
-    processing based on updated data as it becomes available.
+    creation and derivation of the graph structure, which allows to rerun the
+    processing pipeline based on updated data as it becomes available.
     \item Metadata extraction from PDFs depends on supervised machine learning
     models, which in turn depend on available training datasets. With additional
     crawls and
@@ -517,8 +517,8 @@ more easily (see~Table~\ref{table:matches}).
     \caption{Table of match counts (top 25), reference provenance, match
     status and match reason. Provenance currently can name the raw origin (e.g.
     \emph{crossref}) or the method (e.g. \emph{fuzzy}). The match reason
-    identifier encode a specific rule in the domain dependent
-    verification process and are included for completeness - we do not
+    identifier encodes a specific rule in the domain dependent
+    verification process and is included for completeness - we do not
     include the details of each rule in this report.}
     \label{table:matches}
     \end{center}