From 4ba2ab5f9290865d2046fd3b38c6b68d167636d6 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Mon, 16 Aug 2021 20:12:02 +0200
Subject: tweak README

---
 README.md | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 15d84ce..5ab7e65 100644
--- a/README.md
+++ b/README.md
@@ -38,19 +38,25 @@ The high level goals of this project are:
 
 The main challenges are:
 
-* currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
-* currently a single machine setup (16 cores, 16T disk; note: we compress with [zstd](https://github.com/facebook/zstd), which gives us about 5x the space)
+* currently 2.5B references documents (~1TB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
+* currently a single machine setup (16 cores, 16T disk; note: we compress with
+  [zstd](https://github.com/facebook/zstd), which gives us about 5x space, 2x
+  speedup)
 * partial metadata (requiring separate code paths)
-* data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
+* data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc.
+  since a good chunk of the metadata comes from ML based [PDF metadata
+  extraction](https://grobid.readthedocs.io))
 * fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
 
 Internet Archive use cases for the output citation graph include:
 
-* discovery tool, e.g. "cited by ..." link for scholar.archive.org
-* lookup things citing this page/book/website/...
-* metadata discovery; e.g. identify popularly cited works which are missing (aka, have no "matched" record in the catalog)
-* Turn All References Blue (TARB)
+* discovery tool, e.g. "cited by ..." link on [fatcat.wiki](https://fatcat.wiki/release/bza3ovudezahlexibdtoytgtb4/refs-in)
+* lookup things cited by a [wikipedia page](https://fatcat.wiki/wikipedia/en:Internet/refs-out), papers citing [books](https://fatcat.wiki/openlibrary/OL2141999W/refs-in) or papers referencing web pages (wip)
+* metadata discovery; e.g. identify popularly cited works which are missing
+  (aka, have [no *matched*](https://git.archive.org/webgroup/refcat/-/blob/eb6dec279d66d35433f0ea7df1c1399896b111ce/python/refcat/tasks.py#L461-488)
+  record in the catalog)
+* Turn All References Blue (TARB, [notes](https://meta.wikimedia.org/wiki/GLAMTLV2018/Submissions/Turn_All_References_Blue!), [presentation](https://archive.org/details/mark-graham-presentation))
 
 Original design documents for this project are included in the fatcat git repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md), [Reference Graph API and Schema (Jan 2021](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md)
-- 
cgit v1.2.3