author Martin Czygan <martin.czygan@gmail.com> 2021-08-16 20:12:02 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-08-16 20:12:02 +0200
commit 4ba2ab5f9290865d2046fd3b38c6b68d167636d6
tree 98a2f379931368e81dd711ad37e866dc51b51563
parent eb6dec279d66d35433f0ea7df1c1399896b111ce
tweak README
-rw-r--r-- README.md 20
1 files changed, 13 insertions, 7 deletions
diff --git a/README.md b/README.md
index 15d84ce..5ab7e65 100644
--- a/README.md
+++ b/README.md
@@ -38,19 +38,25 @@ The high level goals of this project are:
The main challenges are:
-* currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
-* currently a single machine setup (16 cores, 16T disk; note: we compress with [zstd](https://github.com/facebook/zstd), which gives us about 5x the space)
+* currently 2.5B references documents (~1TB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
+* currently a single machine setup (16 cores, 16T disk; note: we compress with
+ [zstd](https://github.com/facebook/zstd), which gives us about 5x space, 2x
+ speedup)
* partial metadata (requiring separate code paths)
-* data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
+* data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc.
+ since a good chunk of the metadata comes from ML based [PDF metadata
+ extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
Internet Archive use cases for the output citation graph include:
-* discovery tool, e.g. "cited by ..." link for scholar.archive.org
-* lookup things citing this page/book/website/...
-* metadata discovery; e.g. identify popularly cited works which are missing (aka, have no "matched" record in the catalog)
-* Turn All References Blue (TARB)
+* discovery tool, e.g. "cited by ..." link on [fatcat.wiki](https://fatcat.wiki/release/bza3ovudezahlexibdtoytgtb4/refs-in)
+* lookup things cited by a [wikipedia page](https://fatcat.wiki/wikipedia/en:Internet/refs-out), papers citing [books](https://fatcat.wiki/openlibrary/OL2141999W/refs-in) or papers referencing web pages (wip)
+* metadata discovery; e.g. identify popularly cited works which are missing
+ (aka, have [no *matched*](https://git.archive.org/webgroup/refcat/-/blob/eb6dec279d66d35433f0ea7df1c1399896b111ce/python/refcat/tasks.py#L461-488)
+ record in the catalog)
+* Turn All References Blue (TARB, [notes](https://meta.wikimedia.org/wiki/GLAMTLV2018/Submissions/Turn_All_References_Blue!), [presentation](https://archive.org/details/mark-graham-presentation))
Original design documents for this project are included in the fatcat git repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md), [Reference Graph API and Schema (Jan 2021](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md)