update notes

author: Martin Czygan <martin.czygan@gmail.com> 2021-05-31 23:57:04 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-05-31 23:57:04 +0200
commit: 90c9c2ceea853794a0111d33eef15fcb3a6bf7d5 (patch)
tree: d09fe0cf8693f172388b823db3fef0ddd75158db
parent: 7fe8b79209a681a2f4d63a023922685b65bce203 (diff)
download: refcat-90c9c2ceea853794a0111d33eef15fcb3a6bf7d5.tar.gz refcat-90c9c2ceea853794a0111d33eef15fcb3a6bf7d5.zip
1 files changed, 8 insertions, 5 deletions
diff --git a/README.md b/README.md
index 68b6649..eb07572 100644
--- a/README.md
+++ b/README.md
@@ -13,15 +13,15 @@ Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
 The high level goals are:
 
 * deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
-* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other links
+* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other datasets (e.g. wikipedia)
 * publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
 
 The main challenges are:
 
 * currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
 * currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
-* very partial metadata
-* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc.)
+* very partial metadata (requiring separate code paths)
+* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
 * fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
 
 We use informal, internal versioning for the graph currently v2, next will be v3/v4.
@@ -42,7 +42,10 @@ We use informal, internal versioning for the graph currently v2, next will be v3
 
 * [ ] Harvest all URLs in citation corpus (maybe do a sample first)
 
-> A seed-list (from refs; not from the full-text) is done; need to prepare a crawl and lookups in GWB.
+> A seed-list (from refs; not from the full-text) is done; need to prepare a
+> crawl and lookups in GWB. In 05/2021 we did a test lookup of GWB index on the
+> cluster. A full lookup failed, due to [map
+> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
 
 * [ ] Links between records w/o DOI (fuzzy matching)
 
@@ -51,7 +54,7 @@ We use informal, internal versioning for the graph currently v2, next will be v3
 * [ ] Publication of augmented citation graph, explore data mining, etc.
 * [ ] Interlinkage with other source, monographs, commercial publications, etc.
 
-> As of v3, we have a minimal linkage with wikipedia.
+> As of v3, we have a minimal linkage with wikipedia. In 05/2021 we extended Open Library matching (isbn, fuzzy matching)
 
 * [ ] Wikipedia (en) references metadata or archived record
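
The "difficult data quality" item in the first hunk mentions that URLs, DOI and ISBN need extra care to extract. As a rough illustration of what that care can look like, here is a minimal sketch in Python. It is not taken from refcat; the function names (`extract_identifiers`, `valid_isbn`), the regular expressions and the trailing-punctuation heuristic are all assumptions made for this example.

```python
import re

# Heuristic patterns (assumptions for this sketch, not refcat's own rules).
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')
URL_RE = re.compile(r'https?://[^\s"<>]+')
# Optional 978/979 prefix, then ten ISBN characters with optional
# hyphen or space separators; the last character may be an X.
ISBN_RE = re.compile(r'\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b')

# Punctuation that often trails an identifier in a reference string.
# Stripping it is a heuristic: a real DOI may legitimately end in ")".
TRAILING = '.,;:)]}'


def valid_isbn(candidate):
    """Return True if candidate passes the ISBN-10 or ISBN-13 checksum."""
    s = re.sub(r'[- ]', '', candidate).upper()
    if len(s) == 13 and s.isdigit():
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; sum must be 0 mod 10.
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
        return total % 10 == 0
    if len(s) == 10 and s[:9].isdigit() and s[9] in '0123456789X':
        # ISBN-10: weights 10 down to 1; X counts as 10; sum must be 0 mod 11.
        last = 10 if s[9] == 'X' else int(s[9])
        total = sum(int(c) * (10 - i) for i, c in enumerate(s[:9])) + last
        return total % 11 == 0
    return False


def extract_identifiers(text):
    """Collect DOI, ISBN and URL candidates from an unstructured reference."""
    dois = [m.rstrip(TRAILING) for m in DOI_RE.findall(text)]
    urls = [m.rstrip(TRAILING) for m in URL_RE.findall(text)]
    isbns = [
        re.sub(r'[- ]', '', m).upper()
        for m in ISBN_RE.findall(text)
        if valid_isbn(m)
    ]
    return {"doi": dois, "isbn": isbns, "url": urls}
```

The checksum step is there because any ten-digit run in a noisy reference string would otherwise be reported as an ISBN. A DOI embedded in a URL is reported under both keys; deduplication would be a separate step.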