author     Martin Czygan <martin.czygan@gmail.com>  2021-05-31 23:57:04 +0200
committer  Martin Czygan <martin.czygan@gmail.com>  2021-05-31 23:57:04 +0200
commit     90c9c2ceea853794a0111d33eef15fcb3a6bf7d5 (patch)
tree       d09fe0cf8693f172388b823db3fef0ddd75158db
parent     7fe8b79209a681a2f4d63a023922685b65bce203 (diff)
update notes
-rw-r--r--  README.md  13
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/README.md b/README.md
index 68b6649..eb07572 100644
--- a/README.md
+++ b/README.md
@@ -13,15 +13,15 @@ Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
The high level goals are:
* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
-* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other links
+* besides paper-to-paper links, the graph should also contain paper-to-book (Open Library), paper-to-webpage (Wayback Machine), and links into other datasets (e.g. Wikipedia)
* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
The main challenges are:
* currently 1.8B reference documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
* currently a single-machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
-* very partial metadata
-* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc.)
+* very partial metadata (requiring separate code paths)
+* difficult data quality (e.g. extra care is needed to extract URLs, DOIs, ISBNs, etc., since about 800M metadata docs come from ML-based [PDF metadata extraction](https://grobid.readthedocs.io); see the extraction sketch after this hunk)
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
We use informal, internal versioning for the graph; currently v2, next will be v3/v4.
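
The identifier extraction challenge above is easiest to see with a small example. Below is a minimal sketch of regex-based extraction, assuming plain reference strings as input; the patterns and function names are illustrative, not refcat's actual code, and real inputs (especially GROBID output) need more careful handling:

```python
import re

# Illustrative patterns; real extraction needs more care than this
# (DOI suffix edge cases, ISBN check digits, URL boundaries, ...).
DOI_PAT = re.compile(r'10\.\d{4,9}/[^\s"<>]+')
URL_PAT = re.compile(r'https?://[^\s"<>]+')
ISBN_PAT = re.compile(r'\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dX]\b')

def _clean(match: str) -> str:
    # Regexes over free text tend to swallow trailing punctuation.
    return match.rstrip('.,;)')

def extract_identifiers(raw: str) -> dict:
    """Pull candidate DOIs, URLs and ISBNs from one raw reference string."""
    return {
        "dois": [_clean(m) for m in DOI_PAT.findall(raw)],
        "urls": [_clean(m) for m in URL_PAT.findall(raw)],
        "isbns": ISBN_PAT.findall(raw),
    }

print(extract_identifiers("Doe 2001, https://doi.org/10.1000/182, ISBN 978-3-16-148410-0."))
```

Even this toy version needs the `_clean` step; at 1.8B references, every such edge case shows up millions of times.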
@@ -42,7 +42,10 @@ We use informal, internal versioning for the graph currently v2, next will be v3
* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
-> A seed-list (from refs; not from the full-text) is done; need to prepare a crawl and lookups in GWB.
+> A seed-list (from refs; not from the full-text) is done; we still need to
+> prepare a crawl and lookups in GWB. In 05/2021 we did a test lookup of the
+> GWB index on the cluster. A full lookup failed due to [map
+> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
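
Independent of the cluster-side index, small samples can be checked against the public Wayback CDX API. A rough sketch, assuming the standard CDX query parameters; this suits spot checks only, not a full lookup at this scale:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def wayback_captures(url: str, limit: int = 5) -> list:
    """Ask the public Wayback CDX API for captures of a URL."""
    api = ("https://web.archive.org/cdx/search/cdx"
           f"?url={quote(url)}&output=json&limit={limit}")
    with urlopen(api) as resp:
        body = resp.read()
    rows = json.loads(body) if body.strip() else []
    # First row is a header like ["urlkey", "timestamp", "original", ...].
    return rows[1:]

for capture in wayback_captures("example.com"):
    print(capture)
```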
* [ ] Links between records w/o DOI (fuzzy matching)
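
For records without a DOI, matching typically falls back to normalized titles plus a similarity threshold. A toy sketch using only the standard library; the normalization rules and threshold are illustrative, not the project's actual verification logic:

```python
import re
from difflib import SequenceMatcher

def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    t = re.sub(r'[^a-z0-9 ]', ' ', title.lower())
    return re.sub(r'\s+', ' ', t).strip()

def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Crude verification: similarity ratio over normalized titles."""
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= threshold

print(titles_match("The Citation Graph: A Survey.", "the citation graph -- a survey"))  # True
```

At the stated scale (verifying ~1M clustered documents per minute), pairwise comparison is out; in practice candidates are first grouped by a blocking key, e.g. the normalized title itself.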
@@ -51,7 +54,7 @@ We use informal, internal versioning for the graph currently v2, next will be v3
* [ ] Publication of augmented citation graph, explore data mining, etc.
* [ ] Interlinkage with other sources, monographs, commercial publications, etc.
-> As of v3, we have a minimal linkage with wikipedia.
+> As of v3, we have a minimal linkage with Wikipedia. In 05/2021 we extended Open Library matching (ISBN, fuzzy matching); a sketch of the ISBN normalization step follows below.
* [ ] Wikipedia (en) references metadata or archived record
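
As a footnote to the Open Library matching note above: the first step is usually a canonical ISBN key. A minimal sketch, where the ISBN-10 to ISBN-13 upgrade is the standard algorithm and the lookup uses Open Library's public ISBN endpoint; the function names are illustrative, not the matching pipeline's actual code:

```python
import json
from urllib.request import urlopen

def normalize_isbn(raw: str) -> str:
    """Strip separators and upgrade ISBN-10 to ISBN-13 as a canonical key."""
    digits = raw.replace("-", "").replace(" ", "").upper()
    if len(digits) == 10:
        core = "978" + digits[:9]  # drop the ISBN-10 check digit
        check = (10 - sum(int(d) * (3 if i % 2 else 1)
                          for i, d in enumerate(core)) % 10) % 10
        return core + str(check)
    return digits

def open_library_edition(isbn13: str) -> dict:
    """Fetch edition metadata from Open Library's public ISBN endpoint."""
    with urlopen(f"https://openlibrary.org/isbn/{isbn13}.json") as resp:
        return json.load(resp)

key = normalize_isbn("0-14-032872-6")
print(key)  # 9780140328721
print(open_library_edition(key).get("title"))
```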