author | Martin Czygan <martin.czygan@gmail.com> | 2021-05-31 23:57:04 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-05-31 23:57:04 +0200 |
commit | 90c9c2ceea853794a0111d33eef15fcb3a6bf7d5 (patch) | |
tree | d09fe0cf8693f172388b823db3fef0ddd75158db | |
parent | 7fe8b79209a681a2f4d63a023922685b65bce203 (diff) | |
update notes
-rw-r--r-- | README.md | 13 |
1 files changed, 8 insertions, 5 deletions
@@ -13,15 +13,15 @@ Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
 The high level goals are:

 * deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
-* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other links
+* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other datasets (e.g. wikipedia)
 * publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)

 The main challenges are:

 * currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
 * currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
-* very partial metadata
-* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc.)
+* very partial metadata (requiring separate code paths)
+* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
 * fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)

 We use informal, internal versioning for the graph currently v2, next will be v3/v4.
@@ -42,7 +42,10 @@ We use informal, internal versioning for the graph currently v2, next will be v3

 * [ ] Harvest all URLs in citation corpus (maybe do a sample first)

-> A seed-list (from refs; not from the full-text) is done; need to prepare a crawl and lookups in GWB.
+> A seed-list (from refs; not from the full-text) is done; need to prepare a
+> crawl and lookups in GWB. In 05/2021 we did a test lookup of GWB index on the
+> cluster. A full lookup failed, due to [map
+> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).

 * [ ] Links between records w/o DOI (fuzzy matching)

@@ -51,7 +54,7 @@ We use informal, internal versioning for the graph currently v2, next will be v3
 * [ ] Publication of augmented citation graph, explore data mining, etc.
 * [ ] Interlinkage with other source, monographs, commercial publications, etc.

-> As of v3, we have a minimal linkage with wikipedia.
+> As of v3, we have a minimal linkage with wikipedia. In 05/2021 we extended Open Library matching (isbn, fuzzy matching)

 * [ ] Wikipedia (en) references metadata or archived record
