From 9aa0256a5405cfa1ef19b400c345870df2b2e56b Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 1 Jul 2021 18:26:25 -0700
Subject: updates to README for public sharing

---
 README.md | 105 +++++++++++++++++++++++---------------------------------------
 TODO.md   |  44 ++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 67 deletions(-)
 create mode 100644 TODO.md

diff --git a/README.md b/README.md
index 21828e4..0ee2e3c 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,29 @@
-# cgraph
-Scholarly citation graph related code; maintained by
-[martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
-all relevant code close.
+![](https://i.imgur.com/6dSaW2q.png)
+
+`cgraph`: large-scale citation graph generation tools
+=====================================================
+
+A collection of software tools in Python and Go, used together to compile a
+citation graph with billions of edges (references) and hundreds of millions
+of nodes (papers).
+
+Maintained by [martin@archive.org](mailto:martin@archive.org) at the Internet
+Archive, as part of the [fatcat](https://fatcat.wiki) and
+[scholar.archive.org](https://scholar.archive.org) projects.
+
+Code is organized into sub-modules, each with its own documentation:
 
-* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
+* [python/](python/README.md): mostly [luigi](https://github.com/spotify/luigi) tasks (using
   [shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* [skate](skate): various Go command line tools (packaged as deb) for extracting keys, cleanup, join and serialization tasks
+* [skate/](skate/README.md): various Go command line tools (packaged as deb) for key extraction, cleanup, join, and serialization tasks
 
-Context: [fatcat](https://fatcat.wiki)
+The Python code also builds on top of the [fuzzycat](https://pypi.org/project/fuzzycat/) library.
+
+As of June 2021, a copy of the citation graph has not yet been published, but is expected to be available soon.
+
+
+## Overview
 
 The high level goals of this project are:
 
@@ -16,6 +31,7 @@ The high level goals of this project are:
 * beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other datasets (e.g. wikipedia)
 * publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
 
+
 The main challenges are:
 
 * currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
@@ -24,79 +40,34 @@ The main challenges are:
 * data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
 * fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
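+
+Fuzzy matching at this scale works roughly as a cluster-then-verify process:
+documents are grouped under a normalized key, and only candidate pairs within
+a group are verified. A minimal sketch of the idea (illustrative only; the
+key and verify functions here are hypothetical stand-ins, not the actual
+fuzzycat/skate logic):
+
+```python
+import itertools
+import re
+from collections import defaultdict
+
+def title_key(doc):
+    # crude normalization; the real key functions are more robust
+    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())[:64]
+
+def verify(a, b):
+    # placeholder check; real verification also weighs authors, identifiers, etc.
+    return a.get("year") == b.get("year")
+
+docs = [
+    {"title": "On Citation Graphs", "year": 2020},
+    {"title": "On citation graphs!", "year": 2020},
+    {"title": "Something Else", "year": 1999},
+]
+
+clusters = defaultdict(list)
+for doc in docs:
+    key = title_key(doc)
+    if key:
+        clusters[key].append(doc)
+
+matches = [
+    (a, b)
+    for cluster in clusters.values()
+    for a, b in itertools.combinations(cluster, 2)
+    if verify(a, b)
+]
+print(matches)  # the two "On Citation Graphs" records pair up
+```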
-We use informal, internal versioning for the graph currently v3, next will be v4/v5.
-
-![](https://i.imgur.com/6dSaW2q.png)
-
-# Grant related tasks
-
-3/4 phases of the grant contain citation graph related tasks.
-
-* [x] Link PID or DOI to archived versions
-
-> As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.
-
-* [ ] URLs in corpus linked to best possible timestamp (GWB)
-> CDX API probably good for sampling; we'll need to tap into `/user/wmdata2/cdx-all-index/` - (note: try pyspark)
+Internet Archive use cases for the output citation graph include:
+
-* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
+* discovery tool, e.g. "cited by ..." link for scholar.archive.org
+* look up things citing a given page/book/website/...
+* metadata discovery; e.g. identifying popularly cited works which are missing (i.e., have no "matched" record in the catalog)
+* Turn All References Blue (TARB)
-> A seed-list (from refs; not from the full-text) is done; need to prepare a
-> crawl and lookups in GWB. In 05/2021 we did a test lookup of GWB index on the
-> cluster. A full lookup failed, due to [map
-> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
+
+Original design documents for this project are included in the fatcat git repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md), [Reference Graph API and Schema (Jan 2021)](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md)
-* [ ] Links between records w/o DOI (fuzzy matching)
+
+## Progress
-> As of v2, we do have a fuzzy matching procedure (yielding about 5-10% of the total results).
-
-* [ ] Publication of augmented citation graph, explore data mining, etc.
-* [ ] Interlinkage with other source, monographs, commercial publications, etc.
-
-> As of v3, we have a minimal linkage with wikipedia. In 05/2021 we extended Open Library matching (isbn, fuzzy matching)
-
-* [ ] Wikipedia (en) references metadata or archived record
-
-> This is ongoing and should be part of v3.
-
-* [ ] Metadata records for often cited non-scholarly web publications
-* [ ] Collaborations: I4OC, wikicite
-
-We attended an online workshop in 09/2020, organized in part by OCI members;
-recording: [fatcat five minute
-intro](https://archive.org/details/fatcat_workshop_open_citations_open_scholarly_metadata_2020)
-
-# TODO
-
-* [ ] create a first index, ES7 [schema PR](https://git.archive.org/webgroup/fatcat/-/merge_requests/99)
-* [ ] build API, [spec notes](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md)
-
-# IA Use Cases
-
-* [ ] discovery tool, e.g. "cited by ..." link
-* [ ] things citing this page/book/...
-* [ ] metadata discovery; e.g. most cited w/o entry in catalog
-* [ ] Turn All References Blue (TARB)
-
-# Additional notes
-
-* [https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit](https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit)
-
-# Current status
+We use informal, internal versioning for the graph: currently v3; next will be v4/v5.
-
-```
-$ refcat.pyz BiblioRefV2
-```
+
+Current status (version 2):
 
-* schema: [https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas)
 * matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
 * 785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed
 
-# Rough Notes
+Notes by iteration:
 
 * [python/notes/version_0.md](python/notes/version_0.md)
 * [python/notes/version_1.md](python/notes/version_1.md)
 * [python/notes/version_2.md](python/notes/version_2.md)
 * [python/notes/version_3.md](python/notes/version_3.md)
+
+## Support and Acknowledgements
+
+Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the ["Ensuring the Persistent Access of Open Access Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..9e002a7
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,44 @@
+
+# Grant related tasks
+
+Three of the four phases of the grant contain citation-graph-related tasks.
+
+* [x] Link PID or DOI to archived versions
+
+> As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.
+
+* [ ] URLs in corpus linked to best possible timestamp (GWB)
+
+> CDX API is probably good for sampling (see the sketch below); for the full corpus we'll need to tap into `/user/wmdata2/cdx-all-index/` (note: try pyspark)
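+
+For sampling, something like the following against the public Wayback CDX API
+might work (an illustrative sketch; the bulk index path above would serve the
+full corpus):
+
+```python
+import json
+from urllib.parse import urlencode
+from urllib.request import urlopen
+
+def sample_captures(url, limit=5):
+    # query the public Wayback CDX API for a few captures of one URL
+    params = urlencode({
+        "url": url,
+        "output": "json",
+        "limit": limit,
+        "filter": "statuscode:200",
+    })
+    with urlopen("https://web.archive.org/cdx/search/cdx?" + params) as resp:
+        rows = json.load(resp)
+    if not rows:
+        return []
+    header, data = rows[0], rows[1:]  # first row holds the column names
+    return [dict(zip(header, row)) for row in data]
+
+for capture in sample_captures("example.com"):
+    print(capture["timestamp"], capture["original"])
+```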
+
+* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
+
+> A seed-list (from refs; not from the full-text) is done; need to prepare a
+> crawl and lookups in GWB. In 05/2021 we did a test lookup of the GWB index on the
+> cluster. A full lookup failed due to [map
+> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
+
+* [ ] Links between records w/o DOI (fuzzy matching)
+
+> As of v2, we have a fuzzy matching procedure (yielding about 5-10% of the total results).
+
+* [ ] Publication of augmented citation graph, explore data mining, etc.
+* [ ] Interlinkage with other sources, monographs, commercial publications, etc.
+
+> As of v3, we have a minimal linkage with Wikipedia. In 05/2021 we extended Open Library matching (ISBN, fuzzy matching).
+
+* [ ] Wikipedia (en) references metadata or archived record
+
+> This is ongoing and should be part of v3.
+
+* [ ] Metadata records for often-cited non-scholarly web publications
+* [ ] Collaborations: I4OC, wikicite
+
+We attended an online workshop in 09/2020, organized in part by OCI members;
+recording: [fatcat five minute
+intro](https://archive.org/details/fatcat_workshop_open_citations_open_scholarly_metadata_2020)
+
+# TODO
+
+* [ ] create a first index, ES7 [schema PR](https://git.archive.org/webgroup/fatcat/-/merge_requests/99)
+* [ ] build API, [spec notes](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md)
--
cgit v1.2.3