author Martin Czygan <martin@archive.org> 2021-07-02 18:33:02 +0000
committer Martin Czygan <martin@archive.org> 2021-07-02 18:33:02 +0000
commit 5504eacd27d6f3ea8d904904728d68efe85e4814
tree 3eaed91dc481b9a1574fd7d6c4d5268272fe25da
parent 88f62c95addbc44e185a4a61697497507db767f9
parent 9aa0256a5405cfa1ef19b400c345870df2b2e56b
Merge branch 'bnewbold-readme' into 'master'

updates to README for public sharing

See merge request martin/cgraph!2
-rw-r--r-- README.md 105
-rw-r--r-- TODO.md 44
2 files changed, 82 insertions, 67 deletions
diff --git a/README.md b/README.md
index 21828e4..0ee2e3c 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,29 @@
-# cgraph
-Scholarly citation graph related code; maintained by
-[martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
-all relevant code close.
+![](https://i.imgur.com/6dSaW2q.png)
+
+`cgraph`: large-scale citation graph generation tools
+=====================================================
+
+An assembly of software tools in Python and Go, which together are used to
+compile a citation graph with billions of edges (references) and hundreds of
+millions of nodes (papers).
+
+Maintained by [martin@archive.org](mailto:martin@archive.org) at the Internet
+Archive, as part of the [fatcat](https://fatcat.wiki) and
+[scholar.archive.org](https://scholar.archive.org) projects.
+
+Code is organized into sub-modules, with their own documentation:
-* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
+* [python/](python/README.md): mostly [luigi](https://github.com/spotify/luigi) tasks (using
[shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* [skate](skate): various Go command line tools (packaged as deb) for extracting keys, cleanup, join and serialization tasks
+* [skate/](skate/README.md): various Go command line tools (packaged as deb) for extracting keys, cleanup, join and serialization tasks
-Context: [fatcat](https://fatcat.wiki)
+The python code also builds on top of the [fuzzycat](https://pypi.org/project/fuzzycat/) library.
+
+As of June 2021, a copy of the citation graph has not been uploaded publicly, but is expected to be available soon.
+
+
+## Overview
The high level goals of this project are:
@@ -16,6 +31,7 @@ The high level goals of this project are:
* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other datasets (e.g. wikipedia)
* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
+
The main challenges are:
* currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
@@ -24,79 +40,34 @@ The main challenges are:
* data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
-We use informal, internal versioning for the graph currently v3, next will be v4/v5.
-
-![](https://i.imgur.com/6dSaW2q.png)
-
-# Grant related tasks
-
-3/4 phases of the grant contain citation graph related tasks.
-
-* [x] Link PID or DOI to archived versions
-
-> As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.
-
-* [ ] URLs in corpus linked to best possible timestamp (GWB)
-> CDX API probably good for sampling; we'll need to tap into `/user/wmdata2/cdx-all-index/` - (note: try pyspark)
+Internet Archive use cases for the output citation graph include:
-* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
+* discovery tool, e.g. "cited by ..." link for scholar.archive.org
+* lookup things citing this page/book/website/...
+* metadata discovery; e.g. identify popularly cited works which are missing (aka, have no "matched" record in the catalog)
+* Turn All References Blue (TARB)
-> A seed-list (from refs; not from the full-text) is done; need to prepare a
-> crawl and lookups in GWB. In 05/2021 we did a test lookup of GWB index on the
-> cluster. A full lookup failed, due to [map
-> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
+Original design documents for this project are included in the fatcat git repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md), [Reference Graph API and Schema (Jan 2021)](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md)
-* [ ] Links between records w/o DOI (fuzzy matching)
+## Progress
-> As of v2, we do have a fuzzy matching procedure (yielding about 5-10% of the total results).
-
-* [ ] Publication of augmented citation graph, explore data mining, etc.
-* [ ] Interlinkage with other source, monographs, commercial publications, etc.
-
-> As of v3, we have a minimal linkage with wikipedia. In 05/2021 we extended Open Library matching (isbn, fuzzy matching)
-
-* [ ] Wikipedia (en) references metadata or archived record
-
-> This is ongoing and should be part of v3.
-
-* [ ] Metadata records for often cited non-scholarly web publications
-* [ ] Collaborations: I4OC, wikicite
-
-We attended an online workshop in 09/2020, organized in part by OCI members;
-recording: [fatcat five minute
-intro](https://archive.org/details/fatcat_workshop_open_citations_open_scholarly_metadata_2020)
-
-# TODO
-
-* [ ] create a first index, ES7 [schema PR](https://git.archive.org/webgroup/fatcat/-/merge_requests/99)
-* [ ] build API, [spec notes](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md)
-
-# IA Use Cases
-
-* [ ] discovery tool, e.g. "cited by ..." link
-* [ ] things citing this page/book/...
-* [ ] metadata discovery; e.g. most cited w/o entry in catalog
-* [ ] Turn All References Blue (TARB)
-
-# Additional notes
-
-* [https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit](https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit)
-
-# Current status
+We use informal, internal versioning for the graph currently v3, next will be v4/v5.
-```
-$ refcat.pyz BiblioRefV2
-```
+Current status (version 2):
-* schema: [https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas)
* matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
* 785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed
-# Rough Notes
+Notes by iteration:
* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)
+## Support and Acknowledgements
+
+Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the ["Ensuring the Persistent Access of Open Access Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..9e002a7
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,44 @@
+
+# Grant related tasks
+
+3/4 phases of the grant contain citation graph related tasks.
+
+* [x] Link PID or DOI to archived versions
+
+> As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.
+
+* [ ] URLs in corpus linked to best possible timestamp (GWB)
+
+> CDX API probably good for sampling; we'll need to tap into `/user/wmdata2/cdx-all-index/` - (note: try pyspark)
+
+* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
+
+> A seed-list (from refs; not from the full-text) is done; need to prepare a
+> crawl and lookups in GWB. In 05/2021 we did a test lookup of GWB index on the
+> cluster. A full lookup failed, due to [map
+> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
+
+* [ ] Links between records w/o DOI (fuzzy matching)
+
+> As of v2, we do have a fuzzy matching procedure (yielding about 5-10% of the total results).
+
+* [ ] Publication of augmented citation graph, explore data mining, etc.
+* [ ] Interlinkage with other source, monographs, commercial publications, etc.
+
+> As of v3, we have a minimal linkage with wikipedia. In 05/2021 we extended Open Library matching (isbn, fuzzy matching)
+
+* [ ] Wikipedia (en) references metadata or archived record
+
+> This is ongoing and should be part of v3.
+
+* [ ] Metadata records for often cited non-scholarly web publications
+* [ ] Collaborations: I4OC, wikicite
+
+We attended an online workshop in 09/2020, organized in part by OCI members;
+recording: [fatcat five minute
+intro](https://archive.org/details/fatcat_workshop_open_citations_open_scholarly_metadata_2020)
+
+# TODO
+
+* [ ] create a first index, ES7 [schema PR](https://git.archive.org/webgroup/fatcat/-/merge_requests/99)
+* [ ] build API, [spec notes](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md)