# cgraph

Scholarly citation graph related code; maintained by
[martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
all relevant code close:

* python: mostly luigi tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
* skate: various Go command line tools (wrapped in a deb package)

Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21).

We use informal, internal versioning, currently v2, next will be v3.

# Grant related tasks

3/4 phases of the grant contain citation graph related tasks.

* [x] Link PID or DOI to archived versions

  As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.

* [ ] URLs in corpus linked to best possible timestamp (GWB)
* [ ] Harvest all URLs in citation corpus (maybe do a sample first)

  A seed-list (from refs; not from the full-text) is done; need to prepare a crawl and lookups in GWB.

* [ ] Links between records w/o DOI (fuzzy matching)

  As of v2, we do have a fuzzy matching procedure (yielding about 5-10% of the total results).

* [ ] Publication of augmented citation graph, explore data mining, etc.
* [ ] Interlinkage with other sources, monographs, commercial publications, etc.

  As of v3, we have a minimal linkage with wikipedia.

* [ ] Wikipedia (en) references metadata or archived record

  This is ongoing and should be part of v3.

* [ ] Metadata records for often cited non-scholarly web publications
* [ ] Collaborations: I4OC, wikicite

# IA Use Cases

* [ ] discovery tool, e.g. "cited by ..." link
* [ ] things citing this page/book/...
* [ ] metadata discovery; e.g.
most cited w/o entry in catalog

# Additional notes

* [https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit](https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit)

# Current status

```
$ refcat.pyz BiblioRefV2
```

* schema: [https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas)
* matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
* 785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed

# Rough Notes

* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)
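
As a rough illustration of the match order listed under "Current status" (doi, arxiv, pmid, pmcid, then fuzzy title), here is a minimal Python sketch. It is not the refcat or skate implementation: the dict field names (`ident`, `doi`, `title`, ...), the helper names, and the use of a normalized-title lookup as a stand-in for real fuzzy matching are all assumptions made for this example.

```python
import re
import unicodedata

# Identifier fields tried in order; exact identifier matches win over titles.
# Field names are hypothetical and chosen for this sketch only.
ID_FIELDS = ("doi", "arxiv", "pmid", "pmcid")


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def build_index(releases):
    """Map (field, value) to a release ident, for identifiers and titles."""
    index = {}
    for rel in releases:
        for field in ID_FIELDS:
            value = rel.get(field)
            if value:
                index[(field, value.lower())] = rel["ident"]
        if rel.get("title"):
            # First release seen for a given normalized title wins.
            index.setdefault(("title", normalize_title(rel["title"])), rel["ident"])
    return index


def match_ref(ref, index):
    """Return (target ident, match type) for a reference, or None."""
    for field in ID_FIELDS:
        value = ref.get(field)
        if value and (field, value.lower()) in index:
            return index[(field, value.lower())], field
    title = ref.get("title")
    if title:
        key = ("title", normalize_title(title))
        if key in index:
            # Crude stand-in for a real fuzzy matching procedure.
            return index[key], "fuzzy"
    return None


releases = [
    {"ident": "r1", "doi": "10.1000/ABC", "title": "Deep Learning"},
    {"ident": "r2", "title": "Étude of Graphs: A Survey"},
]
index = build_index(releases)
print(match_ref({"doi": "10.1000/abc"}, index))
print(match_ref({"title": "etude of graphs  a survey"}, index))
```

Trying identifiers first keeps the cheap, high-precision matches separate from title-based ones, so each edge can carry the match type that produced it.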