# cgraph
Scholarly citation graph related code, maintained by
[martin@archive.org](mailto:martin@archive.org); organized into multiple
subprojects to keep all relevant code in one place.
* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
[shiv](https://github.com/linkedin/shiv) for single-file deployments)
* [skate](skate): various Go command line tools (packaged as deb) for key extraction, cleanup, join, and serialization tasks
Context: [fatcat](https://fatcat.wiki)
The high level goals of this project are:
* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
* besides paper-to-paper links, the graph should also contain paper-to-book (Open Library), paper-to-webpage (Wayback Machine), and other links (e.g. Wikipedia)
* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
The main challenges are:
* currently 1.8B reference documents (~800GB raw textual data); possibly growing to 2-4B (1-2TB raw textual data)
* currently a single machine setup (16 cores, 16T disk; note: we compress with [zstd](https://github.com/facebook/zstd), which gives us about 5x the effective space; see the streaming sketch after this list)
* partial metadata (requiring separate code paths)
* data quality issues (e.g. extra care is needed when extracting URLs, DOIs, ISBNs, etc., since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
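Most processing runs as streams over zstd-compressed JSON lines files, so the raw data never has to be fully decompressed on disk. A minimal sketch of this pattern (the file name and the `biblio.doi` field here are illustrative, not the actual refs schema):

```
import io
import json

import zstandard  # pip install zstandard


def stream_jsonlines(path):
    """Yield parsed documents from a zstd-compressed JSON lines file,
    decompressing on the fly without materializing the data on disk."""
    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical file and field names, for illustration only.
    for doc in stream_jsonlines("refs.jsonl.zst"):
        doi = (doc.get("biblio") or {}).get("doi")
        if doi:
            print(doi.strip().lower())
```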
We use informal, internal versioning for the graph; currently at v3, next will be v4/v5.
![](https://i.imgur.com/6dSaW2q.png)
# Grant related tasks
Three of the four phases of the grant contain citation graph related tasks.
* [x] Link PID or DOI to archived versions
> As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.
* [ ] URLs in corpus linked to best possible timestamp (GWB)
> CDX API probably good for sampling; for the full corpus we'll need to tap into `/user/wmdata2/cdx-all-index/` (note: try pyspark); see the sketch at the end of this section.
* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
> A seed-list (from refs; not from the full-text) is done; we need to prepare a
> crawl and lookups in GWB. In 05/2021 we did a test lookup of the GWB index on the
> cluster. A full lookup failed due to [map
> spill](https://community.cloudera.com/t5/Support-Questions/Explain-process-of-spilling-in-Hadoop-s-map-reduce-program/m-p/237246/highlight/true#M199059).
* [ ] Links between records w/o DOI (fuzzy matching)
> As of v2, we do have a fuzzy matching procedure (yielding about 5-10% of the total results).
* [ ] Publication of augmented citation graph, explore data mining, etc.
* [ ] Interlinkage with other sources: monographs, commercial publications, etc.
> As of v3, we have a minimal linkage with Wikipedia. In 05/2021 we extended Open Library matching (isbn, fuzzy matching).
* [ ] Wikipedia (en) references metadata or archived record
> This is ongoing and should be part of v3.
* [ ] Metadata records for often cited non-scholarly web publications
* [ ] Collaborations: I4OC, wikicite
> We attended an online workshop in 09/2020, organized in part by OCI members;
> recording: [fatcat five minute
> intro](https://archive.org/details/fatcat_workshop_open_citations_open_scholarly_metadata_2020)
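For the timestamp linkage task above, the public Wayback CDX API is one way to sample captures per URL. A minimal sketch, assuming we simply take the first successful capture as the "best" timestamp (the real heuristic, and the full-corpus variant against the internal index, would differ):

```
import json
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"


def captures(url, limit=5):
    """Query the public Wayback CDX API for captures of a URL.
    The first row of the JSON response is the header row."""
    query = urllib.parse.urlencode({
        "url": url,
        "output": "json",
        "limit": limit,
        "fl": "timestamp,original,statuscode",
    })
    with urllib.request.urlopen("{}?{}".format(CDX_API, query)) as resp:
        rows = json.loads(resp.read())
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


if __name__ == "__main__":
    # Print the first successful capture, as a stand-in "best" timestamp.
    for c in captures("https://example.com"):
        if c.get("statuscode") == "200":
            print(c["timestamp"], c["original"])
            break
```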
# TODO
* [ ] create a first index, ES7 [schema PR](https://git.archive.org/webgroup/fatcat/-/merge_requests/99)
* [ ] build API, [spec notes](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md)
# IA Use Cases
* [ ] discovery tool, e.g. "cited by ..." link
* [ ] things citing this page/book/...
* [ ] metadata discovery; e.g. most cited w/o entry in catalog
* [ ] Turn All References Blue (TARB)
# Additional notes
* [https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit](https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit)
# Current status
```
$ refcat.pyz BiblioRefV2
```
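`refcat.pyz` is the shiv-packaged single-file build of the python subproject, and `BiblioRefV2` is one of its luigi tasks. For orientation, a minimal sketch of what a task in this style looks like (task and file names below are hypothetical, not the actual refcat definitions):

```
import luigi


class ExtractKeys(luigi.Task):
    """Hypothetical upstream task producing (key, document) lines."""

    def output(self):
        return luigi.LocalTarget("keys.tsv")

    def run(self):
        with self.output().open("w") as out:
            out.write("10.1234/example\t{}\n")


class BiblioRef(luigi.Task):
    """Hypothetical downstream task depending on the key extraction step."""

    def requires(self):
        return ExtractKeys()

    def output(self):
        return luigi.LocalTarget("biblioref.jsonl")

    def run(self):
        with self.input().open() as inp, self.output().open("w") as out:
            for line in inp:
                key, doc = line.rstrip("\n").split("\t", 1)
                out.write(doc + "\n")


if __name__ == "__main__":
    luigi.build([BiblioRef()], local_scheduler=True)
```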
* schema: [https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas)
* matches via: doi, arxiv, pmid, pmcid, fuzzy title matches (see the sketch below)
* 785,569,011 edges (~103% of the 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed
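Identifier matches are exact joins on normalized keys; the fuzzy title matches cluster documents by a normalized title key and then verify each cluster. A minimal sketch of such a key function (the normalization rules here are illustrative, not the exact ones the skate tools use):

```
import re
import unicodedata


def title_key(title):
    """Reduce a title to a clustering key: NFKD-fold to ASCII,
    lowercase, and drop whitespace and punctuation, so titles that
    differ only in case, accents, or punctuation share a key."""
    folded = unicodedata.normalize("NFKD", title)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", folded.lower())


assert title_key("The Origin of Species") == title_key("the ORIGIN, of species!")
```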
# Rough Notes
* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)