# cgraph

Code related to the scholarly citation graph; maintained by
[martin@archive.org](mailto:martin@archive.org). Multiple subprojects keep
all relevant code close together:

* python: mostly luigi tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
* skate: various Go command line tools (packaged as a deb)

Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21).

# Grant related tasks

Three of the four grant phases contain citation graph related tasks.

* [ ] Link PID or DOI to archived versions
* [ ] URLs in corpus linked to best possible timestamp (GWB)
* [ ] Harvest all URLs in citation corpus (maybe do a sample first)
* [ ] Links between records w/o DOI (fuzzy matching)
* [ ] Publication of augmented citation graph, explore data mining, etc.
* [ ] Interlinkage with other sources, monographs, commercial publications, etc.
* [ ] Wikipedia (en) references metadata or archived record
* [ ] Metadata records for often cited non-scholarly web publications
* [ ] Collaborations: I4OC, wikicite

# IA Use Cases

* [ ] discovery tool, e.g. "cited by ..." link
* [ ] things citing this page/book/...
* [ ] metadata discovery; e.g. most cited w/o entry in catalog

# Additional notes

* [https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit](https://docs.google.com/document/d/1vg_q0lxp6CrGGFS4rR06_TbiROh9nj7UV5NFvueLRn0/edit)

# Current status

```
$ refcat.pyz BiblioRefV2
```

* schema: [https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas](https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md#schemas)
* matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
* 785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed
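
The match order listed above can be pictured as a cascade: try exact identifiers first and fall back to a title comparison. The snippet below is only a minimal sketch of that idea, not the code used in refcat or skate. The function names, the record layout (plain dicts keyed by `doi`, `arxiv`, `pmid`, `pmcid`, `title`) and the 0.9 similarity threshold are all assumptions made for illustration, and `difflib` stands in for whatever fuzzy matching the project actually uses.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Identifier fields tried in order; the order itself is an assumption.
ID_FIELDS = ("doi", "arxiv", "pmid", "pmcid")


def normalize_title(title: str) -> str:
    """Lowercase and collapse runs of non-alphanumeric characters to a space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_reason(ref: dict, release: dict, threshold: float = 0.9) -> Optional[str]:
    """Return the name of the first criterion that links ref to release, or None.

    Hypothetical helper: identifiers are compared case-insensitively, titles
    via a similarity ratio on normalized strings.
    """
    for field in ID_FIELDS:
        a, b = ref.get(field), release.get(field)
        if a and b and a.strip().lower() == b.strip().lower():
            return field
    ta, tb = ref.get("title"), release.get("title")
    if ta and tb:
        ratio = SequenceMatcher(None, normalize_title(ta), normalize_title(tb)).ratio()
        if ratio >= threshold:
            return "fuzzy-title"
    return None


if __name__ == "__main__":
    ref = {"title": "Deep Learning: A Review"}
    release = {"doi": "10.1000/abc", "title": "deep learning a review"}
    print(match_reason(ref, release))
```

Trying identifiers before titles keeps the cheap, unambiguous matches separate from the fuzzy ones, so each edge can carry the reason it was created.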

# Rough Notes

* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)