# refcat (wip)
Citation graph related tasks.
* companion project: [skate](https://git.archive.org/martin/cgraph/-/tree/master/skate)
Objective: Given data about
[releases](https://guide.fatcat.wiki/entity_release.html) and references, derive
various artifacts, e.g.:
* a citation graph; nodes are releases and an edge is a citation (currently,
this graph has about 50M nodes and 870M edges)
* a list of referenced entities, like ISSN (container), ISBN (book), URL
(webpage), datasets (by URL, DOI, name, ...)
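
As a rough illustration of the graph described above, the following sketch builds inbound and outbound adjacency sets from citation pairs. The release identifiers and the `build_graph` helper are hypothetical and not part of refcat; the actual pipeline works on far larger, file-based inputs.

```python
from collections import defaultdict

# Hypothetical (citing, cited) release identifier pairs; one pair per citation.
edges = [
    ("release-a", "release-b"),
    ("release-a", "release-c"),
    ("release-d", "release-b"),
]


def build_graph(pairs):
    """Return (outbound, inbound) adjacency sets for a list of citation pairs."""
    outbound = defaultdict(set)  # release -> releases it cites
    inbound = defaultdict(set)   # release -> releases citing it
    for source, target in pairs:
        outbound[source].add(target)
        inbound[target].add(source)
    return outbound, inbound


outbound, inbound = build_graph(edges)
# Every release that appears on either side of a citation is a node.
nodes = set(outbound) | set(inbound)
edge_count = sum(len(targets) for targets in outbound.values())
print(len(nodes), edge_count)
```

Keeping both directions makes "what does this release cite" and "what cites this release" equally cheap to answer, at the cost of storing each edge twice.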
## Ongoing Notes
* [version 0](notes/version_0.md) (id only)
* [version 1](notes/version_1.md) (id plus title)
* [version 2](notes/version_2.md) (v1, full schema)
* [version 3](notes/version_3.md) (v2, unstructured)
* [version 4](notes/version_4.md) (v3, extra sources, qa)
## Deployment
We are testing a zipapp-based deployment (about 20s to package everything into
a 10MB zip file and copy it to the target).
Caveat: The development machine needs the same Python version (e.g. 3.8) as the
target, e.g. because of native dependencies. It is relatively easy to keep
multiple versions of Python available with [pyenv](https://github.com/pyenv/pyenv).
```
$ make refcat.pyz && rsync -avP refcat.pyz user@host:/usr/local/bin
```
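
To illustrate what a zipapp is, here is a minimal sketch using the standard library `zipapp` module. It is not how `make refcat.pyz` builds the real archive (the `SHIV_ROOT` setting suggests a different tool is involved); the `build_pyz` helper and the toy `__main__.py` are assumptions for demonstration only.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp


def build_pyz(target: pathlib.Path) -> pathlib.Path:
    """Package a tiny, hypothetical app into a single executable zip file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "app"
        src.mkdir()
        # __main__.py is the entry point Python runs when given the archive.
        (src / "__main__.py").write_text(
            "import sys\n"
            "print('python %d.%d' % sys.version_info[:2])\n"
        )
        # The interpreter string becomes the shebang line of the archive.
        zipapp.create_archive(src, target=target, interpreter="/usr/bin/env python3")
    return target


if __name__ == "__main__":
    out = build_pyz(pathlib.Path(tempfile.mkdtemp()) / "demo.pyz")
    # Run the archive with the current interpreter, much as the target host would.
    result = subprocess.run([sys.executable, str(out)], capture_output=True, text=True)
    print(result.stdout.strip())
```

The toy app prints the interpreter version it runs under, which is a quick way to compare the development machine and the target before deploying.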
On the target, you can then run it (the first run is slower, e.g. 4s;
subsequent runs start in around 1s):
```
$ refcat.pyz
              ____           __
   ________  / __/________ _/ /_
  / ___/ _ \/ /_/ ___/ __ `/ __/
 / /  /  __/ __/ /__/ /_/ / /_
/_/   \___/_/  \___/\__,_/\__/
Command line entry point for running various data tasks.
$ refcat.pyz [COMMAND | TASK] [OPTIONS]
Commands: ls, ll, deps, tasks, files, config, cat, completion
To install completion run:
$ source <(refcat.pyz completion)
VERSION 0.1.3
SETTINGS /home/martin/.config/refcat/settings.ini
BASE /magna/refcat
TMPDIR /sandcrawler-db/tmp-refcat
SHIV_ROOT None
Bref OpenLibraryWorksSorted
BrefCombined Refcat
BrefOpenLibraryZipISBN Refs
BrefSortedByWorkID RefsArxiv
BrefZipArxiv RefsByWorkID
BrefZipDOI RefsDOI
BrefZipFuzzy RefsMapped
BrefZipOpenLibrary RefsPMCID
BrefZipPMCID RefsPMID
BrefZipPMID RefsToRelease
FatcatArxiv RefsWithUnstructured
FatcatDOI RefsWithoutIdentifiers
FatcatMapped ReleaseExportExpanded
FatcatPMCID ReleaseExportReduced
FatcatPMID URLList
MAGPapers URLTabs
OpenLibraryAuthorMapping URLTabsCleaned
OpenLibraryAuthors UnmatchedMapped
OpenLibraryDump UnmatchedOpenLibraryMatchTable
OpenLibraryEditions UnmatchedRefs
OpenLibraryEditionsByWork UnmatchedRefsToRelease
OpenLibraryEditionsMapped UnmatchedResolveJournalNames
OpenLibraryEditionsToRelease UnmatchedResolveJournalNamesMapped
OpenLibraryReleaseMapped WikipediaCitationsMinimalDataset
OpenLibraryWorks
```
## Dependencies
![](notes/deps.png)