# refcat (wip)

Citation graph related tasks.

* companion project: [skate](https://git.archive.org/martin/cgraph/-/tree/master/skate)

Objective: Given data about
[releases](https://guide.fatcat.wiki/entity_release.html) and references, derive
various artifacts, e.g.:

* a citation graph; nodes are releases and an edge is a citation (currently,
  this graph has about 50M nodes and 870M edges)
* a list of referenced entities, like ISSN (container), ISBN (book), URL
  (webpage), datasets (by URL, DOI, name, ...)
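
To make the graph model above concrete, here is a minimal in-memory sketch: nodes are release identifiers and a directed edge points from the citing release to the cited one. The function names and the identifiers (`release-a`, ...) are hypothetical and only for illustration; this is not how refcat derives the graph at scale, just a toy version of the data structure.

```python
def build_citation_graph(edges):
    """Build adjacency sets from (citing, cited) pairs of release identifiers."""
    graph = {}
    for source, target in edges:
        # Using a set per node collapses duplicate citations into one edge.
        graph.setdefault(source, set()).add(target)
        # Make sure releases that are only cited still appear as nodes.
        graph.setdefault(target, set())
    return graph


def inbound_counts(graph):
    """Count, for each release, how many distinct releases cite it."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


# Hypothetical identifiers; the last pair repeats the first on purpose.
edges = [
    ("release-a", "release-b"),
    ("release-a", "release-c"),
    ("release-b", "release-c"),
    ("release-a", "release-b"),
]

graph = build_citation_graph(edges)
counts = inbound_counts(graph)
num_nodes = len(graph)
num_edges = sum(len(targets) for targets in graph.values())
```

At the sizes mentioned above (tens of millions of nodes, hundreds of millions of edges), an in-memory dictionary like this is unlikely to be practical; the sketch only illustrates what "node" and "edge" mean here.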

## Ongoing Notes

* [version 0](notes/version_0.md) (id only)
* [version 1](notes/version_1.md) (id plus title)
* [version 2](notes/version_2.md) (v1, full schema)
* [version 3](notes/version_3.md) (v2, unstructured)
* [version 4](notes/version_4.md) (v3, extra sources, qa)

## Deployment

We are testing a zipapp-based deployment (about 20s to package everything into
a 10MB zip file and copy it to the target).

Caveat: The development machine needs the same Python version (e.g. 3.8) as the
target, e.g. because of native dependencies. It is relatively easy to have
multiple versions of Python available with [pyenv](https://github.com/pyenv/pyenv).

```
$ make refcat.pyz && rsync -avP refcat.pyz user@host:/usr/local/bin
```
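
For readers unfamiliar with zipapps, the following sketch shows the general idea using only the standard library `zipapp` module. It is an assumption-laden stand-in: the `refcat.pyz` Makefile target is not shown here and may use a different tool and layout. The helper `build_pyz`, the `app` directory and `demo.pyz` are hypothetical names. The demo archive prints the Python version it runs under, which is the value that has to match between the development machine and the target.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp


def build_pyz(source_dir, target, interpreter="/usr/bin/env python3"):
    """Package a directory containing __main__.py into an executable zip archive."""
    zipapp.create_archive(source_dir, target=target, interpreter=interpreter)
    return pathlib.Path(target)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical minimal application: a single __main__.py entry point.
    app = pathlib.Path(tmp) / "app"
    app.mkdir()
    (app / "__main__.py").write_text(
        "import sys\nprint('%d.%d' % sys.version_info[:2])\n"
    )

    pyz = build_pyz(app, pathlib.Path(tmp) / "demo.pyz")

    # Run the archive with the current interpreter and capture what it reports.
    result = subprocess.run(
        [sys.executable, str(pyz)], capture_output=True, text=True, check=True
    )
    reported = result.stdout.strip()
```

Comparing `reported` on the development machine with the same output on the target is one simple way to check the version caveat before copying the archive over.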

On the target you can then run the archive (the first run will be slower, e.g.
4s; subsequent runs start in around 1s):

```
$ refcat.pyz


              ____           __
   ________  / __/________ _/ /_
  / ___/ _ \/ /_/ ___/ __ `/ __/
 / /  /  __/ __/ /__/ /_/ / /_
/_/   \___/_/  \___/\__,_/\__/

Command line entry point for running various data tasks.

    $ refcat.pyz [COMMAND | TASK] [OPTIONS]

Commands: ls, ll, deps, tasks, files, config, cat, completion

To install completion run:

    $ source <(refcat.pyz completion)

VERSION   0.1.3
SETTINGS  /home/martin/.config/refcat/settings.ini
BASE      /magna/refcat
TMPDIR    /sandcrawler-db/tmp-refcat
SHIV_ROOT None

Bref                                OpenLibraryWorksSorted
BrefCombined                        Refcat
BrefOpenLibraryZipISBN              Refs
BrefSortedByWorkID                  RefsArxiv
BrefZipArxiv                        RefsByWorkID
BrefZipDOI                          RefsDOI
BrefZipFuzzy                        RefsMapped
BrefZipOpenLibrary                  RefsPMCID
BrefZipPMCID                        RefsPMID
BrefZipPMID                         RefsToRelease
FatcatArxiv                         RefsWithUnstructured
FatcatDOI                           RefsWithoutIdentifiers
FatcatMapped                        ReleaseExportExpanded
FatcatPMCID                         ReleaseExportReduced
FatcatPMID                          URLList
MAGPapers                           URLTabs
OpenLibraryAuthorMapping            URLTabsCleaned
OpenLibraryAuthors                  UnmatchedMapped
OpenLibraryDump                     UnmatchedOpenLibraryMatchTable
OpenLibraryEditions                 UnmatchedRefs
OpenLibraryEditionsByWork           UnmatchedRefsToRelease
OpenLibraryEditionsMapped           UnmatchedResolveJournalNames
OpenLibraryEditionsToRelease        UnmatchedResolveJournalNamesMapped
OpenLibraryReleaseMapped            WikipediaCitationsMinimalDataset
OpenLibraryWorks
```

## Dependencies

![](notes/deps.png)


## TODO

* [ ] wrap up refcat