# refcat (wip)

Tasks for building and analyzing a citation graph.

* companion project: [skate](https://git.archive.org/martin/cgraph/-/tree/master/skate)

Objective: Given data about
[releases](https://guide.fatcat.wiki/entity_release.html) and references,
derive various artifacts, e.g.:

* a citation graph; nodes are releases and an edge is a citation (currently,
  this graph has about 50M nodes and 870M edges)
* a list of referenced entities, like ISSN (container), ISBN (book), URL
  (webpage), datasets (by URL, DOI, name, ...)
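
To make the graph model above concrete, here is a minimal in-memory sketch: releases are nodes and each (citing, cited) pair is a directed edge. The function names and the release identifiers are hypothetical, and this is only an illustration of the data model, not how refcat derives the graph at the scale mentioned above.

```python
from collections import defaultdict


def build_citation_graph(citations):
    """Build an adjacency mapping from (citing, cited) release identifier pairs."""
    graph = defaultdict(set)
    for source, target in citations:
        graph[source].add(target)
        # Releases that are only cited should still appear as nodes.
        graph.setdefault(target, set())
    return graph


def graph_size(graph):
    """Return (number of nodes, number of edges)."""
    nodes = len(graph)
    edges = sum(len(targets) for targets in graph.values())
    return nodes, edges


# Hypothetical release identifiers, for illustration only.
example = [
    ("release-a", "release-b"),
    ("release-a", "release-c"),
    ("release-b", "release-c"),
    ("release-a", "release-b"),  # duplicate citation, collapsed by the set
]
graph = build_citation_graph(example)
print(graph_size(graph))
```

Storing targets in a set means repeated citations between the same two releases count as a single edge; whether that is desirable depends on the artifact being derived.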

## Ongoing Notes

* [version 0](notes/version_0.md) (id only)
* [version 1](notes/version_1.md) (id plus title)
* [version 2](notes/version_2.md) (v1, full schema)
* [version 3](notes/version_3.md) (v2, unstructured)

## Deployment

We are testing a zipapp-based deployment (packaging into a 10MB zip file and
copying it to the target takes about 20s).

Caveat: The development machine needs the same Python version (e.g. 3.7) as the
target, for example because of native dependencies. It is relatively easy to
have multiple versions of Python available with
[pyenv](https://github.com/pyenv/pyenv).
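
The caveat above can be checked before packaging. The following is a small sketch, not part of refcat; the helper name `same_python_version` is hypothetical, and `(3, 7)` only mirrors the example version from the text.

```python
import sys


def same_python_version(target, current=None):
    """Return True if the major and minor version match the target's."""
    if current is None:
        current = sys.version_info[:2]
    return tuple(current[:2]) == tuple(target[:2])


# (3, 7) is the example version from the text; adjust it to the target host.
if not same_python_version((3, 7)):
    print("warning: local Python %d.%d differs from target 3.7" % sys.version_info[:2])
```

Only the major and minor components are compared, on the assumption that patch releases do not affect compatibility of native dependencies.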

```
$ make refcat.pyz && rsync -avP refcat.pyz user@host:/usr/local/bin
```
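
For readers unfamiliar with zipapps, the sketch below shows the general mechanism using only the standard library `zipapp` module: a directory with a `__main__.py` is packed into a single runnable `.pyz` file. What `make refcat.pyz` does internally is not described here and may rely on a different tool, since plain `zipapp` does not install or bundle dependencies by itself. The `build_pyz` helper and the demo application are hypothetical.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp


def build_pyz(source_dir, target):
    """Package a directory containing a __main__.py into a runnable .pyz archive."""
    zipapp.create_archive(source_dir, target=target, interpreter="/usr/bin/env python3")
    return target


with tempfile.TemporaryDirectory() as tmp:
    # A throwaway application directory, for illustration only.
    app = pathlib.Path(tmp) / "app"
    app.mkdir()
    (app / "__main__.py").write_text('print("hello from pyz")\n')

    pyz = build_pyz(app, pathlib.Path(tmp) / "demo.pyz")

    # Run the archive with the current interpreter and capture its output.
    result = subprocess.run(
        [sys.executable, str(pyz)], capture_output=True, text=True, check=True
    )
    output = result.stdout.strip()
    print(output)
```

The `interpreter` argument writes a shebang line into the archive, which is what allows the file to be called directly on the target once it is executable.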

On the target, you can then run the archive. The first run is slower (e.g. 4s);
subsequent runs start in around 1s.

```
$ refcat.pyz


              ____           __
   ________  / __/________ _/ /_
  / ___/ _ \/ /_/ ___/ __ `/ __/
 / /  /  __/ __/ /__/ /_/ / /_
/_/   \___/_/  \___/\__,_/\__/

Command line entry point for running various data tasks.

General usage:

    $ refcat TASK

BASE: /bigger/.cache

BiblioRef                 KeyDistribution           RefsFatcatSortedKeys
BiblioRefFromJoin         RefCounter                RefsFatcatTitleLowerJoin
BiblioRefFuzzy            Refcat                    RefsKeyStats
CommonDOIs                RefsArxiv                 RefsPMCID
CommonTitles              RefsDOIs                  RefsPMID
CommonTitlesLower         RefsDOIsLower             RefsReleasesMerged
FatcatArxiv               RefsFatcatArxivJoin       RefsTitleFrequency
FatcatDOIs                RefsFatcatClusterVerify   RefsTitles
FatcatDOIsLower           RefsFatcatClusters        RefsTitlesLower
FatcatPMCID               RefsFatcatDOIJoin         RefsToRelease
FatcatPMID                RefsFatcatGroupJoin       ReleaseExportExpanded
FatcatTitles              RefsFatcatPMCIDJoin       URLList
FatcatTitlesLower         RefsFatcatPMIDJoin        URLTabs
Input                     RefsFatcatRanked
```