author | Martin Czygan <martin.czygan@gmail.com> | 2021-03-21 00:36:54 +0100 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-03-21 00:36:54 +0100 |
commit | e00e979a8b144231ce16aafe6b8482e4104f5e37 (patch) | |
tree | 942af1fbb0eeb71625438a2aaa0b1d783b84db0e /python/README.md | |
parent | c8d9268759f7da1e050658e135fac0c8f0b6fc53 (diff) | |
initial import of python tasks
Diffstat (limited to 'python/README.md')
-rw-r--r-- | python/README.md | 68 |
1 files changed, 68 insertions, 0 deletions
diff --git a/python/README.md b/python/README.md
new file mode 100644
index 0000000..81db0b0
--- /dev/null
+++ b/python/README.md
@@ -0,0 +1,68 @@
+# refcat (wip)
+
+Citation graph related tasks.
+
+* compagnon repository: [skate](https://github.com/miku/skate)
+
+Objective: Given data about
+[releases](https://guide.fatcat.wiki/entity_release.html) and references derive
+various artifacts, e.g.:
+
+* a citation graph; nodes are releases and an edge is a citation (currently, this graph has about 50M nodes and 870M edges)
+* a list of referenced entities, like ISSN (container), ISBN (book), URL (webpage), datasets (by URL, DOI, name, ...)
+
+## Ongoing Notes
+
+* [notes/version_0.md](version 0) (id only)
+* [notes/version_1.md](version 1) (id plus title)
+* [notes/version_2.md](version 2) (v1, full schema)
+
+## Deployment
+
+We are testing a zipapp based deployment (20s for packaging into a 10MB zip
+file, and copying to target).
+
+Caveat: The development machine needs the same python version (e.g. 3.7) as the
+target, e.g. for native dependencies. It is relatively easy to have multiple
+versions of Python available with [pyenv](https://github.com/pyenv/pyenv).
+
+```
+$ make refcat.pyz && rsync -avP refcat.pyz user@host:/usr/local/bin
+```
+
+On the target you can call (first run will be slower, e.g. 4s, subsequent runs
+at around 1s startup time).
+
+```
+$ refcat.pyz
+
+
+              ____           __
+   ________  / __/________ _/ /_
+  / ___/ _ \/ /_/ ___/ __ `/ __/
+ / /  /  __/ __/ /__/ /_/ / /_
+/_/   \___/_/  \___/\__,_/\__/
+
+Command line entry point for running various data tasks.
+
+General usage:
+
+    $ refcat TASK
+
+BASE: /bigger/.cache
+
+BiblioRef          KeyDistribution          RefsFatcatSortedKeys
+BiblioRefFromJoin  RefCounter               RefsFatcatTitleLowerJoin
+BiblioRefFuzzy     Refcat                   RefsKeyStats
+CommonDOIs         RefsArxiv                RefsPMCID
+CommonTitles       RefsDOIs                 RefsPMID
+CommonTitlesLower  RefsDOIsLower            RefsReleasesMerged
+FatcatArxiv        RefsFatcatArxivJoin      RefsTitleFrequency
+FatcatDOIs         RefsFatcatClusterVerify  RefsTitles
+FatcatDOIsLower    RefsFatcatClusters       RefsTitlesLower
+FatcatPMCID        RefsFatcatDOIJoin        RefsToRelease
+FatcatPMID         RefsFatcatGroupJoin      ReleaseExportExpanded
+FatcatTitles       RefsFatcatPMCIDJoin      URLList
+FatcatTitlesLower  RefsFatcatPMIDJoin       URLTabs
+Input              RefsFatcatRanked
+```
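
Note on the Deployment section of the added README: the diff does not show what the `make refcat.pyz` target runs, so the following is only a minimal sketch of how a zipapp can be built with Python's standard `zipapp` module. The `app` directory, the `hello.pyz` name, and the `build_pyz` and `demo` helpers are hypothetical and not taken from the repository. The sketch also does not bundle third-party or native dependencies, which is where the "same python version" caveat would apply.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_pyz(source_dir: Path, target: Path) -> Path:
    """Package source_dir (must contain __main__.py) into a single .pyz file."""
    zipapp.create_archive(
        source_dir,
        target=target,
        # Shebang line written at the start of the archive, so the file can be
        # executed directly on POSIX systems once it is on the PATH.
        interpreter="/usr/bin/env python3",
    )
    return target


def demo() -> str:
    """Build a throwaway zipapp in a temp directory, run it, return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        app = Path(tmp) / "app"  # hypothetical application directory
        app.mkdir()
        (app / "__main__.py").write_text('print("hello from pyz")\n')
        pyz = build_pyz(app, Path(tmp) / "hello.pyz")
        # Run the archive with the current interpreter rather than via the
        # shebang, so the demo does not depend on what "python3" resolves to.
        result = subprocess.run(
            [sys.executable, str(pyz)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(demo())
```

The resulting file is a single artifact, so copying it to the target (as the README does with `rsync`) is the whole deployment step under these assumptions.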