refcat (wip)
Tasks related to the citation graph.
- companion project: skate
Objective: Given data about releases and references, derive various artifacts, e.g.:
- a citation graph; nodes are releases and an edge is a citation (currently, this graph has about 50M nodes and 870M edges)
- a list of referenced entities, like ISSN (container), ISBN (book), URL (webpage), datasets (by URL, DOI, name, ...)
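To illustrate the graph model described above (nodes are releases, an edge is a citation), here is a minimal in-memory sketch. The function name `build_citation_graph` and the release identifiers are hypothetical and not part of refcat; at the scale mentioned above (about 50M nodes and 870M edges) an in-memory structure like this is not practical, so treat it only as a picture of the data shape.

```python
from collections import defaultdict


def build_citation_graph(citations):
    """Build adjacency sets from (citing_id, cited_id) pairs.

    Returns two mappings: outbound (release -> releases it cites)
    and inbound (release -> releases citing it).
    """
    outbound = defaultdict(set)
    inbound = defaultdict(set)
    for source, target in citations:
        outbound[source].add(target)
        inbound[target].add(source)
    return outbound, inbound


# Hypothetical release identifiers, for illustration only.
edges = [
    ("release-a", "release-b"),
    ("release-a", "release-c"),
    ("release-b", "release-c"),
]
outbound, inbound = build_citation_graph(edges)

# Every release that cites or is cited counts as a node.
nodes = set(outbound) | set(inbound)
edge_count = sum(len(targets) for targets in outbound.values())
cited_by = sorted(inbound["release-c"])
print(len(nodes), "nodes,", edge_count, "edges")
print("release-c is cited by:", cited_by)
```

Keeping both directions makes "what does X cite" and "who cites X" equally cheap to answer, at the cost of storing each edge twice.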
Ongoing Notes
- notes/version_0.md (id only)
- notes/version_1.md (id plus title)
- notes/version_2.md (v1, full schema)
- notes/version_3.md (v2, unstructured)
- notes/version_4.md (v3, extra sources, qa)
Deployment
We are testing a zipapp-based deployment (about 20s to package everything into a 10MB zip file and copy it to the target).
Caveat: The development machine needs the same Python version (e.g. 3.8) as the target, for example because of native dependencies. With pyenv it is relatively easy to keep multiple versions of Python available.
$ make refcat.pyz && rsync -avP refcat.pyz user@host:/usr/local/bin
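For readers unfamiliar with zipapps, the following sketch shows the general mechanism using only the standard library `zipapp` module. It is not the project's build step: the `refcat.pyz` Makefile target may use a different tool, and the helper `build_pyz`, the `app` directory and its contents are assumptions made up for this example.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp
import zipfile


def build_pyz(source_dir, target, interpreter="/usr/bin/env python3"):
    """Package a directory containing __main__.py into an executable archive."""
    zipapp.create_archive(source_dir, target=target, interpreter=interpreter)
    return pathlib.Path(target)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical application directory with a trivial entry point.
    app = pathlib.Path(tmp) / "app"
    app.mkdir()
    (app / "__main__.py").write_text("print('hello from zipapp')\n")

    pyz = build_pyz(app, pathlib.Path(tmp) / "app.pyz")

    # The result is a regular zip file with a shebang line prepended.
    is_archive = zipfile.is_zipfile(pyz)
    shebang = zipapp.get_interpreter(pyz)

    # Run the archive with the current interpreter.
    result = subprocess.run(
        [sys.executable, str(pyz)], capture_output=True, text=True, check=True
    )
    output = result.stdout.strip()

print(is_archive, shebang, output)
```

An archive built this way contains only what is in the source directory, so any native dependencies bundled into it must match the Python version on the target, which is the reason for the caveat above.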
On the target you can then call the archive (the first run is slower, e.g. 4s; subsequent runs take about 1s to start up):
$ refcat.pyz
              ____           __
   ________  / __/________ _/ /_
  / ___/ _ \/ /_/ ___/ __ `/ __/
 / /  /  __/ __/ /__/ /_/ / /_
/_/   \___/_/  \___/\__,_/\__/
Command line entry point for running various data tasks.
$ refcat.pyz [COMMAND | TASK] [OPTIONS]
Commands: ls, ll, deps, tasks, files, config, cat, completion
To install completion run:
$ source <(refcat.pyz completion)
VERSION 0.1.3
SETTINGS /home/martin/.config/refcat/settings.ini
BASE /magna/refcat
TMPDIR /sandcrawler-db/tmp-refcat
SHIV_ROOT None
Bref OpenLibraryWorksSorted
BrefCombined Refcat
BrefOpenLibraryZipISBN Refs
BrefSortedByWorkID RefsArxiv
BrefZipArxiv RefsByWorkID
BrefZipDOI RefsDOI
BrefZipFuzzy RefsMapped
BrefZipOpenLibrary RefsPMCID
BrefZipPMCID RefsPMID
BrefZipPMID RefsToRelease
FatcatArxiv RefsWithUnstructured
FatcatDOI RefsWithoutIdentifiers
FatcatMapped ReleaseExportExpanded
FatcatPMCID ReleaseExportReduced
FatcatPMID URLList
MAGPapers URLTabs
OpenLibraryAuthorMapping URLTabsCleaned
OpenLibraryAuthors UnmatchedMapped
OpenLibraryDump UnmatchedOpenLibraryMatchTable
OpenLibraryEditions UnmatchedRefs
OpenLibraryEditionsByWork UnmatchedRefsToRelease
OpenLibraryEditionsMapped UnmatchedResolveJournalNames
OpenLibraryEditionsToRelease UnmatchedResolveJournalNamesMapped
OpenLibraryReleaseMapped WikipediaCitationsMinimalDataset
OpenLibraryWorks
Dependencies
TODO
- [ ] wrap up refcat