Run the imports in this order:

- ISSN
- ORCID
- Crossref
- Manifest
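
When scripting the sequence above, it can help to stop at the first step that fails. The following is a minimal sketch, not part of this repository: `run_in_order` and the `import_*` function names are hypothetical placeholders, and the real commands for each step are the ones given in the sections of this document.

```shell
# Hypothetical helper: run each named step in sequence and stop at the
# first one that fails.
run_in_order() {
    for step in "$@"; do
        echo "== running: $step"
        if ! "$step"; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
}

# Placeholder steps; replace each body with the real import command.
import_issn()     { echo "ISSN import would run here"; }
import_orcid()    { echo "ORCID import would run here"; }
import_crossref() { echo "Crossref import would run here"; }
import_manifest() { echo "Manifest import would run here"; }

run_in_order import_issn import_orcid import_crossref import_manifest
```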

Encoding problems are common; always run `export LC_ALL=C.UTF-8` before importing.

Start off with:

    sudo su webcrawl
    cd /srv/fatcat/src/python
    export LC_ALL=C.UTF-8
    pipenv shell
    export LC_ALL=C.UTF-8
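
Since a missing locale export is easy to overlook, a small guard can be run before each import. This is a sketch under assumptions: `is_utf8_locale` is a hypothetical helper, not something shipped with fatcat, and it only inspects the name in `LC_ALL`; it does not verify that the locale is actually installed on the machine.

```shell
# Hypothetical guard: succeed only if the given locale name carries a
# UTF-8 codeset suffix (for example C.UTF-8 or en_US.utf8).
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.utf8) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale "${LC_ALL:-}"; then
    echo "locale ok: $LC_ALL"
else
    echo "LC_ALL is not a UTF-8 locale; run: export LC_ALL=C.UTF-8" >&2
fi
```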

## Data Sources

Download the following. Uncompress the ORCID profile dump and the sqlite files
(the import commands below read them uncompressed), but **do not** uncompress
the others:

    cd /srv/fatcat/datasets
    wget https://archive.org/download/crossref_doi_dump_201809/crossref-works.2018-09-05.json.xz
    wget https://archive.org/download/ia_papers_manifest_2018-01-25/index/idents_files_urls.sqlite.gz
    wget https://archive.org/download/ia_journal_metadata_explore_2018-04-05/journal_extra_metadata.csv
    wget https://archive.org/download/issn_issnl_mappings/20180216.ISSN-to-ISSN-L.txt
    wget https://archive.org/download/orcid-dump-2017/public_profiles_1_2_json.all.json.gz
    wget https://archive.org/download/ia_journal_pid_map_munge_20180908/release_ids.ia_munge_20180908.sqlite3.gz
    wget https://archive.org/download/ia_test_paper_matches/2018-08-27-2352.17-matchcrossref.insertable.json.gz
    wget https://archive.org/download/ia_papers_manifest_2018-01-25_matched/ia_papers_manifest_2018-01-25.matched.json.gz

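    # the ORCID import reads this file uncompressed (parallel --pipepart needs a seekable file)
    gunzip public_profiles_1_2_json.all.json.gz
    # the Crossref import reads the uncompressed sqlite3 file via --extid-map-file
    gunzip release_ids.ia_munge_20180908.sqlite3.gz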
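
Before starting a long import, it may be worth confirming that the files it reads are in place. The sketch below is illustrative only: `missing_datasets` and the `DATASET_DIR` variable are hypothetical names, and the file list simply repeats the paths used by the ISSN, ORCID, and Crossref commands in this document.

```shell
# Hypothetical helper: print every expected file that is missing or empty
# under the given directory; return non-zero if any are.
missing_datasets() {
    dir="$1"; shift
    status=0
    for name in "$@"; do
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name"
            status=1
        fi
    done
    return "$status"
}

# File names as referenced by the import commands in this document.
missing_datasets "${DATASET_DIR:-/srv/fatcat/datasets}" \
    journal_extra_metadata.csv \
    public_profiles_1_2_json.all.json \
    crossref-works.2018-09-05.json.xz \
    20180216.ISSN-to-ISSN-L.txt \
    release_ids.ia_munge_20180908.sqlite3 \
    || echo "fetch or uncompress the files listed above before importing" >&2
```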

## ISSN

From CSV file:

    # See "start off with" command above
    time ./fatcat_import.py issn /srv/fatcat/datasets/journal_extra_metadata.csv

Usually takes a couple of minutes at most on a fast production machine.

## ORCID

Usually takes tens of minutes on a fast production machine.

    time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -

## Crossref

Usually takes 24 hours or so on a fast production machine.

    time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20180216.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3

## JALC

First, import a random subset single-threaded to create (most) containers. On a
fast machine, this takes a couple of minutes.

    time ./fatcat_import.py jalc /srv/fatcat/datasets/JALC-LOD-20180907.sample10k.rdf /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3

Then, in parallel:

    zcat /srv/fatcat/datasets/JALC-LOD-20180907.gz | pv -l | time parallel -j20 --round-robin --pipe ./fatcat_import.py jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
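
How the random subset used in the first step was produced is not documented here. One possible way, assuming a dump with one record per line and GNU `shuf` available, is sketched below; `sample_lines` is a hypothetical helper and the file names in the comment are placeholders.

```shell
# Hypothetical helper: emit N randomly chosen lines from stdin (GNU shuf).
sample_lines() {
    shuf -n "$1"
}

# Demo on generated input. For a real line-oriented dump, something like:
#   zcat dump.gz | sample_lines 10000 > dump.sample10k
seq 1 1000 | sample_lines 5
```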

## JSTOR

The import command looks like:

    fd . /data/jstor/metadata/ | time parallel -j20 --round-robin --pipe ./fatcat_import.py jstor - /data/issn/20190129.ISSN-to-ISSN-L.txt

## arXiv

Single file:

    ./fatcat_import.py arxiv /srv/fatcat/datasets/arxiv_raw_oai_snapshot_2019-05-22/2007-12-31-00000001.xml

Bulk (one file per process):

    # fd patterns are regexes: anchor the extension so only *.xml files match
    fd '\.xml$' /srv/fatcat/datasets/arxiv_raw_oai_snapshot_2019-05-22/ | parallel -j15 ./fatcat_import.py arxiv {}

## PubMed

Run single:

    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2019/pubmed19n0400.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

    real    13m21.756s
    user    9m10.720s
    sys     0m14.100s

Bulk:

    # parsing these big XML files is very memory intensive, so limit parallelism
    # fd patterns are regexes: anchor the extension so only *.xml files match
    fd '\.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2019 | time parallel -j3 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

## Matched

These each take 2-4 hours:

    # No file update for the first import...
    time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched --no-file-updates -

    # ... but do on the second
    zcat /srv/fatcat/datasets/2018-08-27-2352.17-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched -

    # GROBID extracted (release+file)
    time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py grobid-metadata -

## Arabesque Matches

Prep JSON files from sqlite (for parallel import):

    ~/arabesque/arabesque.py dump_json s2_doi.sqlite --only-identifier-hits | pv -l | gzip > s2_doi.json.gz

Run import in parallel:

    export FATCAT_AUTH_WORKER_CRAWL=...
    zcat /srv/fatcat/datasets/s2_doi.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id DIRECT-OA-CRAWL-2019 --no-require-grobid

## Other Matched

    export FATCAT_EDITGROUP_DESCRIPTION="File/DOI matching to user-uploaded pre-1923 and pre-1909 paper corpus on archive.org"
    # use the value of FATCAT_AUTH_WORKER_ARCHIVE_ORG
    export FATCAT_API_AUTH_TOKEN=...

    zcat /srv/fatcat/datasets/crossref-pre-1923-scholarly-works.matched.json.gz | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched - --default-mime 'application/pdf'
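
The authenticated imports above rely on a token being exported first. A pre-flight check along the following lines could catch a forgotten export before a multi-hour run begins. It is only a sketch: `require_env` is a hypothetical helper, not part of `fatcat_import.py`.

```shell
# Hypothetical helper: succeed only if the named variable is exported
# and non-empty.
require_env() {
    if [ -z "$(printenv "$1")" ]; then
        echo "environment variable not set: $1" >&2
        return 1
    fi
}

require_env FATCAT_API_AUTH_TOKEN \
    || echo "export FATCAT_API_AUTH_TOKEN before running the import" >&2
```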