
Run the imports in this order:

- ISSN
- ORCID
- Crossref
- Matched

Encoding problems are common; always run `export LC_ALL=C.UTF-8` before importing.

Start off with:

    sudo su webcrawl
    cd /srv/fatcat/src/python
    export LC_ALL=C.UTF-8
    pipenv shell
    export LC_ALL=C.UTF-8
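
Because a wrong locale tends to show up only later as encoding errors, a quick guard before starting an import may help. The following is a minimal sketch: `is_utf8_locale` is a hypothetical helper, not part of fatcat, and it only inspects the text of the variable rather than asking the system whether the locale is installed.

```shell
# Hypothetical helper (not part of fatcat): succeeds when the given locale
# name ends in a UTF-8 codeset, e.g. "C.UTF-8" or "en_US.utf8".
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn instead of aborting, so the check is safe to paste into any shell.
if ! is_utf8_locale "${LC_ALL:-}"; then
    echo "warning: LC_ALL is '${LC_ALL:-}', expected C.UTF-8" >&2
fi
```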

## Data Sources

Download the following; uncompress the sqlite files (`.sqlite.gz` and
`.sqlite3.gz`), but **do not** uncompress the others:

    cd /srv/fatcat/datasets
    wget https://archive.org/download/crossref_doi_dump_201809/crossref-works.2018-09-05.json.xz
    wget https://archive.org/download/ia_papers_manifest_2018-01-25/index/idents_files_urls.sqlite.gz
    wget https://archive.org/download/ia_journal_metadata_explore_2018-04-05/journal_extra_metadata.csv
    wget https://archive.org/download/issn_issnl_mappings/20180216.ISSN-to-ISSN-L.txt
    wget https://archive.org/download/orcid-dump-2017/public_profiles_API-2.0_2017_10_json.tar.gz
    wget https://archive.org/download/ia_journal_pid_map_munge_20180908/release_ids.ia_munge_20180908.sqlite3.gz
    wget https://archive.org/download/ia_test_paper_matches/2018-08-27-2352.17-matchcrossref.insertable.json.gz
    wget https://archive.org/download/ia_papers_manifest_2018-01-25_matched/ia_papers_manifest_2018-01-25.matched.json.gz
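
The decompression step can be sketched as below. `decompress_sqlite` is a hypothetical helper, and treating both the `.sqlite.gz` and `.sqlite3.gz` downloads as files to unpack is an assumption based on the Crossref command, which reads `release_ids.ia_munge_20180908.sqlite3` uncompressed.

```shell
# Hypothetical helper: gunzip only the SQLite dumps in a directory and
# leave every other compressed download as it is.
decompress_sqlite() {
    dir="$1"
    for f in "$dir"/*.sqlite.gz "$dir"/*.sqlite3.gz; do
        # An unmatched glob stays a literal pattern; skip it.
        [ -e "$f" ] || continue
        # Skip files that were already unpacked on an earlier run.
        [ -e "${f%.gz}" ] && continue
        gunzip "$f"
    done
}
```

Running `decompress_sqlite /srv/fatcat/datasets` after the downloads finish leaves the `.xz`, `.json.gz`, and `.tar.gz` files untouched.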

## ISSN

From CSV file:

    # See "start off with" command above
    time ./fatcat_import.py issn /srv/fatcat/datasets/journal_extra_metadata.csv

Usually takes a couple of minutes at most on a fast production machine.

## ORCID

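Usually takes tens of minutes on a fast production machine. Note that the input
filename below differs from the ORCID tarball listed under Data Sources.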

    time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -

## Crossref

Usually takes 24 hours or so on a fast production machine.

    time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20180216.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
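
Since this import runs for about a day, it may be worth confirming that all three input files are readable before starting. A minimal sketch, where `require_files` is a hypothetical helper and not part of fatcat:

```shell
# Hypothetical helper: print each unreadable path to stderr and return
# non-zero if at least one of the given files is missing.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_files /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz /srv/fatcat/datasets/20180216.ISSN-to-ISSN-L.txt /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3` reports any missing path and returns non-zero, so it can be chained with `&&` in front of the import command.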

## Matched

Unknown speed!

    # No file update for the first import...
    zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched --no-file-update -

    # ... but do on the second
    zcat /srv/fatcat/datasets/2018-08-27-2352.17-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched -

    # GROBID extracted (release+file)
    time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py grobid-metadata -