summaryrefslogtreecommitdiffstats
path: root/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
blob: 56e888800d5d262ecd25ce16bc5855e4bb6318e7 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57

On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed
metadata started. In the case of pubmed, updates are enabled, so that recently
created DOI releases get updated with PMID and extra metadata.

We also want to do last backfills of metadata since the last import up through
the first day updated by the continuous harvester.


## arxiv

The previous date span was 2019-05-22 through 2019-12-20. This time we should
do 2019-12-20 through today.

First do metha update from last harvest through today, and grab the new daily files:

    metha-sync -format arXivRaw http://export.arxiv.org/oai2

    mkdir arxiv_20191220_20200319
    cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
    tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
    gzip arxiv_20191220_20200319.tar

Then copy to fatcat server and run import:

    export FATCAT_AUTH_WORKER_ARXIV=...

    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
    => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

Ran fairly quickly only some ~80-90k entities to process.

## PubMed

First, mirror update files from FTP, e.g. via lftp:

    mkdir -p /srv/fatcat/datasets/pubmed_updates
    lftp -e 'mirror -c /pubmed/updatefiles /srv/fatcat/datasets/pubmed_updates; bye' ftp://ftp.ncbi.nlm.nih.gov

Inspect completed dates from kafka:

    kafkacat -b $KAFKA_BROKER -t fatcat-prod.ftp-pubmed-state -C

Show dates and corresponding files:

    find /srv/fatcat/datasets/pubmed_updates -name "*html" | xargs cat | grep "Created" | sort

For this bulk import, we used files pubmed20n1016.xml.gz (2019-12-16) up to pubmed20n1110.xml.gz (2020-03-06).

To import the corresponding files, run:

    printf "%s\n" /srv/fatcat/datasets/pubmed_updates/pubmed20n{1016..1110}.xml.gz | shuf | \
        parallel -j16 'gunzip -c {} | ./fatcat_import.py pubmed --do-updates - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt'

Import took 254 min, there were 1715427 PubmedArticle docs in these update files.