
On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed
metadata started. In the case of pubmed, updates are enabled, so that recently
created DOI releases get updated with PMID and extra metadata.

We also want to do a final backfill of metadata, from the last import up
through the first day covered by the continuous harvester.


## arxiv

The previous date span was 2019-05-22 through 2019-12-20. This time we should
do 2019-12-20 through today.

First do a metha update from the last harvest through today, and grab the new daily files:

    metha-sync -format arXivRaw http://export.arxiv.org/oai2

    mkdir arxiv_20191220_20200319
    cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
    tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
    gzip arxiv_20191220_20200319.tar
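
The glob patterns above work because the daily files are named with a leading ISO date. As a rough alternative sketch, the same selection can be written as an explicit date-range filter. The function name `files_in_range` is hypothetical, and it assumes every daily file name starts with `YYYY-MM-DD` (as in `2019-12-31-00000000.xml`):

```shell
# Hypothetical helper: list files in a directory whose names start with a
# YYYY-MM-DD date between START and END (inclusive).
# ISO dates sort lexicographically, so a plain string comparison is enough.
files_in_range() {
    # $1 = start date, $2 = end date, $3 = directory to scan
    ls "$3" | awk -v s="$1" -v e="$2" '
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
            d = substr($0, 1, 10)          # leading date part of the file name
            if (d >= s && d <= e) print
        }'
}
```

For this backfill the call would be something like `files_in_range 2019-12-20 2020-03-19 .`, with the output fed to `cp` in place of the three globs.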

Then copy to fatcat server and run import:

    export FATCAT_AUTH_WORKER_ARXIV=...

    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
    => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

Ran fairly quickly; only some ~80-90k entities to process.
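
The parallel run prints one `Counter({...})` summary per file. A minimal sketch for totalling one field across all of them follows. It assumes the importer output was redirected to a log file, which the commands above do not do, and the function name `sum_counter_field` is hypothetical:

```shell
# Hypothetical helper: sum one field (e.g. insert) over all Counter({...})
# summary lines found in a log file.
sum_counter_field() {
    # $1 = field name, $2 = log file
    # grep -o emits each "'field': N" match on its own line; awk adds up N.
    grep -o "'$1': [0-9]*" "$2" | awk '{ total += $NF } END { print total + 0 }'
}
```

For example, `sum_counter_field insert import.log` would print the total number of inserted entities, where `import.log` is an assumed file name.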

## PubMed

TODO: martin will import daily update files from the 2020 baseline through XYZ date.