On 2020-03-20, automated daily harvesting and importing of arXiv and PubMed
metadata started. In the case of PubMed, updates are enabled, so that recently
created DOI releases get updated with PMID and extra metadata.

We also want to do a final backfill of metadata for each source, covering the
period from the last bulk import up through the first day handled by the
continuous harvester.

## arXiv

The previous date span was 2019-05-22 through 2019-12-20. This time we should
do 2019-12-20 through today.

First, run a metha update covering the last harvest through today, then grab
the new daily files:

    metha-sync -format arXivRaw http://export.arxiv.org/oai2
    mkdir arxiv_20191220_20200319
    cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
    tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
    gzip arxiv_20191220_20200319.tar

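The glob patterns above select the daily files by hand. A minimal sketch of selecting them by an explicit date range instead, assuming the daily files are named with a `YYYY-MM-DD` prefix (as the `2019-12-31-00000000.xml` file name in these notes suggests). The function name `files_in_range` is hypothetical and not part of metha:

```shell
# Hypothetical helper: list .xml files in DIR whose name starts with a
# YYYY-MM-DD date between START and END (both inclusive).
# Assumes simple file names (no spaces or newlines).
files_in_range() {
    ls "$1" | awk -v s="$2" -v e="$3" '
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-.*\.xml$/ {
            d = substr($0, 1, 10)   # the date prefix of the file name
            # ISO dates sort lexicographically, so string comparison works
            if (d >= s && d <= e) print
        }'
}
```

For example, `files_in_range . 2019-12-20 2020-03-19` would print the same set of files the three glob patterns are meant to match, without depending on which months happen to be present.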
Then copy to the fatcat server and run the import, first on a single file as a test:

    export FATCAT_AUTH_WORKER_ARXIV=...
    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
    => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

Ran fairly quickly; there were only some ~80-90k entities to process.

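When the import runs under `parallel`, each process presumably prints its own `Counter` summary like the one above. A rough sketch for adding them up, assuming the output was captured to a log file (the `import.log` name and the `sum_counters` function are hypothetical) and that every summary has the `'key': number` shape shown in these notes:

```shell
# Hypothetical helper: sum per-key counts across all Counter({...}) lines
# in a captured log file and print one "key total" line per key.
sum_counters() {
    grep -o "'[a-z_-]*': [0-9]*" "$1" |
        tr -d "':" |
        awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' |
        sort
}
```

Running `sum_counters import.log` would then give a single overall tally of inserts, skips, and existing entities for the whole batch.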
## PubMed

First, mirror update files from FTP, e.g. via lftp:

    mkdir -p /srv/fatcat/datasets/pubmed_updates
    lftp -e 'mirror -c /pubmed/updatefiles /srv/fatcat/datasets/pubmed_updates; bye' ftp://ftp.ncbi.nlm.nih.gov

Inspect completed dates from kafka:

    kafkacat -b $KAFKA_BROKER -t fatcat-prod.ftp-pubmed-state -C

Show dates and corresponding files:

    find /srv/fatcat/datasets/pubmed_updates -name "*html" | xargs cat | grep "Created" | sort

For this bulk import, we used files pubmed20n1016.xml.gz (2019-12-16) up to pubmed20n1110.xml.gz (2020-03-06).
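
Before importing, it may be worth confirming that the mirror actually contains every file in that range. A small sketch, assuming the `pubmed20nNNNN.xml.gz` naming used in these notes (the function name `missing_updates` is hypothetical):

```shell
# Hypothetical helper: print the path of every expected update file
# between FIRST and LAST (inclusive) that is absent from DIR.
missing_updates() {
    dir="$1"
    n="$2"
    last="$3"
    while [ "$n" -le "$last" ]; do
        f="$dir/pubmed20n$(printf '%04d' "$n").xml.gz"
        [ -f "$f" ] || echo "$f"
        n=$((n + 1))
    done
}
```

For example, `missing_updates /srv/fatcat/datasets/pubmed_updates 1016 1110` prints nothing when all 95 files are present.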

To import the corresponding files, run:

    printf "%s\n" /srv/fatcat/datasets/pubmed_updates/pubmed20n{1016..1110}.xml.gz | shuf | \
        parallel -j16 'gunzip -c {} | ./fatcat_import.py pubmed --do-updates - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt'

Import took 254 min; there were 1715427 PubmedArticle docs in these update files.
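
As a rough throughput figure for planning future backfills, the two numbers above can be divided out. This is plain integer arithmetic on the figures reported in these notes, so results are rounded down:

```shell
docs=1715427   # PubmedArticle docs reported for this import
minutes=254    # wall-clock duration reported for this import

# Shell arithmetic is integer-only, so these are floor values.
echo "$((docs / minutes)) docs/min"          # prints "6753 docs/min"
echo "$((docs / (minutes * 60))) docs/s"     # prints "112 docs/s"
```

That is the aggregate rate across all 16 parallel workers, not the rate of a single import process.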