Diffstat (limited to 'notes/bulk_edits/2020-03-19_arxiv_pubmed.md')
-rw-r--r-- | notes/bulk_edits/2020-03-19_arxiv_pubmed.md | 57 |
1 file changed, 0 insertions, 57 deletions
diff --git a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
deleted file mode 100644
index 56e88880..00000000
--- a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
+++ /dev/null
@@ -1,57 +0,0 @@

On 2020-03-20, automated daily harvesting and importing of arXiv and PubMed
metadata started. In the case of PubMed, updates are enabled, so that recently
created DOI releases get updated with a PMID and extra metadata.

We also want to do a final backfill of metadata, covering the span from the
last bulk import up through the first day handled by the continuous harvester.

## arXiv

The previous date span was 2019-05-22 through 2019-12-20. This time we should
do 2019-12-20 through today.

First, run a metha update from the last harvest through today, then grab the
new daily files:

    metha-sync -format arXivRaw http://export.arxiv.org/oai2

    mkdir arxiv_20191220_20200319
    cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
    tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
    gzip arxiv_20191220_20200319.tar

Then copy the tarball to the fatcat server and run the import:

    export FATCAT_AUTH_WORKER_ARXIV=...

    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
    => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

This ran fairly quickly; only some ~80-90k entities to process.

## PubMed

First, mirror the update files from the NCBI FTP server, e.g. via lftp:

    mkdir -p /srv/fatcat/datasets/pubmed_updates
    lftp -e 'mirror -c /pubmed/updatefiles /srv/fatcat/datasets/pubmed_updates; bye' ftp://ftp.ncbi.nlm.nih.gov

Inspect the dates already completed by the continuous harvester, from Kafka:

    kafkacat -b $KAFKA_BROKER -t fatcat-prod.ftp-pubmed-state -C

Show creation dates and the corresponding update files:

    find /srv/fatcat/datasets/pubmed_updates -name "*html" | xargs cat | grep "Created" | sort

For this bulk import, we used files pubmed20n1016.xml.gz (2019-12-16) up to
pubmed20n1110.xml.gz (2020-03-06).

To import the corresponding files, run:

    printf "%s\n" /srv/fatcat/datasets/pubmed_updates/pubmed20n{1016..1110}.xml.gz | shuf | \
        parallel -j16 'gunzip -c {} | ./fatcat_import.py pubmed --do-updates - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt'

The import took 254 minutes; there were 1,715,427 PubmedArticle docs in these
update files.
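
As a small addendum, the "show creation dates and corresponding files" step
above can also be scripted. The following is a minimal, hypothetical sketch:
it assumes each `pubmed20nNNNN.xml.gz` update file ships with an accompanying
`*.html` stats page whose text contains a "Created" line (the same assumption
the `grep "Created"` one-liner makes); the exact HTML wording is not
documented here, so the regex is deliberately loose.

    #!/usr/bin/env python3
    """Map PubMed update stats pages to the "Created" dates they report.

    Minimal sketch only; the *.html layout and "Created" wording are
    assumptions carried over from the grep one-liner in this note.
    """
    import glob
    import re

    UPDATE_DIR = "/srv/fatcat/datasets/pubmed_updates"

    for html_path in sorted(glob.glob(f"{UPDATE_DIR}/*.html")):
        with open(html_path, errors="replace") as f:
            text = f.read()
        match = re.search(r"Created[^<\n]*", text)
        if match:
            # prints something like "<path>: Created: <date>"
            print(f"{html_path}: {match.group(0).strip()}")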
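
As a rough cross-check of the final figure, a sketch like the one below (not
part of fatcat_import.py) could re-count PubmedArticle records directly from
the update files. The path and the 1016..1110 file range are the ones used
above; `iterparse` is used only to keep memory bounded on large files.

    #!/usr/bin/env python3
    """Count PubmedArticle records across the update files used above.

    Minimal sketch to cross-check the 1,715,427 figure; adjust the file
    range for other backfill spans.
    """
    import gzip
    import xml.etree.ElementTree as ET

    UPDATE_DIR = "/srv/fatcat/datasets/pubmed_updates"

    total = 0
    for n in range(1016, 1111):
        path = f"{UPDATE_DIR}/pubmed20n{n}.xml.gz"
        with gzip.open(path, "rb") as f:
            # stream-parse; clear each article element to keep memory flat
            for _, elem in ET.iterparse(f, events=("end",)):
                if elem.tag == "PubmedArticle":
                    total += 1
                    elem.clear()
    print(total)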