move notes/bulk_edits/ to extra/bulk_edits/

author: Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
commit: c32154f2875a7fb9aac727013e1475cdd811e180 (patch)
tree: f0e061498a101fa824995fb6ec9f91e7e44257e1 /extra/bulk_edits/2020-03-19_arxiv_pubmed.md
parent: c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff)
download: fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.tar.gz
fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.zip
1 files changed, 57 insertions, 0 deletions
diff --git a/extra/bulk_edits/2020-03-19_arxiv_pubmed.md b/extra/bulk_edits/2020-03-19_arxiv_pubmed.md
new file mode 100644
index 00000000..56e88880
--- /dev/null
+++ b/extra/bulk_edits/2020-03-19_arxiv_pubmed.md
@@ -0,0 +1,57 @@
+
+On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed
+metadata started. In the case of pubmed, updates are enabled, so that recently
+created DOI releases get updated with PMID and extra metadata.
+
+We also want to do a final backfill of metadata, from the last import up
+through the first day covered by the continuous harvester.
+
+
+## arxiv
+
+The previous date span was 2019-05-22 through 2019-12-20. This time we should
+do 2019-12-20 through today.
+
+First run a metha update covering the last harvest through today, then grab the new daily files:
+
+    metha-sync -format arXivRaw http://export.arxiv.org/oai2
+
+    mkdir arxiv_20191220_20200319
+    cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
+    tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
+    gzip arxiv_20191220_20200319.tar
+
+Then copy to fatcat server and run import:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
+    => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Ran fairly quickly; there were only some 80-90k entities to process.
+
+## PubMed
+
+First, mirror update files from FTP, e.g. via lftp:
+
+    mkdir -p /srv/fatcat/datasets/pubmed_updates
+    lftp -e 'mirror -c /pubmed/updatefiles /srv/fatcat/datasets/pubmed_updates; bye' ftp://ftp.ncbi.nlm.nih.gov
+
+Inspect completed dates from kafka:
+
+    kafkacat -b $KAFKA_BROKER -t fatcat-prod.ftp-pubmed-state -C
+
+Show dates and corresponding files:
+
+    find /srv/fatcat/datasets/pubmed_updates -name "*html" | xargs cat | grep "Created" | sort
+
+For this bulk import, we used files pubmed20n1016.xml.gz (2019-12-16) up to pubmed20n1110.xml.gz (2020-03-06).
+
+To import the corresponding files, run:
+
+    printf "%s\n" /srv/fatcat/datasets/pubmed_updates/pubmed20n{1016..1110}.xml.gz | shuf | \
+        parallel -j16 'gunzip -c {} | ./fatcat_import.py pubmed --do-updates - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt'
+
+Import took 254 min; there were 1715427 PubmedArticle docs in these update files.
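
The PubMed import in the notes expands a brace range into file paths without checking that the FTP mirror actually fetched every file in that range. A minimal sketch of a pre-flight check follows. The function name `missing_updates` is hypothetical and not part of fatcat; it assumes the `pubmed20n<NNNN>.xml.gz` naming shown in the notes and prints any expected file that is absent.

```shell
# Hypothetical helper (not part of fatcat): print every expected PubMed
# update file in the numeric range [first, last] that is missing from dir.
# Assumes the pubmed20n<NNNN>.xml.gz naming used in the notes above.
missing_updates() {
    dir=$1
    n=$2
    last=$3
    while [ "$n" -le "$last" ]; do
        f="$dir/pubmed20n${n}.xml.gz"
        # Report the path only when the file does not exist.
        [ -e "$f" ] || printf '%s\n' "$f"
        n=$((n + 1))
    done
}
```

Running `missing_updates /srv/fatcat/datasets/pubmed_updates 1016 1110` before the `parallel` step would list any gaps in the mirror; empty output means all files in the range are present.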