From 4bcef62ecd98f2719fc4d1cef35394b0bad5cb2b Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@robocracy.org>
Date: Fri, 20 Mar 2020 16:31:22 -0700
Subject: notes on arxiv+pubmed backfill

---
 notes/bulk_edits/2020-03-19_arxiv_pubmed.md | 37 +++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 notes/bulk_edits/2020-03-19_arxiv_pubmed.md

diff --git a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
new file mode 100644
index 00000000..25220ad3
--- /dev/null
+++ b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
@@ -0,0 +1,37 @@
+
+On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed
+metadata started. In the case of pubmed, updates are enabled, so that recently
+created DOI releases get updated with PMID and extra metadata.
+
+We also want to do a final backfill of metadata, from the last import up through
+the first day covered by the continuous harvester.
+
+
+## arxiv
+
+The previous date span was 2019-05-22 through 2019-12-20. This time we should
+do 2019-12-20 through today.
+
+First, run a metha update from the last harvest through today, and grab the new daily files:
+
+    metha-sync -format arXivRaw http://export.arxiv.org/oai2
+
+    mkdir arxiv_20191220_20200319
+    cp 2019-12-2* 2019-12-3* 2020-* arxiv_20191220_20200319/
+    tar cf arxiv_20191220_20200319.tar arxiv_20191220_20200319/
+    gzip arxiv_20191220_20200319.tar
+
+Then copy to fatcat server and run import:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20191220_20200319/2019-12-31-00000000.xml
+    => Counter({'exists': 1824, 'total': 1001, 'insert': 579, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20191220_20200319/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Ran fairly quickly; there were only some ~80-90k entities to process.
+
+## PubMed
+
+TODO: Martin will import daily update files from the 2020 baseline through XYZ date.
-- 
cgit v1.2.3