arxiv bulk update notes

author: Bryan Newbold <bnewbold@robocracy.org> 2019-12-22 13:33:43 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2019-12-22 13:33:43 -0800
commit: 052907bf8af22a2638554b719410b10ac1a8f9b6 (patch)
tree: 03a59d2e166967e544e3c3a383aefab9eec55e43 /notes/bulk_edits
parent: fc6fa5a2d7f24c76d51f9ce2530fed055b20e27f (diff)
download: fatcat-052907bf8af22a2638554b719410b10ac1a8f9b6.tar.gz
fatcat-052907bf8af22a2638554b719410b10ac1a8f9b6.zip
2 files changed, 49 insertions, 2 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
new file mode 100644
index 00000000..526a0f02
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -0,0 +1,36 @@
+
+## Arxiv
+
+Used the metha-sync tool to update. Then went into the raw storage directory (as
+opposed to using `metha-cat`) and plucked out weekly files updated since the last
+import. Created a tarball and uploaded it to:
+
+    https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
+
+Downloaded, extracted, then unzipped:
+
+    gunzip *.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
+    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Things seem to run smoothly in QA. New releases get grouped with old works
+correctly, with no obvious duplication.
+
+In prod, loaded just the first file as a start, waiting to see if auto-ingest
+happens. Looks like yes! Great that everything is so smooth. All seem to be new
+captures.
+
+In prod elasticsearch, there were 2,377,645 arxiv releases before this
+import, 741,033 of them with files attached. Guessing about 150k new releases,
+but will check.
+
+Now up to 2,531,542 arxiv releases, so only 154k or so new releases were created;
+781,122 have fulltext.
+
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 3aa89b87..80760938 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,13 +9,24 @@ this file should probably get merged into the guide at some point.
 
 This file should not turn in to a TODO list!
 
+## 2019-12
+
+Inserted about 154k new arxiv release entities. Still no automatic daily
+harvesting.
+
+"Save Paper Now" importer running. This bot only *submits* editgroups for
+review, doesn't auto-accept them.
+
+## 2019-11
+
+Daily ingest of fulltext for OA releases now enabled. New file entities created
+and merged automatically.
+
 ## 2019-10
 
 Inserted 1.45m new release entities from Crossref which had been missed during
 a previous gap in continuous metadata harvesting.
 
-## 2019-10
-
 Updated 304,308 file entities to remove broken
 "https://web.archive.org/web/None/*" URLs.
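
The "154k or so" figure in the notes can be checked directly from the before and after counts. The sketch below is only an illustration of that arithmetic: the variable names are made up here, and it assumes that "with files attached" (before) and "with fulltext" (after) refer to the same measure, which the notes do not state explicitly.

```python
# Counts copied from the notes (prod elasticsearch, arxiv releases).
# Variable names are hypothetical; they do not come from fatcat itself.
before_total = 2_377_645
before_with_files = 741_033
after_total = 2_531_542
after_with_files = 781_122  # assumed comparable to "with files attached"

# New release entities created by the import.
new_releases = after_total - before_total

# Additional releases that now have a file attached.
new_with_files = after_with_files - before_with_files

print(f"new releases: {new_releases:,}")            # 153,897
print(f"newly with files: {new_with_files:,}")      # 40,089
print(f"share with files: {after_with_files / after_total:.1%}")
```

The difference of 153,897 rounds to the 154k recorded in the notes and in the CHANGELOG entry.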