Diffstat (limited to 'notes/bulk_edits/2019-12-20_updates.md')
-rw-r--r-- notes/bulk_edits/2019-12-20_updates.md 36
1 files changed, 36 insertions, 0 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
new file mode 100644
index 00000000..526a0f02
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -0,0 +1,36 @@
+
+## Arxiv
+
+Used the `metha-sync` tool to update. Then went into the raw storage directory
+(as opposed to using `metha-cat`) and plucked out the weekly files updated since
+the last import. Created a tarball and uploaded it to:
+
+ https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
+
+Downloaded, extracted, then unzipped:
+
+ gunzip *.gz
+
+Run importer:
+
+ export FATCAT_AUTH_WORKER_ARXIV=...
+
+ ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
+ # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
+
+ fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Things seem to run smoothly in QA. New releases get grouped with old works
+correctly, with no obvious duplication.
+
+In prod, loaded just the first file as a start, waiting to see if auto-ingest
+happens. Looks like yes! Great that everything is so smooth. All seem to be new
+captures.
+
+In prod elasticsearch, 2,377,645 arxiv releases before this updated import,
+741,033 with files attached. Guessing about 150k new releases,
+but will check.
+
+Up to 2,531,542 arxiv releases, so about 154k new releases created.
+781,122 with fulltext.
+
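
The before/after counts in the notes can be checked with a few lines of arithmetic. The following is only a sketch: the numbers are copied from the notes, the variable and function names are made up for illustration, and it assumes that "with files attached" and "with fulltext" refer to the same elasticsearch count.

```python
# Release counts copied from the notes (prod elasticsearch).
# Assumption: "with files attached" and "with fulltext" are the same metric.
before_total = 2_377_645
before_with_files = 741_033
after_total = 2_531_542
after_with_files = 781_122


def delta(before: int, after: int) -> int:
    """Return how many records were added between two snapshots."""
    return after - before


def coverage(with_files: int, total: int) -> float:
    """Return the fraction of releases that have a file attached."""
    return with_files / total


new_releases = delta(before_total, after_total)
new_with_files = delta(before_with_files, after_with_files)

print(f"new arxiv releases:      {new_releases:,}")
print(f"new releases with files: {new_with_files:,}")
print(f"file coverage before:    {coverage(before_with_files, before_total):.1%}")
print(f"file coverage after:     {coverage(after_with_files, after_total):.1%}")
```

The release delta comes out to 153,897, which matches the "154k or so" figure and is close to the earlier 150k guess.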