From 052907bf8af22a2638554b719410b10ac1a8f9b6 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Sun, 22 Dec 2019 13:33:43 -0800
Subject: arxiv bulk update notes

---
 notes/bulk_edits/2019-12-20_updates.md | 36 ++++++++++++++++++++++++++++++++++
 notes/bulk_edits/CHANGELOG.md          | 15 ++++++++++++--
 2 files changed, 49 insertions(+), 2 deletions(-)
 create mode 100644 notes/bulk_edits/2019-12-20_updates.md

(limited to 'notes')

diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
new file mode 100644
index 00000000..526a0f02
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -0,0 +1,36 @@
+
+## Arxiv
+
+Used metha-sync tool to update. Then went in raw storage directory (as opposed
+to using `metha-cat`) and plucked out weekly files updated since last import.
+Created a tarball and uploaded to:
+
+    https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
+
+Downloaded, extracted, then unzipped:
+
+    gunzip *.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
+    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Things seem to run smoothly in QA. New releases get grouped with old works
+correctly, no duplication obvious.
+
+In prod, loaded just the first file as a start, waiting to see if auto-ingest
+happens. Looks like yes! Great that everything is so smooth. All seem to be new
+captures.
+
+In production prod elasticsearch, 2,377,645 arxiv releases before this
+updated import, 741,033 with files attached. Guessing about 150k new releases,
+but will check.
+
+Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
+781,122 with fulltext.
+
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 3aa89b87..80760938 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,13 +9,24 @@ this file should probably get merged into the guide at some point.
 
 This file should not turn in to a TODO list!
 
+## 2019-12
+
+Inserted about 154k new arxiv release entities. Still no automatic daily
+harvesting.
+
+"Save Paper Now" importer running. This bot only *submits* editgroups for
+review, doesn't auto-accept them.
+
+## 2019-11
+
+Daily ingest of fulltext for OA releases now enabled. New file entities created
+and merged automatically.
+
 ## 2019-10
 
 Inserted 1.45m new release entities from Crossref which had been missed during
 a previous gap in continuous metadata harvesting.
 
-## 2019-10
-
 Updated 304,308 file entities to remove broken
 "https://web.archive.org/web/None/*" URLs.
 
-- 
cgit v1.2.3
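
The "154k or so" figure in the notes can be checked by differencing the before and after counts quoted there. The following is a minimal sketch, not part of the commit or of fatcat: the variable and function names are illustrative, and it assumes that "with files attached" and "with fulltext" describe the same elasticsearch count.

```python
# Counts copied from the notes (prod elasticsearch, before and after the import).
releases_before = 2_377_645
releases_after = 2_531_542
# Assumption: "with files attached" and "with fulltext" are the same metric.
with_files_before = 741_033
with_files_after = 781_122


def delta(before: int, after: int) -> int:
    """Return how many entities were added between two snapshots."""
    return after - before


new_releases = delta(releases_before, releases_after)
new_with_files = delta(with_files_before, with_files_after)

print(f"new arxiv releases: {new_releases:,}")
print(f"new releases with files: {new_with_files:,}")
```

Under these assumptions the release delta rounds to the 154k reported in the notes and in the CHANGELOG entry.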