author | Bryan Newbold <bnewbold@robocracy.org> | 2019-12-22 13:33:43 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2019-12-22 13:33:43 -0800 |
commit | 052907bf8af22a2638554b719410b10ac1a8f9b6 (patch) | |
tree | 03a59d2e166967e544e3c3a383aefab9eec55e43 /notes/bulk_edits/2019-12-20_updates.md | |
parent | fc6fa5a2d7f24c76d51f9ce2530fed055b20e27f (diff) | |
arxiv bulk update notes
Diffstat (limited to 'notes/bulk_edits/2019-12-20_updates.md')
-rw-r--r-- | notes/bulk_edits/2019-12-20_updates.md | 36 |
1 files changed, 36 insertions, 0 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
new file mode 100644
index 00000000..526a0f02
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -0,0 +1,36 @@
+
+## Arxiv
+
+Used metha-sync tool to update. Then went in raw storage directory (as opposed
+to using `metha-cat`) and plucked out weekly files updated since last import.
+Created a tarball and uploaded to:
+
+    https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
+
+Downloaded, extracted, then unzipped:
+
+    gunzip *.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
+    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Things seem to run smoothly in QA. New releases get grouped with old works
+correctly, no duplication obvious.
+
+In prod, loaded just the first file as a start, waiting to see if auto-ingest
+happens. Looks like yes! Great that everything is so smooth. All seem to be new
+captures.
+
+In production prod elasticsearch, 2,377,645 arxiv releases before this
+updated import, 741,033 with files attached. Guessing about 150k new releases,
+but will check.
+
+Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
+781,122 with fulltext.
+
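
The release counts quoted at the end of the notes can be checked with a few lines of arithmetic. This is a minimal sketch, not part of fatcat: the variable names are hypothetical, the four counts are copied from the notes, and it assumes "with files attached" and "with fulltext" refer to the same measure.

```python
# Counts copied from the notes; variable names are illustrative only.
releases_before = 2_377_645    # arxiv releases before the import
releases_after = 2_531_542     # arxiv releases after the import
with_files_before = 741_033    # "with files attached"
with_files_after = 781_122     # "with fulltext" (assumed to be the same measure)

# Net change in each count across the import.
new_releases = releases_after - releases_before
with_files_delta = with_files_after - with_files_before

print(f"new arxiv releases: {new_releases:,}")
print(f"change in releases with files: {with_files_delta:,}")
print(f"file coverage before: {with_files_before / releases_before:.1%}")
print(f"file coverage after:  {with_files_after / releases_after:.1%}")
```

The difference in release counts rounds to the "154k or so" stated in the notes. The change in releases with files is only a net figure; it does not show whether the newly matched files belong to new or pre-existing releases.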