## Arxiv

Used the metha-sync tool to update. Then went into the raw storage directory
(as opposed to using `metha-cat`) and plucked out the weekly files updated
since the last import. Created a tarball and uploaded it to:

    https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz

Downloaded, extracted, then unzipped:

    gunzip *.gz

Run importer:

    export FATCAT_AUTH_WORKER_ARXIV=...

    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

Things seem to run smoothly in QA. New releases get grouped with old works
correctly, with no obvious duplication.

In prod, loaded just the first file as a start, waiting to see if auto-ingest
happens. Looks like yes! Great that everything is so smooth. All seem to be new
captures.

In production elasticsearch there were 2,377,645 arxiv releases before this
update import, 741,033 of them with files attached. Guessing about 150k new
releases, but will check.

Now up to 2,531,542 arxiv releases, so only 154k or so new releases created.
781,122 with fulltext.

## Pubmed QA

Grabbed the fresh 2020 baseline, released in December 2019:
<https://archive.org/details/pubmed_medline_baseline_2020>

    gunzip *.xml.gz

Run importer:

    export FATCAT_AUTH_WORKER_PUBMED=...

    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})

Noticed that `release_year` was not getting set for many releases. Made a small
code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and tried another file:

    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})

    real    30m45.044s
    user    16m43.672s
    sys     0m10.792s

Full parallel run over the whole baseline:

    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

More errors:

    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}

    10.1037//0002-9432.72.1.50
    BOGUS DOI: 10.1037//0021-843x.106.2.266
    BOGUS DOI: 10.1037//0021-843x.106.2.280
    => actually ok? at least they redirect ok

    unparsable medline date, skipping: Summer 2018
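
The MalformedExternalId errors above are all DOIs with embedded spaces or
similar junk; the TODO below is to skip these client-side instead of erroring
out. A minimal sketch of such a pre-filter, assuming a hypothetical helper name
`looks_like_valid_doi` and a regex that is only a rough approximation of
whatever pattern the API actually enforces:

    import re

    # Rough approximation of the DOI pattern described in the API error
    # message ("expected, eg, '10.1234/aksjdfh'"); NOT the exact server-side
    # regex.
    DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")

    def looks_like_valid_doi(doi: str) -> bool:
        """Cheap client-side filter for obviously malformed DOIs: rejects
        values with embedded whitespace or without a '10.NNNN/' prefix."""
        return bool(DOI_PATTERN.match(doi.strip().lower()))

    # Some of the rejected values from the log above:
    for doi in [
        "10.3760/cma. j. issn.2095-4352. 2014. 07.014",  # spaces -> False
        "10.23750/abm.v88i2 -s.6506",                    # space  -> False
        "10.1037//0002-9432.72.1.50",                    # double slash -> True
    ]:
        print(looks_like_valid_doi(doi), doi)

The double-slash `10.1037//...` DOIs pass this kind of check, which fits the
note above that they do at least redirect.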

TODO:
x fix bad DOI error (real error, skip these)
x remove newline after "unparsable medline date" error
x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning

NOTE: I remember running through the entire baseline in QA, but didn't save
the command or output.

## Pubmed Prod (2020-01-17)

This is after adding a flag to enforce no updates at all, only new releases.
Will likely revisit and run through again with updates enabled, to add
important metadata like exact reference matches for older releases, after
doing release merge/group cleanups.

    # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
    # had a trivial typo in fatcat_import.py, will push a fix
    export FATCAT_AUTH_WORKER_PUBMED=...
    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

Full run:

    fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

    [...]
    Command exited with non-zero status 2
    1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
    486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps

    => parallel's exit status is the number of failed jobs, so apparently 2 tasks failed
    => 1,271,708 user seconds is about 353 CPU-hours; elapsed walltime was
       31:42:15, consistent with dividing by the ~11.3x average CPU utilization

Only received a single exception, at:

    Jan 18, 2020 8:33:09 AM UTC
    /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
    MalformedExternalId: 10.4149/gpb¬_2017042

Not sure what the other failure was... maybe an invalid filename or argument
before processing actually started? Or some failure (OOM?) that prevented
Sentry reporting.

Patched normal.py and re-ran that single file:

    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    [...]
    Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})

Done!

## Chocula

Command:

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
    ./fatcat_import.py chocula /srv/fatcat/datasets/export_fatcat.2019-12-26.json

Result:

    Counter({'total': 144455, 'exists': 139807, 'insert': 2384, 'skip': 2264, 'skip-unknown-new-issnl': 2264, 'exists-by-issnl': 306, 'update': 0})
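
One recurring annoyance above: the per-file `Counter({...})` summaries from the
`parallel -j16` runs get scattered across worker output, so overall totals (new
releases inserted, skips, etc.) end up eyeballed or lost ("didn't save the
command or output"). A small sketch for tallying them after the fact from a
saved log; the log file, the `tee` step, and the script name are assumptions,
and it just pattern-matches the Counter lines as printed above:

    import ast
    import re
    import sys
    from collections import Counter

    # Matches the "Counter({'total': 30000, ...})" summary lines that
    # fatcat_import.py prints per input file (format as quoted above).
    COUNTER_RE = re.compile(r"Counter\((\{.*\})\)")

    def sum_counters(log_path: str) -> Counter:
        """Sum per-file import counters out of a saved log file."""
        total: Counter = Counter()
        with open(log_path) as f:
            for line in f:
                m = COUNTER_RE.search(line)
                if m:
                    # the dict literal is plain python, so literal_eval is enough
                    total.update(Counter(ast.literal_eval(m.group(1))))
        return total

    if __name__ == "__main__":
        # e.g.: fd '.xml$' ... | parallel -j16 ./fatcat_import.py ... 2>&1 | tee import.log
        #       python3 sum_counters.py import.log
        print(sum_counters(sys.argv[1]))

Piping the parallel run through `tee` into a log file is what makes this
possible after the fact.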