From 0742d0904166192ed48cd83e604a4d95246dfa47 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Sun, 19 Jan 2020 09:52:17 -0800
Subject: pubmed update notes

---
 notes/bulk_edits/2019-12-20_updates.md | 47 +++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
index 83c8d9da..bd069a7a 100644
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -34,7 +34,7 @@ but will check.
 
 Up to 2,531,542 arxiv releases, so only 154k or so new releases created. 781,122 with fulltext.
 
-## Pubmed
+## Pubmed QA
 
 Grabbed fresh 2020 baseline, released in December 2019:
 
@@ -80,6 +80,51 @@ x fix bad DOI error (real error, skip these)
 x remove newline after "unparsable medline date" error
 x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
 
+NOTE: Remember having run through the entire baseline in QA, but didn't save the command or output.
+
+## Pubmed Prod (2020-01-17)
+
+This is after adding a flag to enforce no updates at all, only new releases.
+Will likely revisit and run through with updates that add important metadata
+like exact references matches for older releases, after doing release
+merge/group cleanups.
+
+
+    # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
+    # had a trivial typo in fatcat_import.py, will push a fix
+    export FATCAT_AUTH_WORKER_PUBMED=...
+    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+Full run:
+
+    fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    [...]
+    Command exited with non-zero status 2
+    1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
+    486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps
+
+    => so apparently 2x tasks failed
+    => 1271708 = 353 hours... but what walltime? about 31-32 hours if divide by CPU
+
+Only received a single exception at:
+
+    Jan 18, 2020 8:33:09 AM UTC
+    /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
+    MalformedExternalId: 10.4149/gpb¬_2017042
+
+Not sure what the other failure was... maybe an invalid filename or argument,
+before processing actually started? Or some failure (OOM) that prevented sentry
+reporting?
+
+Patch normal.py and re-run that single file:
+
+    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+    [...]
+    Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})
+
+Done!
+
 ## Chocula
 
 Command:
-- 
cgit v1.2.3
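
The timing arithmetic in the Pubmed Prod notes above ("353 hours", "about 31-32 hours") can be double-checked with a short script. This is only a sketch: the figures are copied from the `time` output and the `Counter` line in the notes, the helper name `elapsed_to_seconds` is made up for illustration, and it assumes the elapsed field is `h:mm:ss` with no fractional seconds. It also assumes `warn-pmid-doi-mismatch` is a warning tally rather than a separate per-record outcome, which the sum below is consistent with but does not prove.

```python
# Figures copied verbatim from the `time` output in the notes.
USER_SECONDS = 1271708.20
SYSTEM_SECONDS = 23689.44
ELAPSED = "31:42:15"  # wall-clock time as printed in the "elapsed" field


def elapsed_to_seconds(elapsed: str) -> int:
    """Convert an h:mm:ss (or m:ss) string to whole seconds.

    Assumes no fractional seconds, which holds for the value above.
    """
    total = 0
    for part in elapsed.split(":"):
        total = total * 60 + int(part)
    return total


wall_seconds = elapsed_to_seconds(ELAPSED)
cpu_seconds = USER_SECONDS + SYSTEM_SECONDS

user_hours = USER_SECONDS / 3600                 # the "353 hours" figure
wall_hours = wall_seconds / 3600                 # the "31-32 hours" figure
cpu_percent = 100 * cpu_seconds / wall_seconds   # compare with the %CPU field

# Counter from the single-file re-run of pubmed20n0936.xml.
# 'warn-pmid-doi-mismatch' is left out on the assumption that it is a warning
# tally, not a per-record outcome.
counts = {"total": 30000, "exists": 27243, "skip": 1605, "insert": 1152, "update": 0}
accounted = counts["exists"] + counts["skip"] + counts["insert"] + counts["update"]

print(f"user CPU time:   {user_hours:.1f} hours")
print(f"wall-clock time: {wall_hours:.1f} hours")
print(f"CPU utilisation: {cpu_percent:.1f}%")
print(f"records accounted for: {accounted} of {counts['total']}")
```

The wall-clock time does not need to be derived by dividing by CPU: the `elapsed` field already reports it directly. On the "2x tasks failed" reading, GNU parallel documents exit statuses 1-100 as the number of failed jobs, so a status of 2 passed through by `time` fits that interpretation.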