summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2020-01-19 09:52:17 -0800
committerBryan Newbold <bnewbold@robocracy.org>2020-01-19 09:58:18 -0800
commit0742d0904166192ed48cd83e604a4d95246dfa47 (patch)
tree7da1dabd3c546e034161fced98bf8054a15e7a49
parenta73f751d9eea2b4eaca780e940f324c9c07119a2 (diff)
downloadfatcat-0742d0904166192ed48cd83e604a4d95246dfa47.tar.gz
fatcat-0742d0904166192ed48cd83e604a4d95246dfa47.zip
pubmed update notes
-rw-r--r--notes/bulk_edits/2019-12-20_updates.md47
1 files changed, 46 insertions, 1 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
index 83c8d9da..bd069a7a 100644
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -34,7 +34,7 @@ but will check.
Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
781,122 with fulltext.
-## Pubmed
+## Pubmed QA
Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
@@ -80,6 +80,51 @@ x fix bad DOI error (real error, skip these)
x remove newline after "unparsable medline date" error
x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
+NOTE: Remember having run through the entire baseline in QA, but didn't save the command or output.
+
+## Pubmed Prod (2020-01-17)
+
+This is after adding a flag to enforce no updates at all, only new releases.
+Will likely revisit and run through with updates that add important metadata
+like exact references matches for older releases, after doing release
+merge/group cleanups.
+
+
+ # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
+ # had a trivial typo in fatcat_import.py, will push a fix
+ export FATCAT_AUTH_WORKER_PUBMED=...
+ time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+Full run:
+
+ fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+ [...]
+ Command exited with non-zero status 2
+ 1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
+ 486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps
+
+ => so apparently 2x tasks failed
+ => 1271708 = 353 hours... but what walltime? about 31-32 hours if divide by CPU
+
+Only received a single exception at:
+
+ Jan 18, 2020 8:33:09 AM UTC
+ /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
+ MalformedExternalId: 10.4149/gpb¬_2017042
+
+Not sure what the other failure was... maybe an invalid filename or argument,
+before processing actually started? Or some failure (OOM) that prevented sentry
+reporting?
+
+Patch normal.py and re-run that single file:
+
+ ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ [...]
+ Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})
+
+Done!
+
## Chocula
Command: