author Bryan Newbold <bnewbold@robocracy.org> 2019-12-23 18:17:46 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2019-12-23 18:18:40 -0800
commit aa6bec5993e5444937054022b46d751645d0183c
tree c06ef64fcaef77ac9ea3cd726755f352e51d1281
parent f9ffc6d99c53a2ead4b3486e25f0342b6ebc5cdc
pubmed bulk import notes (from QA)
-rw-r--r-- notes/bulk_edits/2019-12-20_updates.md 45
1 files changed, 45 insertions, 0 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
index 526a0f02..c8b1438c 100644
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -34,3 +34,48 @@ but will check.
Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
781,122 with fulltext.
+## Pubmed
+
+Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
+
+    gunzip *.xml.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_PUBMED=...
+
+    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})
+
+Noticed that `release_year` was not getting set for many releases. Made a small
+code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and tried another file:
+
+    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})
+
+    real 30m45.044s
+    user 16m43.672s
+    sys 0m10.792s
+
+    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+More errors:
+
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}
+
+
+    BOGUS DOI: 10.1037//0021-843x.106.2.266
+    BOGUS DOI: 10.1037//0021-843x.106.2.280
+    => these may actually be valid DOIs? at least they redirect ok
+
+    unparsable medline date, skipping: Summer 2018
+
+TODO:
+- fix bad DOI error (these are genuinely malformed DOIs; skip them)
+- remove newline after "unparsable medline date" error
+- remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
+
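The first TODO item (skip malformed DOIs rather than sending them to the API) could be sketched roughly as below. This is a minimal illustration, not the importer's actual code: the `clean_doi` helper and the `DOI_PATTERN` regex are hypothetical, and the pattern is only a guess at what the server enforces, based on the `10.1234/aksjdfh` example in the error message. Note that `10.13201/j.issn.10011781.2016.06.002` as printed above would pass this pattern, so the server's real rule is presumably stricter, or the logged string differs from what was sent.

```python
import re

# ASSUMPTION: "10.", a 3-6 digit registrant code, "/", then a non-empty
# suffix containing no whitespace. Not copied from the fatcat source.
DOI_PATTERN = re.compile(r"10\.\d{3,6}/\S+")


def clean_doi(raw):
    """Return the stripped DOI if it fits the assumed pattern, else None.

    Hypothetical helper: a caller would skip the DOI (or the whole
    record) when this returns None.
    """
    doi = raw.strip()
    if DOI_PATTERN.fullmatch(doi) is None:
        return None
    return doi


if __name__ == "__main__":
    examples = [
        "10.1234/aksjdfh",
        "10.3760/cma. j. issn.2095-4352. 2014. 07.014",
        "10.23750/abm.v88i2 -s.6506",
        "10.1037//0021-843x.106.2.266",
    ]
    for raw in examples:
        print(repr(raw), "->", clean_doi(raw))
```

Under this assumed pattern the two DOIs with embedded whitespace are rejected, while the double-slash `10.1037//...` form is accepted, which matches the observation above that those at least redirect.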