author | Bryan Newbold <bnewbold@robocracy.org> | 2019-12-23 18:17:46 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2019-12-23 18:18:40 -0800 |
commit | aa6bec5993e5444937054022b46d751645d0183c (patch) | |
tree | c06ef64fcaef77ac9ea3cd726755f352e51d1281 | |
parent | f9ffc6d99c53a2ead4b3486e25f0342b6ebc5cdc (diff) | |
pubmed bulk import notes (from QA)
-rw-r--r-- | notes/bulk_edits/2019-12-20_updates.md | 45 |
1 files changed, 45 insertions, 0 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
index 526a0f02..c8b1438c 100644
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -34,3 +34,48 @@ but will check.
 Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
 781,122 with fulltext.
+## Pubmed
+
+Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
+
+    gunzip *.xml.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_PUBMED=...
+
+    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})
+
+Noticed that `release_year` was not getting set for many releases. Made a small
+code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and trying another:
+
+    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})
+
+    real    30m45.044s
+    user    16m43.672s
+    sys     0m10.792s
+
+    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+More errors:
+
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}
+
+
+    BOGUS DOI: 10.1037//0021-843x.106.2.266
+    BOGUS DOI: 10.1037//0021-843x.106.2.280
+    => actual ok? at least redirect ok
+
+    unparsable medline date, skipping: Summer 2018
+
+TODO:
+- fix bad DOI error (real error, skip these)
+- remove newline after "unparsable medline date" error
+- remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
+
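The first TODO item (skip releases with bad DOIs instead of hitting `MalformedExternalId`) could be handled with a client-side check before submitting. Below is a minimal sketch, not the importer's actual code: `clean_doi` and `DOI_PATTERN` are hypothetical names, and the pattern is an assumption rather than the rule the fatcat API enforces. It rejects the two whitespace-containing DOIs from the log above and accepts the double-slash `10.1037//...` form, which the notes suggest resolves fine. It would also accept `10.13201/j.issn.10011781.2016.06.002` as printed, so it does not reproduce the server's check exactly.

```python
import re

# Assumed pattern (NOT taken from the fatcat API): "10.", a 3-6 digit
# registrant prefix, a slash, then a suffix with no whitespace.
DOI_PATTERN = re.compile(r"10\.\d{3,6}/\S+")


def clean_doi(raw):
    """Return a stripped, lowercased DOI, or None if it fails DOI_PATTERN.

    Hypothetical helper: a caller would skip the DOI (or the whole
    release) when this returns None, instead of sending it to the API.
    """
    if raw is None:
        return None
    doi = raw.strip().lower()
    if DOI_PATTERN.fullmatch(doi):
        return doi
    return None


if __name__ == "__main__":
    samples = [
        "10.3760/cma. j. issn.2095-4352. 2014. 07.014",
        "10.23750/abm.v88i2 -s.6506",
        "10.1037//0021-843x.106.2.266",
    ]
    for s in samples:
        print(repr(s), "->", clean_doi(s))
```

Returning `None` rather than raising keeps the decision (skip the field vs. skip the release) with the caller, which fits the counter-based reporting the importer already prints.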