Diffstat (limited to 'notes/bulk_edits')
-rw-r--r--  notes/bulk_edits/2019-12-20_updates.md  82
-rw-r--r--  notes/bulk_edits/CHANGELOG.md           15
2 files changed, 95 insertions, 2 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
new file mode 100644
index 00000000..a8f62ea9
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -0,0 +1,82 @@
+
+## Arxiv
+
+Used metha-sync tool to update. Then went in raw storage directory (as opposed
+to using `metha-cat`) and plucked out weekly files updated since last import.
+Created a tarball and uploaded to:
+
+    https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
+
+Downloaded, extracted, then unzipped:
+
+    gunzip *.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_ARXIV=...
+
+    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
+    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
+
+    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Things seem to run smoothly in QA. New releases get grouped with old works
+correctly, no duplication obvious.
+
+In prod, loaded just the first file as a start, waiting to see if auto-ingest
+happens. Looks like yes! Great that everything is so smooth. All seem to be new
+captures.
+
+In production prod elasticsearch, 2,377,645 arxiv releases before this
+updated import, 741,033 with files attached. Guessing about 150k new releases,
+but will check.
+
+Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
+781,122 with fulltext.
+
+## Pubmed
+
+Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
+
+    gunzip *.xml.gz
+
+Run importer:
+
+    export FATCAT_AUTH_WORKER_PUBMED=...
+
+    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})
+
+Noticed that `release_year` was not getting set for many releases. Made a small
+code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and trying another:
+
+    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})
+
+    real    30m45.044s
+    user    16m43.672s
+    sys     0m10.792s
+
+    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+More errors:
+
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
+    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}
+
+
+    10.1037//0002-9432.72.1.50
+    BOGUS DOI: 10.1037//0021-843x.106.2.266
+    BOGUS DOI: 10.1037//0021-843x.106.2.280
+    => actual ok? at least redirect ok
+
+    unparsable medline date, skipping: Summer 2018
+
+TODO:
+x fix bad DOI error (real error, skip these)
+x remove newline after "unparsable medline date" error
+x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
+
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 3aa89b87..80760938 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,13 +9,24 @@ this file should probably get merged into the guide at some point.
 
 This file should not turn in to a TODO list!
 
+## 2019-12
+
+Inserted about 154k new arxiv release entities. Still no automatic daily
+harvesting.
+
+"Save Paper Now" importer running. This bot only *submits* editgroups for
+review, doesn't auto-accept them.
+
+## 2019-11
+
+Daily ingest of fulltext for OA releases now enabled. New file entities created
+and merged automatically.
+
 ## 2019-10
 
 Inserted 1.45m new release entities from Crossref which had been missed during a
 previous gap in continuous metadata harvesting.
 
-## 2019-10
-
 Updated 304,308 file entities to remove broken
 "https://web.archive.org/web/None/*" URLs.
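
The "154k" figure that appears in both the arxiv notes and the CHANGELOG entry can be checked from the before/after elasticsearch counts quoted in the notes. The sketch below is only that arithmetic; the `delta` helper and the variable names are illustrative and not part of fatcat. It also assumes that "with files attached" (before) and "with fulltext" (after) refer to the same metric, which the notes do not state explicitly.

```python
def delta(before: int, after: int) -> int:
    """Return how many records were added between two counts."""
    return after - before


# Counts copied from the arxiv section of the notes.
arxiv_new = delta(2_377_645, 2_531_542)

# Assumption: "files attached" and "fulltext" are the same measure.
fulltext_new = delta(741_033, 781_122)

print(f"new arxiv releases: {arxiv_new:,}")
print(f"new releases with fulltext: {fulltext_new:,}")
print(f"rounded to the nearest thousand: {round(arxiv_new, -3):,}")
```

The difference in release counts rounds to 154,000, which agrees with the "about 154k" wording in the CHANGELOG.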