author Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
commit c32154f2875a7fb9aac727013e1475cdd811e180 (patch)
tree f0e061498a101fa824995fb6ec9f91e7e44257e1 /extra/bulk_edits/2019-12-20_updates.md
parent c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff)
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'extra/bulk_edits/2019-12-20_updates.md')
-rw-r--r-- extra/bulk_edits/2019-12-20_updates.md 137
1 files changed, 137 insertions, 0 deletions
diff --git a/extra/bulk_edits/2019-12-20_updates.md b/extra/bulk_edits/2019-12-20_updates.md
new file mode 100644
index 00000000..bd069a7a
--- /dev/null
+++ b/extra/bulk_edits/2019-12-20_updates.md
@@ -0,0 +1,137 @@
+
+## Arxiv
+
+Used the `metha-sync` tool to update. Then went into the raw storage directory (as opposed
+to using `metha-cat`) and plucked out the weekly files updated since the last import.
+Created a tarball and uploaded to:
+
+ https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
+
+Downloaded, extracted, then unzipped:
+
+ gunzip *.gz
+
+Run importer:
+
+ export FATCAT_AUTH_WORKER_ARXIV=...
+
+ ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
+ # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
+
+ fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
+
+Things seem to run smoothly in QA. New releases get grouped with old works
+correctly, with no obvious duplication.
+
+In prod, loaded just the first file as a start, waiting to see if auto-ingest
+happens. Looks like yes! Great that everything is so smooth. All seem to be new
+captures.
+
+In prod elasticsearch, there were 2,377,645 arxiv releases before this
+updated import, 741,033 with files attached. Guessing about 150k new releases,
+but will check.
+
+Now up to 2,531,542 arxiv releases, so about 154k new releases were created.
+781,122 with fulltext.
+
+## Pubmed QA
+
+Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
+
+ gunzip *.xml.gz
+
+Run importer:
+
+ export FATCAT_AUTH_WORKER_PUBMED=...
+
+ ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+ # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})
+
+Noticed that `release_year` was not getting set for many releases. Made a small
+code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and tried another file:
+
+ time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+ # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})
+
+ real 30m45.044s
+ user 16m43.672s
+ sys 0m10.792s
+
+ time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+More errors:
+
+ HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
+ HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
+ HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}
+
+
+ 10.1037//0002-9432.72.1.50
+ BOGUS DOI: 10.1037//0021-843x.106.2.266
+ BOGUS DOI: 10.1037//0021-843x.106.2.280
+ => actual ok? at least redirect ok
+
+ unparsable medline date, skipping: Summer 2018
+
+TODO:
+x fix bad DOI error (real error, skip these)
+x remove newline after "unparsable medline date" error
+x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
+
+NOTE: I recall running through the entire baseline in QA, but did not save the command or output.
+
+## Pubmed Prod (2020-01-17)
+
+This is after adding a flag to enforce no updates at all, only new releases.
+Will likely revisit and run through with updates that add important metadata
+like exact reference matches for older releases, after doing release
+merge/group cleanups.
+
+
+ # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
+ # had a trivial typo in fatcat_import.py, will push a fix
+ export FATCAT_AUTH_WORKER_PUBMED=...
+ time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+Full run:
+
+ fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+ [...]
+ Command exited with non-zero status 2
+ 1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
+ 486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps
+
+ => so apparently 2x tasks failed
+ => 1271708 = 353 hours... but what walltime? about 31-32 hours if divide by CPU
+
+Only received a single exception at:
+
+ Jan 18, 2020 8:33:09 AM UTC
+ /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
+ MalformedExternalId: 10.4149/gpb¬_2017042
+
+Not sure what the other failure was... maybe an invalid filename or argument,
+before processing actually started? Or some failure (OOM) that prevented sentry
+reporting?
+
+Patched `normal.py` and re-ran that single file:
+
+ ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ [...]
+ Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})
+
+Done!
+
+## Chocula
+
+Command:
+
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
+ ./fatcat_import.py chocula /srv/fatcat/datasets/export_fatcat.2019-12-26.json
+
+Result:
+
+ Counter({'total': 144455, 'exists': 139807, 'insert': 2384, 'skip': 2264, 'skip-unknown-new-issnl': 2264, 'exists-by-issnl': 306, 'update': 0})
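
As a sanity check on the chocula result above, the top-level tallies can be compared against `total`. The sketch below is a minimal Python illustration, not part of fatcat. It assumes that `exists`, `insert`, `skip`, and `update` are mutually exclusive per-record outcomes, and that the hyphenated keys (`skip-unknown-new-issnl`, `exists-by-issnl`) are sub-reasons already counted in their parent tally. That reading is inferred from the numbers, not confirmed against the importer source. The `unaccounted` helper is a hypothetical name.

```python
from collections import Counter

# Result copied verbatim from the chocula import run above.
result = Counter({
    'total': 144455, 'exists': 139807, 'insert': 2384, 'skip': 2264,
    'skip-unknown-new-issnl': 2264, 'exists-by-issnl': 306, 'update': 0,
})

# Assumption: these four keys are mutually exclusive per-record outcomes;
# the hyphenated keys are treated as sub-reasons and are not summed.
OUTCOMES = ('exists', 'insert', 'skip', 'update')


def unaccounted(counts):
    """Return 'total' minus the sum of the top-level outcome tallies."""
    return counts['total'] - sum(counts[key] for key in OUTCOMES)


print(unaccounted(result))
```

Under that assumption the difference is zero, since 139807 + 2384 + 2264 + 0 = 144455. The arxiv counter earlier in these notes reports more `exists` than `total`, so the same check should not be applied blindly to other importers.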