aboutsummaryrefslogtreecommitdiffstats
path: root/notes/bulk_edits/2019-12-20_updates.md
diff options
context:
space:
mode:
Diffstat (limited to 'notes/bulk_edits/2019-12-20_updates.md')
-rw-r--r--notes/bulk_edits/2019-12-20_updates.md137
1 files changed, 0 insertions, 137 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
deleted file mode 100644
index bd069a7a..00000000
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ /dev/null
@@ -1,137 +0,0 @@
-
-## Arxiv
-
-Used metha-sync tool to update. Then went in raw storage directory (as opposed
-to using `metha-cat`) and plucked out weekly files updated since last import.
-Created a tarball and uploaded to:
-
- https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
-
-Downloaded, extracted, then unzipped:
-
- gunzip *.gz
-
-Run importer:
-
- export FATCAT_AUTH_WORKER_ARXIV=...
-
- ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
- # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})
-
- fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}
-
-Things seem to run smoothly in QA. New releases get grouped with old works
-correctly, no duplication obvious.
-
-In prod, loaded just the first file as a start, waiting to see if auto-ingest
-happens. Looks like yes! Great that everything is so smooth. All seem to be new
-captures.
-
-In production prod elasticsearch, 2,377,645 arxiv releases before this
-updated import, 741,033 with files attached. Guessing about 150k new releases,
-but will check.
-
-Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
-781,122 with fulltext.
-
-## Pubmed QA
-
-Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
-
- gunzip *.xml.gz
-
-Run importer:
-
- export FATCAT_AUTH_WORKER_PUBMED=...
-
- ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
- # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})
-
-Noticed that `release_year` was not getting set for many releases. Made a small
-code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and trying another:
-
- time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
- # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})
-
- real 30m45.044s
- user 16m43.672s
- sys 0m10.792s
-
- time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
-More errors:
-
- HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
- HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
- HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}
-
-
- 10.1037//0002-9432.72.1.50
- BOGUS DOI: 10.1037//0021-843x.106.2.266
- BOGUS DOI: 10.1037//0021-843x.106.2.280
- => actual ok? at least redirect ok
-
- unparsable medline date, skipping: Summer 2018
-
-TODO:
-x fix bad DOI error (real error, skip these)
-x remove newline after "unparsable medline date" error
-x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
-
-NOTE: Remember having run through the entire baseline in QA, but didn't save the command or output.
-
-## Pubmed Prod (2020-01-17)
-
-This is after adding a flag to enforce no updates at all, only new releases.
-Will likely revisit and run through with updates that add important metadata
-like exact references matches for older releases, after doing release
-merge/group cleanups.
-
-
- # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
- # had a trivial typo in fatcat_import.py, will push a fix
- export FATCAT_AUTH_WORKER_PUBMED=...
- time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
-Full run:
-
- fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
- [...]
- Command exited with non-zero status 2
- 1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
- 486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps
-
- => so apparently 2x tasks failed
- => 1271708 = 353 hours... but what walltime? about 31-32 hours if divide by CPU
-
-Only received a single exception at:
-
- Jan 18, 2020 8:33:09 AM UTC
- /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
- MalformedExternalId: 10.4149/gpb¬_2017042
-
-Not sure what the other failure was... maybe an invalid filename or argument,
-before processing actually started? Or some failure (OOM) that prevented sentry
-reporting?
-
-Patch normal.py and re-run that single file:
-
- ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- [...]
- Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})
-
-Done!
-
-## Chocula
-
-Command:
-
- export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
- ./fatcat_import.py chocula /srv/fatcat/datasets/export_fatcat.2019-12-26.json
-
-Result:
-
- Counter({'total': 144455, 'exists': 139807, 'insert': 2384, 'skip': 2264, 'skip-unknown-new-issnl': 2264, 'exists-by-issnl': 306, 'update': 0})