| author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
|---|---|---|
| committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
| commit | c32154f2875a7fb9aac727013e1475cdd811e180 (patch) | |
| tree | f0e061498a101fa824995fb6ec9f91e7e44257e1 /notes/bulk_edits/2019-12-20_updates.md | |
| parent | c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff) | |
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'notes/bulk_edits/2019-12-20_updates.md')
-rw-r--r-- | notes/bulk_edits/2019-12-20_updates.md | 137 |
1 files changed, 0 insertions, 137 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
deleted file mode 100644
index bd069a7a..00000000
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ /dev/null
@@ -1,137 +0,0 @@

## Arxiv

Used the metha-sync tool to update. Then went into the raw storage directory
(as opposed to using `metha-cat`) and plucked out the weekly files updated
since the last import. Created a tarball and uploaded it to:

    https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz

Downloaded, extracted, then unzipped:

    gunzip *.gz

Run importer:

    export FATCAT_AUTH_WORKER_ARXIV=...

    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

Things ran smoothly in QA. New releases get grouped with existing works
correctly, with no obvious duplication.

In prod, loaded just the first file as a start, waiting to see if auto-ingest
happens. Looks like yes! Great that everything is so smooth. All seem to be
new captures.

In prod elasticsearch, there were 2,377,645 arxiv releases before this update
import, 741,033 of them with files attached. Guessing about 150k new releases,
but will check.

Afterwards: up to 2,531,542 arxiv releases, so only 154k or so new releases
created; 781,122 now have fulltext.

## Pubmed QA

Grabbed the fresh 2020 baseline, released in December 2019:
<https://archive.org/details/pubmed_medline_baseline_2020>

    gunzip *.xml.gz

Run importer:

    export FATCAT_AUTH_WORKER_PUBMED=...

    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})

Noticed that `release_year` was not getting set for many releases. Made a
small code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and tried
another file:

    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})

    real    30m45.044s
    user    16m43.672s
    sys     0m10.792s

Then the full run over all files:

    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

More errors:

    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}

Some double-slash DOIs looked bogus at first:

    10.1037//0002-9432.72.1.50
    BOGUS DOI: 10.1037//0021-843x.106.2.266
    BOGUS DOI: 10.1037//0021-843x.106.2.280
    => actually ok? at least the redirects resolve
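Most of these rejections are DOIs with embedded whitespace or similar junk;
the API refuses any external identifier that fails its syntax pattern (hence
`MalformedExternalId`). A minimal sketch of that kind of check in Python,
assuming a simplified pattern (the authoritative regex lives in the fatcat API
server and may well be stricter):

    import re

    # Simplified stand-in for the API-side DOI syntax check; the real
    # fatcatd regex may differ (this pattern is an assumption).
    DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")

    for doi in [
        "10.1234/aksjdfh",                               # the error message's example
        "10.3760/cma. j. issn.2095-4352. 2014. 07.014",  # embedded spaces
        "10.23750/abm.v88i2 -s.6506",                    # embedded space
        "10.1037//0002-9432.72.1.50",                    # double slash, but resolves
    ]:
        print("ok " if DOI_PATTERN.match(doi) else "BAD", doi)

Note that the double-slash DOIs pass this sort of check, consistent with them
redirecting fine in practice.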
Also:

    unparsable medline date, skipping: Summer 2018

TODO:
x fix bad DOI error (real error, skip these)
x remove newline after "unparsable medline date" error
x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning

NOTE: Remember having run through the entire baseline in QA, but didn't save
the command or output.

## Pubmed Prod (2020-01-17)

This is after adding a flag to enforce no updates at all, only new releases.
Will likely revisit and re-run with updates that add important metadata, like
exact reference matches for older releases, after doing release merge/group
cleanups.

    # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
    # had a trivial typo in fatcat_import.py, will push a fix
    export FATCAT_AUTH_WORKER_PUBMED=...
    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

Full run:

    fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

    [...]
    Command exited with non-zero status 2
    1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
    486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps

    => parallel's exit status counts failed jobs, so apparently 2 tasks failed
    => 1271708 user-seconds is about 353 CPU-hours; the `elapsed` field gives
       the walltime directly: 31:42:15, which matches ~353 hours at 1134% CPU

Only received a single exception, at:

    Jan 18, 2020 8:33:09 AM UTC
    /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
    MalformedExternalId: 10.4149/gpb¬_2017042

Not sure what the other failure was... maybe an invalid filename or argument
before processing actually started? Or some failure (OOM) that prevented
sentry reporting?

Patched normal.py and re-ran that single file:

    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    [...]
    Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})

Done!

## Chocula

Command:

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
    ./fatcat_import.py chocula /srv/fatcat/datasets/export_fatcat.2019-12-26.json

Result:

    Counter({'total': 144455, 'exists': 139807, 'insert': 2384, 'skip': 2264, 'skip-unknown-new-issnl': 2264, 'exists-by-issnl': 306, 'update': 0})
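Aside: every pubmed invocation above passes the ISSN-to-ISSN-L.txt mapping
file, and the chocula counters are keyed by ISSN-L. A minimal sketch of
loading that mapping, assuming the standard two-column TSV layout distributed
by issn.org (a header row, then one ISSN<tab>ISSN-L pair per line); the actual
importer code may handle this differently:

    def read_issn_map(path):
        """Load ISSN -> ISSN-L lookups from a two-column TSV file."""
        issn_map = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("ISSN"):
                    continue  # skip blank lines and the header row
                issn, issnl = line.split("\t")[:2]
                issn_map[issn] = issnl
        return issn_map

    issn_map = read_issn_map("/srv/fatcat/datasets/ISSN-to-ISSN-L.txt")
    print(len(issn_map), "ISSN -> ISSN-L mappings loaded")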