diff options
Diffstat (limited to 'notes')
-rw-r--r-- | notes/bulk_edits/2019-12-20_updates.md | 47 | ||||
-rw-r--r-- | notes/bulk_edits/2020_datacite.md | 152 | ||||
-rw-r--r-- | notes/bulk_edits/CHANGELOG.md | 7 |
3 files changed, 205 insertions, 1 deletions
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md index 83c8d9da..bd069a7a 100644 --- a/notes/bulk_edits/2019-12-20_updates.md +++ b/notes/bulk_edits/2019-12-20_updates.md @@ -34,7 +34,7 @@ but will check. Up to 2,531,542 arxiv releases, so only 154k or so new releases created. 781,122 with fulltext. -## Pubmed +## Pubmed QA Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020> @@ -80,6 +80,51 @@ x fix bad DOI error (real error, skip these) x remove newline after "unparsable medline date" error x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning +NOTE: Remember having run through the entire baseline in QA, but didn't save the command or output. + +## Pubmed Prod (2020-01-17) + +This is after adding a flag to enforce no updates at all, only new releases. +Will likely revisit and run through with updates that add important metadata +like exact references matches for older releases, after doing release +merge/group cleanups. + + + # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec + # had a trivial typo in fatcat_import.py, will push a fix + export FATCAT_AUTH_WORKER_PUBMED=... + time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt + +Full run: + + fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt + + [...] + Command exited with non-zero status 2 + 1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k + 486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps + + => so apparently 2x tasks failed + => 1271708 = 353 hours... but what walltime? about 31-32 hours if divide by CPU + +Only received a single exception at: + + Jan 18, 2020 8:33:09 AM UTC + /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml + MalformedExternalId: 10.4149/gpb¬_2017042 + +Not sure what the other failure was... maybe an invalid filename or argument, +before processing actually started? Or some failure (OOM) that prevented sentry +reporting? + +Patch normal.py and re-run that single file: + + ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt + [...] + Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0}) + +Done! + ## Chocula Command: diff --git a/notes/bulk_edits/2020_datacite.md b/notes/bulk_edits/2020_datacite.md new file mode 100644 index 00000000..005841ae --- /dev/null +++ b/notes/bulk_edits/2020_datacite.md @@ -0,0 +1,152 @@ + + +## QA Runs + +Trying on 2019-12-22, using Martin commit 18d411087007a30fbf027b87e30de42344119f0c from 2019-12-20. + +Quick test: + + # this branch adds some new deps, so make sure to install them + pipenv install --deploy --dev + pipenv shell + export FATCAT_AUTH_WORKER_DATACITE="..." + xzcat /srv/fatcat/datasets/datacite.ndjson.xz | head -n100 | ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 + +ISSUE: `--extid-map-file` not passed through, so drop the: + + --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 + +ISSUE: auth_var should be FATCAT_AUTH_WORKER_DATACITE + +Test full parallel command: + + export FATCAT_AUTH_WORKER_DATACITE="..." + time xzcat /srv/fatcat/datasets/datacite.ndjson.xz | head -n10000 | parallel -j20 --round-robin --pipe ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 + + real 0m30.017s + user 3m5.576s + sys 0m19.640s + +Whole lot of: + + invalid literal for int() with base 10: '10,495' + invalid literal for int() with base 10: '11,129' + + invalid literal for int() with base 10: 'n/a' + invalid literal for int() with base 10: 'n/a' + + invalid literal for int() with base 10: 'OP98' + invalid literal for int() with base 10: 'OP208' + + no mapped type: None + no mapped type: None + no mapped type: None + +Re-ran above: + + real 0m27.764s + user 3m2.448s + sys 0m12.908s + +Compare with `--lang-detect`: + + real 0m27.395s + user 3m5.620s + sys 0m13.344s + +Not noticable? + +Whole run: + + export FATCAT_AUTH_WORKER_DATACITE="..." + time xzcat /srv/fatcat/datasets/datacite.ndjson.xz | parallel -j20 --round-robin --pipe ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 + + real 35m21.051s + user 98m57.448s + sys 7m9.416s + +Huh. Kind of suspiciously fast. + + select count(*) from editgroup where editor_id='07445cd2-cab2-4da5-9f84-34588b7296aa'; + => 9952 editgroups + + select count(*) from release_edit inner join editgroup on release_edit.editgroup_id = editgroup.id where editgroup.editor_id='07445cd2-cab2-4da5-9f84-34588b7296aa'; + => 496,342 edits + +While running: + + starting around 5k TPS in pg_activity + starting size: 367.58G + (this is after arxiv and some other changes on top of 2019-12-13 dump) + host doing a load average of about 5.5; fatcatd at 115% CPU + + ending size: 371.43G + +Actually seems like extremely few DOIs getting inserted? Hrm. + + xzcat /srv/fatcat/datasets/datacite.ndjson.xz | wc -l + => 18,210,075 + +Last DOIs inserted were around: 10.7916/d81v6rqr + +Suspect a bunch of errors or something and output getting mangled by all the +logging? Squelched logging and running again (using same DB/config), except +with `pv -l` inserted after `xzcat`. + +Seem to run at a couple hundred records a second (very volatile). + + Counter({'total': 42919, 'insert': 21579, 'exists': 21334, 'skip': 6, 'skip-blank-title': 6, 'inserted.container': 1, 'update': 0}) + Counter({'total': 43396, 'insert': 23274, 'exists': 20120, 'skip-blank-title': 2, 'skip': 2, 'update': 0}) + +Ok! The actual errors: + + + Traceback (most recent call last): + File "./fatcat_import.py", line 507, in <module> + main() + File "./fatcat_import.py", line 504, in main + args.func(args) + File "./fatcat_import.py", line 182, in run_datacite + JsonLinePusher(dci, args.json_file).run() + File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 559, in run + self.importer.push_record(record) + File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 318, in push_record + entity = self.parse_record(raw_record) + File "/srv/fatcat/src/python/fatcat_tools/importers/datacite.py", line 447, in parse_record + sha1 = hashlib.sha1(text.encode('utf-8')).hexdigest() + AttributeError: 'list' object has no attribute 'encode' + + fatcat_openapi_client.exceptions.ApiException: (400) + Reason: Bad Request + HTTP response headers: HTTPHeaderDict({'Content-Length': '186', 'Content-Type': 'application/json', 'Date': 'Mon, 23 Dec 2019 08:12:16 GMT', 'X-Clacks-Overhead': 'GNU aaronsw, jpb', 'X-Span-ID': '73b0b698-bf88-4721-b869-b322dbe90cbe'}) + HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.17167/mksz.2017.2.129–155"} + + + Traceback (most recent call last): + File "./fatcat_import.py", line 507, in <module> + main() + File "./fatcat_import.py", line 504, in main + args.func(args) + File "./fatcat_import.py", line 182, in run_datacite + JsonLinePusher(dci, args.json_file).run() + File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 559, in run + self.importer.push_record(record) + File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 318, in push_record + entity = self.parse_record(raw_record) + File "/srv/fatcat/src/python/fatcat_tools/importers/datacite.py", line 447, in parse_record + sha1 = hashlib.sha1(text.encode('utf-8')).hexdigest() + AttributeError: 'list' object has no attribute 'encode' + + + fatcat_openapi_client.exceptions.ApiException: (400) + Reason: Bad Request + HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Span-ID': 'ca141ff4-83f7-4ee5-9256-91b23ec09e94', 'Content-Length': '188', 'X-Clacks-Overhead': 'GNU aaronsw, jpb', 'Date': 'Mon, 23 Dec 2019 08:11:25 GMT'}) + HTTP response body: {"success":false,"error":"ConstraintViolation","message":"unexpected database error: new row for relation \"release_contrib\" violates check constraint \"release_contrib_raw_name_check\""} + +## Prod Import + +Around first/second week of january. Needed to restart at least once due to +database deadlock on abstract inserts, which seems to be due to parallelism and +duplicated records in the bulk datacite dump. + +TODO: specific command used by martin diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md index 2db0c72d..172528da 100644 --- a/notes/bulk_edits/CHANGELOG.md +++ b/notes/bulk_edits/CHANGELOG.md @@ -14,6 +14,13 @@ This file should not turn in to a TODO list! Imported around 2,500 new containers (journals, by ISSN-L) from chocula analysis script. +Imported DOIs from Datacite (around 16 million, plus or minus a couple +million). + +Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import +included only new Pubmed works cataloged in 2019 (up until December or so). +Only a few hundred thousand new release entities. + ## 2019-12 Started continuous harvesting Datacite DOI metadata; first date harvested was |