-rw-r--r--  fatcat-openapi2.yml                       4
-rw-r--r--  notes/bulk_edits/2019-12-20_updates.md   47
-rw-r--r--  notes/bulk_edits/2020_datacite.md       152
-rw-r--r--  notes/bulk_edits/CHANGELOG.md             7
-rw-r--r--  python/fatcat_tools/normal.py             9
5 files changed, 216 insertions, 3 deletions
diff --git a/fatcat-openapi2.yml b/fatcat-openapi2.yml
index 0cf2fb9b..5e54cc13 100644
--- a/fatcat-openapi2.yml
+++ b/fatcat-openapi2.yml
@@ -1442,7 +1442,7 @@ paths:
Create a single Container entity as part of an existing editgroup.
Editgroup must be mutable (aka, not accepted) and editor must have
- permission (aka, have created the editgrou p or have `admin` role).
+ permission (aka, have created the editgroup or have `admin` role).
parameters:
- name: entity
in: body
@@ -1533,7 +1533,7 @@ paths:
description: |
Updates an existing entity as part of a specific (existing) editgroup.
The editgroup must be open for updates (aka, not accepted/merged), and
- the editor making the requiest must have permissions (aka, must have
+ the editor making the request must have permissions (aka, must have
created the editgroup or have `admin` role).
This method can also be used to update an existing entity edit as part
diff --git a/notes/bulk_edits/2019-12-20_updates.md b/notes/bulk_edits/2019-12-20_updates.md
index 83c8d9da..bd069a7a 100644
--- a/notes/bulk_edits/2019-12-20_updates.md
+++ b/notes/bulk_edits/2019-12-20_updates.md
@@ -34,7 +34,7 @@ but will check.
Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
781,122 with fulltext.
-## Pubmed
+## Pubmed QA
Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>
@@ -80,6 +80,51 @@ x fix bad DOI error (real error, skip these)
x remove newline after "unparsable medline date" error
x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
+NOTE: Remember having run through the entire baseline in QA, but didn't save the command or output.
+
+## Pubmed Prod (2020-01-17)
+
+This is after adding a flag to enforce no updates at all, only new releases.
+Will likely revisit and run through with updates that add important metadata
+like exact reference matches for older releases, after doing release
+merge/group cleanups.
+
+
+ # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
+ # had a trivial typo in fatcat_import.py, will push a fix
+ export FATCAT_AUTH_WORKER_PUBMED=...
+ time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+Full run:
+
+ fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+ [...]
+ Command exited with non-zero status 2
+ 1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
+ 486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps
+
+ => so apparently 2x tasks failed
+ => 1271708 = 353 hours... but what walltime? about 31-32 hours if divide by CPU
+
+Only received a single exception at:
+
+ Jan 18, 2020 8:33:09 AM UTC
+ /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
+ MalformedExternalId: 10.4149/gpb¬_2017042
+
+Not sure what the other failure was... maybe an invalid filename or argument,
+before processing actually started? Or some failure (OOM) that prevented sentry
+reporting?
+
+Patch normal.py and re-run that single file:
+
+ ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ [...]
+ Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})
+
+Done!
+
## Chocula
Command:
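A note on the `time` output from the full Pubmed run in this file: in GNU `time` output the `elapsed` field is already the wall-clock duration, so the `31:42:15` figure answers the walltime question directly. The sketch below only re-derives the CPU-hours and average-parallelism figures from the three numbers in that output; the variable names are illustrative and not part of any fatcat tooling.

```python
# Figures copied from the `time` output of the full Pubmed run.
user_s = 1271708.20                       # CPU seconds spent in user mode
sys_s = 23689.44                          # CPU seconds spent in kernel mode
elapsed_s = 31 * 3600 + 42 * 60 + 15      # 31:42:15 wall-clock, in seconds

# Total user CPU time in hours (the notes round this to 353 hours).
cpu_hours = user_s / 3600

# Average number of busy cores; `time` reports this as a percentage (1134%).
avg_cores = (user_s + sys_s) / elapsed_s

print(f"user CPU hours: {cpu_hours:.1f}")
print(f"average busy cores: {avg_cores:.2f}")
```

With `parallel -j16`, an average of roughly 11 busy cores suggests the workers spent part of the run waiting on the API or database rather than on CPU, though the notes do not confirm where the time went.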
diff --git a/notes/bulk_edits/2020_datacite.md b/notes/bulk_edits/2020_datacite.md
new file mode 100644
index 00000000..005841ae
--- /dev/null
+++ b/notes/bulk_edits/2020_datacite.md
@@ -0,0 +1,152 @@
+
+
+## QA Runs
+
+Trying on 2019-12-22, using Martin commit 18d411087007a30fbf027b87e30de42344119f0c from 2019-12-20.
+
+Quick test:
+
+ # this branch adds some new deps, so make sure to install them
+ pipenv install --deploy --dev
+ pipenv shell
+ export FATCAT_AUTH_WORKER_DATACITE="..."
+ xzcat /srv/fatcat/datasets/datacite.ndjson.xz | head -n100 | ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
+
+ISSUE: `--extid-map-file` not passed through, so drop the:
+
+ --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
+
+ISSUE: auth_var should be FATCAT_AUTH_WORKER_DATACITE
+
+Test full parallel command:
+
+ export FATCAT_AUTH_WORKER_DATACITE="..."
+ time xzcat /srv/fatcat/datasets/datacite.ndjson.xz | head -n10000 | parallel -j20 --round-robin --pipe ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
+
+ real 0m30.017s
+ user 3m5.576s
+ sys 0m19.640s
+
+Whole lot of:
+
+ invalid literal for int() with base 10: '10,495'
+ invalid literal for int() with base 10: '11,129'
+
+ invalid literal for int() with base 10: 'n/a'
+ invalid literal for int() with base 10: 'n/a'
+
+ invalid literal for int() with base 10: 'OP98'
+ invalid literal for int() with base 10: 'OP208'
+
+ no mapped type: None
+ no mapped type: None
+ no mapped type: None
+
+Re-ran above:
+
+ real 0m27.764s
+ user 3m2.448s
+ sys 0m12.908s
+
+Compare with `--lang-detect`:
+
+ real 0m27.395s
+ user 3m5.620s
+ sys 0m13.344s
+
+Not noticeable?
+
+Whole run:
+
+ export FATCAT_AUTH_WORKER_DATACITE="..."
+ time xzcat /srv/fatcat/datasets/datacite.ndjson.xz | parallel -j20 --round-robin --pipe ./fatcat_import.py datacite - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
+
+ real 35m21.051s
+ user 98m57.448s
+ sys 7m9.416s
+
+Huh. Kind of suspiciously fast.
+
+ select count(*) from editgroup where editor_id='07445cd2-cab2-4da5-9f84-34588b7296aa';
+ => 9952 editgroups
+
+ select count(*) from release_edit inner join editgroup on release_edit.editgroup_id = editgroup.id where editgroup.editor_id='07445cd2-cab2-4da5-9f84-34588b7296aa';
+ => 496,342 edits
+
+While running:
+
+ starting around 5k TPS in pg_activity
+ starting size: 367.58G
+ (this is after arxiv and some other changes on top of 2019-12-13 dump)
+ host doing a load average of about 5.5; fatcatd at 115% CPU
+
+ ending size: 371.43G
+
+Actually seems like extremely few DOIs getting inserted? Hrm.
+
+ xzcat /srv/fatcat/datasets/datacite.ndjson.xz | wc -l
+ => 18,210,075
+
+Last DOIs inserted were around: 10.7916/d81v6rqr
+
+Suspect a bunch of errors or something and output getting mangled by all the
+logging? Squelched logging and running again (using same DB/config), except
+with `pv -l` inserted after `xzcat`.
+
+Seem to run at a couple hundred records a second (very volatile).
+
+ Counter({'total': 42919, 'insert': 21579, 'exists': 21334, 'skip': 6, 'skip-blank-title': 6, 'inserted.container': 1, 'update': 0})
+ Counter({'total': 43396, 'insert': 23274, 'exists': 20120, 'skip-blank-title': 2, 'skip': 2, 'update': 0})
+
+Ok! The actual errors:
+
+
+ Traceback (most recent call last):
+ File "./fatcat_import.py", line 507, in <module>
+ main()
+ File "./fatcat_import.py", line 504, in main
+ args.func(args)
+ File "./fatcat_import.py", line 182, in run_datacite
+ JsonLinePusher(dci, args.json_file).run()
+ File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 559, in run
+ self.importer.push_record(record)
+ File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 318, in push_record
+ entity = self.parse_record(raw_record)
+ File "/srv/fatcat/src/python/fatcat_tools/importers/datacite.py", line 447, in parse_record
+ sha1 = hashlib.sha1(text.encode('utf-8')).hexdigest()
+ AttributeError: 'list' object has no attribute 'encode'
+
+ fatcat_openapi_client.exceptions.ApiException: (400)
+ Reason: Bad Request
+ HTTP response headers: HTTPHeaderDict({'Content-Length': '186', 'Content-Type': 'application/json', 'Date': 'Mon, 23 Dec 2019 08:12:16 GMT', 'X-Clacks-Overhead': 'GNU aaronsw, jpb', 'X-Span-ID': '73b0b698-bf88-4721-b869-b322dbe90cbe'})
+ HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.17167/mksz.2017.2.129–155"}
+
+
+ Traceback (most recent call last):
+ File "./fatcat_import.py", line 507, in <module>
+ main()
+ File "./fatcat_import.py", line 504, in main
+ args.func(args)
+ File "./fatcat_import.py", line 182, in run_datacite
+ JsonLinePusher(dci, args.json_file).run()
+ File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 559, in run
+ self.importer.push_record(record)
+ File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 318, in push_record
+ entity = self.parse_record(raw_record)
+ File "/srv/fatcat/src/python/fatcat_tools/importers/datacite.py", line 447, in parse_record
+ sha1 = hashlib.sha1(text.encode('utf-8')).hexdigest()
+ AttributeError: 'list' object has no attribute 'encode'
+
+
+ fatcat_openapi_client.exceptions.ApiException: (400)
+ Reason: Bad Request
+ HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Span-ID': 'ca141ff4-83f7-4ee5-9256-91b23ec09e94', 'Content-Length': '188', 'X-Clacks-Overhead': 'GNU aaronsw, jpb', 'Date': 'Mon, 23 Dec 2019 08:11:25 GMT'})
+ HTTP response body: {"success":false,"error":"ConstraintViolation","message":"unexpected database error: new row for relation \"release_contrib\" violates check constraint \"release_contrib_raw_name_check\""}
+
+## Prod Import
+
+Around the first/second week of January. Needed to restart at least once due to
+database deadlock on abstract inserts, which seems to be due to parallelism and
+duplicated records in the bulk datacite dump.
+
+TODO: specific command used by martin
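The `int()` errors recorded in the QA run above (`'10,495'`, `'n/a'`, `'OP98'`) suggest that some numeric-looking Datacite fields contain free text. A minimal sketch of tolerant parsing follows. `parse_int_loose` is a hypothetical helper, not the importer's actual code, and treating a comma as a thousands separator is an assumption that may be wrong for some fields (for example page ranges).

```python
import re

def parse_int_loose(raw):
    """Return an int for clean or comma-grouped digit strings, else None.

    Hypothetical helper for illustration; not part of fatcat_tools.
    """
    if raw is None or isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    # Assumption: commas are thousands separators, e.g. '10,495' -> 10495.
    text = str(raw).strip().replace(",", "")
    # Accept only plain ASCII digits; values like 'n/a' or 'OP98' yield None.
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None

for value in ("10,495", "n/a", "OP98", "208"):
    print(repr(value), "->", parse_int_loose(value))
```

Returning `None` instead of raising lets the caller skip the field and keep the rest of the record, which may or may not be the desired behavior for a given field.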
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 2db0c72d..172528da 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -14,6 +14,13 @@ This file should not turn in to a TODO list!
Imported around 2,500 new containers (journals, by ISSN-L) from chocula
analysis script.
+Imported DOIs from Datacite (around 16 million, plus or minus a couple
+million).
+
+Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import
+included only new Pubmed works cataloged in 2019 (up until December or so).
+Only a few hundred thousand new release entities.
+
## 2019-12
Started continuous harvesting Datacite DOI metadata; first date harvested was
diff --git a/python/fatcat_tools/normal.py b/python/fatcat_tools/normal.py
index 7b4bd19c..7a2b5fd9 100644
--- a/python/fatcat_tools/normal.py
+++ b/python/fatcat_tools/normal.py
@@ -40,6 +40,14 @@ def clean_doi(raw):
         raw = raw[11:]
     if raw[7:9] == "//":
         raw = raw[:8] + raw[9:]
+
+    # fatcatd uses same REGEX, but Rust regex rejects these characters, while
+    # python doesn't. DOIs are syntactically valid, but very likely to be typos;
+    # for now filter them out.
+    for c in ('¬', ):
+        if c in raw:
+            return None
+
     if not raw.startswith("10."):
         return None
     if not DOI_REGEX.fullmatch(raw):
@@ -56,6 +64,7 @@ def test_clean_doi():
     assert clean_doi("https://dx.doi.org/10.1234/asdf ") == "10.1234/asdf"
     assert clean_doi("doi:10.1234/asdf ") == "10.1234/asdf"
     assert clean_doi("doi:10.1234/ asdf ") == None
+    assert clean_doi("10.4149/gpb¬_2017042") == None # "logical negation" character
 ARXIV_ID_REGEX = re.compile("^(\d{4}.\d{4,5}|[a-z\-]+(\.[A-Z]{2})?/\d{7})(v\d+)?$")
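The `clean_doi` patch in this file filters a single character. The Datacite QA notes in this same commit also show the API rejecting a DOI containing an en dash (`10.17167/mksz.2017.2.129–155`), so one possible stricter variant is to reject any non-ASCII DOI outright. This is only a sketch under that assumption: `clean_doi_ascii_only` is a hypothetical name, the `DOI_REGEX` below is a stand-in because the real pattern is not visible in this diff, and the URL and `doi:` prefix stripping done by the real function is omitted. Rejecting all non-ASCII input could also discard DOIs that are technically valid.

```python
import re

# Stand-in pattern for illustration; the real DOI_REGEX is not shown in the diff.
DOI_REGEX = re.compile(r"^10\.\d{3,6}/\S+$")

def clean_doi_ascii_only(raw):
    """Return a cleaned DOI, or None if it looks malformed or is non-ASCII.

    Hypothetical variant of clean_doi(); prefix stripping is omitted here.
    """
    if not raw:
        return None
    raw = raw.strip()
    # Assumption: any non-ASCII character ('¬', en dash, ...) is a likely typo
    # and would be rejected by the API anyway. str.isascii needs Python 3.7+.
    if not raw.isascii():
        return None
    if not raw.startswith("10."):
        return None
    if not DOI_REGEX.fullmatch(raw):
        return None
    return raw

for doi in ("10.1234/asdf ", "10.4149/gpb¬_2017042", "10.17167/mksz.2017.2.129–155"):
    print(repr(doi), "->", clean_doi_ascii_only(doi))
```

The explicit character list in the actual patch is the more conservative choice, since it only drops inputs already known to fail; the blanket check trades that precision for not having to extend the list each time a new character turns up.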