Diffstat (limited to 'notes')
-rw-r--r--  notes/bulk_edits/2019-12-20_orcid.md  43
-rw-r--r--  notes/bulk_edits/CHANGELOG.md         19
2 files changed, 49 insertions, 13 deletions
diff --git a/notes/bulk_edits/2019-12-20_orcid.md b/notes/bulk_edits/2019-12-20_orcid.md
new file mode 100644
index 00000000..33dde32f
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_orcid.md
@@ -0,0 +1,43 @@
+
+Newer ORCID dumps are XML, not JSON. But there is a conversion tool!
+
+    https://github.com/ORCID/orcid-conversion-lib
+
+Commands:
+
+    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
+    java -jar orcid-conversion-lib-0.0.2-full.jar OPTIONS
+
+    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2019_summaries.tar.gz -v v3_0rc1 -o ORCID_2019_summaries_json.tar.gz
+
+    # [...]
+    # Sat Dec 21 04:43:50 UTC 2019 done 7300000
+    # Sat Dec 21 04:44:08 UTC 2019 done 7310000
+    # Sat Dec 21 04:44:17 UTC 2019 finished  errors 0
+
+Importing in QA, ran in to some lines like:
+
+    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-0014-6598","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-3750-5654","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-1424-4826","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0002-5340-9665","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+
+Needed to patch to filter those out. Then ran ok like:
+
+    zcat /srv/fatcat/datasets/ORCID_2019_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
+    Counter({'total': 10000, 'exists': 5323, 'insert': 4493, 'skip': 184, 'skip-no-person': 160, 'update': 0})
+
+New dump is about 7.3 million rows, so expecting about 3.2 million new
+entities, 250k skips.
+
+Doing bulk run like:
+
+    time zcat /srv/fatcat/datasets/ORCID_2019_summaries.json.gz | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
+
+Prod timing:
+
+    Counter({'total': 910643, 'exists': 476812, 'insert': 416583, 'skip': 17248, 'update': 0})
+
+    real    47m27.658s
+    user    245m44.272s
+    sys     14m50.836s
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 80760938..773d09ef 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -11,6 +11,12 @@ This file should not turn in to a TODO list!
 
 ## 2019-12
 
+Started continuous harvesting Datacite DOI metadata; first date harvested was
+`2019-12-13`. No importer running yet.
+
+Imported about 3.3m new ORCID identifiers from 2019 bulk dump (after converting
+from XML to JSON): <https://archive.org/details/orcid-dump-2019>
+
 Inserted about 154k new arxiv release entities. Still no automatic daily
 harvesting.
 
@@ -45,22 +51,9 @@ invalid ISSN checksum).
 Imported files (matched to releases by DOI) from Semantic Scholar
 (`DIRECT-OA-CRAWL-2019` crawl).
 
-    Arabesque importer
-    crawl-bot
-    `s2_doi.sqlite`
-    TODO: archive.org link
-    TODO: rough count
-    TODO: date
-
 Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded
 by a user to archive.org.
 
-    Matched importer
-    internetarchive-bot (TODO:)
-    TODO: archive.org link
-    TODO: counts
-    TODO: date
-
 Imported files (matched to releases by DOI) from CORE.ac.uk
 (`DIRECT-OA-CRAWL-2019` crawl).
 
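
The ORCID notes in this commit say the importer "needed to patch to filter those out", referring to the 409 "record is locked" error objects that appear in the converted dump in place of profiles. The commit does not show that patch. The following is only a minimal sketch of one way such a filter could work, assuming that error lines can be recognized by having both a `response-code` and an `error-code` key, as the four sample lines do. The function names `is_error_record` and `filter_records` are hypothetical and are not taken from `fatcat_import.py`.

```python
import json


def is_error_record(line: str) -> bool:
    """Return True if a JSON line looks like an ORCID API error response.

    Assumption: error objects carry both "response-code" and "error-code",
    matching the locked-record samples quoted in the notes.
    """
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        # In this sketch, unparseable lines are also treated as skippable.
        return True
    return isinstance(obj, dict) and "response-code" in obj and "error-code" in obj


def filter_records(lines):
    """Yield only non-empty lines that are not error responses."""
    for line in lines:
        line = line.strip()
        if line and not is_error_record(line):
            yield line
```

A filter like this could sit between `zcat` and the importer, or inside the importer's per-line parsing step; which of the two the real patch did is not stated in the notes.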

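
The estimate in the ORCID notes ("about 3.2 million new entities, 250k skips") can be reproduced by scaling the 10k-sample `Counter` up to the 7.3 million row dump. The sketch below assumes simple linear scaling, and assumes that the 250k figure combines the `skip` and `skip-no-person` counts. That second assumption is an inference from the numbers, not something the notes state.

```python
# Counts from the QA run on the 10k sample, copied from the notes.
sample = {
    "total": 10000,
    "exists": 5323,
    "insert": 4493,
    "skip": 184,
    "skip-no-person": 160,
    "update": 0,
}
full_dump_rows = 7_300_000  # "about 7.3 million rows"


def extrapolate(count: int) -> int:
    """Scale a sample count linearly to the full dump (integer arithmetic)."""
    return count * full_dump_rows // sample["total"]


expected_inserts = extrapolate(sample["insert"])
# Assumption: the notes' "250k skips" sums both skip categories.
expected_skips = extrapolate(sample["skip"] + sample["skip-no-person"])

print(expected_inserts, expected_skips)  # prints: 3279890 251120
```

The scaled insert count of roughly 3.28 million sits between the "about 3.2 million" in the notes and the "about 3.3m" recorded in the CHANGELOG entry.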