From 753407d23dcc507b4f933a0a062ef81f5ffc72da Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 17 Dec 2020 22:07:51 -0800
Subject: DOAJ import notes

---
 notes/bulk_edits/2020-12-14_doaj.md | 23 +++++++++++++++++++++--
 notes/bulk_edits/CHANGELOG.md       |  2 ++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
index 7e746082..64a80fda 100644
--- a/notes/bulk_edits/2020-12-14_doaj.md
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -100,6 +100,25 @@ Will shuffle the entire file, import in a single thread, and just skip
 importing if there is any fuzzy match (not try to merge/update). Expecting
 about 500k new releases after such filtering.
 
-    # full run (TODO)
-    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+Ok, on 2020-12-17, back with patches to use fuzzycat in filtering. Trying
+another batch:
+
+    # git rev: 60e022609cd3fbbf9634577149018592e680858d
+    # DB before: Size: 678.47G
+
+    export FATCAT_AUTH_WORKER_DOAJ=...
+
+    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n1000 | tail -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    => Counter({'total': 100, 'exists': 71, 'insert': 19, 'exists-fuzzy': 10, 'skip': 0, 'update': 0})
+
+    # https://fatcat.wiki/changelog/5033496
+
+Sampled 10x of these and they look much better: no obvious duplication. Going
+ahead with the full import; note that other ingest is happening in parallel
+(many crossref, datacite, and pubmed imports which backed up).
+
+    # full run
+    # note the shuf command added, in an attempt to reduce duplicates within this corpus
+    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    # started 2020-12-17 22:01 (Pacific)
 
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index bef25e84..5f25d769 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -13,6 +13,8 @@ This file should not turn in to a TODO list!
 
 Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
 
+Imported DOAJ article metadata from a 2020-11 dump.
+
 ## 2020-03
 
 Started harvesting both Arxiv and Pubmed metadata daily and importing to
-- 
cgit v1.2.3