author | Bryan Newbold <bnewbold@robocracy.org> | 2020-12-17 22:07:51 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-12-17 22:08:28 -0800 |
commit | 753407d23dcc507b4f933a0a062ef81f5ffc72da (patch) | |
tree | 6c054222ccc23851d29d21107307abfe44b47205 | |
parent | 60e022609cd3fbbf9634577149018592e680858d (diff) | |
DOAJ import notes
-rw-r--r-- | notes/bulk_edits/2020-12-14_doaj.md | 23
-rw-r--r-- | notes/bulk_edits/CHANGELOG.md | 2 |
2 files changed, 23 insertions, 2 deletions
diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
index 7e746082..64a80fda 100644
--- a/notes/bulk_edits/2020-12-14_doaj.md
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -100,6 +100,25 @@ Will shuffle the entire file, import in a single thread, and just skip
 importing if there is any fuzzy match (not try to merge/update). Expecting about
 500k new releases after such filtering.
 
-    # full run (TODO)
-    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+Ok, on 2020-12-17, back with patches to use fuzzycat in filtering. Trying
+another batch:
+
+    # git rev: 60e022609cd3fbbf9634577149018592e680858d
+    # DB before: Size: 678.47G
+
+    export FATCAT_AUTH_WORKER_DOAJ=...
+
+    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n1000 | tail -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    => Counter({'total': 100, 'exists': 71, 'insert': 19, 'exists-fuzzy': 10, 'skip': 0, 'update': 0})
+
+    # https://fatcat.wiki/changelog/5033496
+
+Sampled 10x of these and they look much better: no obvious duplication. Going
+ahead with the full import; note that other ingest is happening in parallel
+(many crossref, datacite, and pubmed imports which backed up).
+
+    # full run
+    # note the shuf command added, in an attempt to reduce duplicates within this corpus
+    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    # started 2020-12-17 22:01 (Pacific)
 
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index bef25e84..5f25d769 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -13,6 +13,8 @@ This file should not turn in to a TODO list!
 
 Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
 
+Imported DOAJ article metadata from a 2020-11 dump.
+
 ## 2020-03
 
 Started harvesting both Arxiv and Pubmed metadata daily and importing to
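
The sample-batch counter quoted in the notes can be read as outcome shares. A minimal sketch in Python, assuming the keys mean what the notes suggest (`exists`: already in the catalog, `exists-fuzzy`: skipped because of a fuzzy match, `insert`: newly created). `outcome_shares` is a hypothetical helper written for illustration and is not part of fatcat:

```python
from collections import Counter

# Counts copied from the sample-batch output quoted in the notes.
sample = Counter({'total': 100, 'exists': 71, 'insert': 19,
                  'exists-fuzzy': 10, 'skip': 0, 'update': 0})


def outcome_shares(counts):
    """Return each outcome's share of 'total' (hypothetical helper).

    Raises ValueError if the per-outcome counts do not add up to 'total',
    which would suggest the counter was copied or parsed incorrectly.
    """
    total = counts['total']
    outcomes = {k: v for k, v in counts.items() if k != 'total'}
    if sum(outcomes.values()) != total:
        raise ValueError("outcome counts do not sum to total")
    return {k: v / total for k, v in outcomes.items()}


shares = outcome_shares(sample)

# Print outcomes from most to least frequent.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12}: {share:.0%}")
```

On this 100-record sample, 19% of records were inserted and 10% were skipped as fuzzy matches. A sample this small says little about the rate to expect from the full corpus.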