author     Bryan Newbold <bnewbold@robocracy.org>  2020-12-17 22:07:51 -0800
committer  Bryan Newbold <bnewbold@robocracy.org>  2020-12-17 22:08:28 -0800
commit     753407d23dcc507b4f933a0a062ef81f5ffc72da (patch)
tree       6c054222ccc23851d29d21107307abfe44b47205 /notes/bulk_edits
parent     60e022609cd3fbbf9634577149018592e680858d (diff)
DOAJ import notes
Diffstat (limited to 'notes/bulk_edits')
-rw-r--r--  notes/bulk_edits/2020-12-14_doaj.md  23
-rw-r--r--  notes/bulk_edits/CHANGELOG.md         2
2 files changed, 23 insertions, 2 deletions
diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
index 7e746082..64a80fda 100644
--- a/notes/bulk_edits/2020-12-14_doaj.md
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -100,6 +100,25 @@ Will shuffle the entire file, import in a single thread, and just skip
importing if there is any fuzzy match (not try to merge/update). Expecting
about 500k new releases after such filtering.
- # full run (TODO)
- zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+Ok, on 2020-12-17, back with patches to use fuzzycat in filtering. Trying
+another batch:
+
+ # git rev: 60e022609cd3fbbf9634577149018592e680858d
+ # DB before: Size: 678.47G
+
+ export FATCAT_AUTH_WORKER_DOAJ=...
+
+ zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n1000 | tail -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ => Counter({'total': 100, 'exists': 71, 'insert': 19, 'exists-fuzzy': 10, 'skip': 0, 'update': 0})
+
+ # https://fatcat.wiki/changelog/5033496
+
+Sampled 10x of these and they look much better: no obvious duplication. Going
+ahead with the full import; note that other ingest is happening in parallel
+(many crossref, datacite, and pubmed imports which had backed up).
+
+ # full run
+ # note the shuf command added, in an attempt to reduce duplicates within this corpus
+ zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ # started 2020-12-17 22:01 (Pacific)
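For context, here is a minimal, self-contained Python sketch of the "skip if there is
any fuzzy match" filtering strategy described in the notes above. It is not the actual
fatcat_import.py / fuzzycat code: the `lookup_release_by_doi`, `find_fuzzy_matches`,
and `insert_release` helpers are hypothetical stubs, and the record parsing assumes
DOAJ's bibjson layout (a title plus an identifier list).

    # Hedged sketch only; all three helpers are placeholders for the real
    # importer / fuzzycat plumbing.
    import json
    import sys
    from collections import Counter

    def lookup_release_by_doi(doi):
        # placeholder: the real importer does an exact identifier lookup first
        return None

    def find_fuzzy_matches(title):
        # placeholder: the real filtering uses fuzzycat to find similar existing releases
        return []

    def insert_release(release):
        # placeholder: the real importer creates the entity via the API within an editgroup
        pass

    def run_doaj_import(lines):
        counts = Counter()
        for line in lines:
            counts["total"] += 1
            bibjson = json.loads(line).get("bibjson", {})
            title = bibjson.get("title")
            doi = next((i.get("id") for i in bibjson.get("identifier", [])
                        if (i.get("type") or "").lower() == "doi"), None)
            if not title:
                counts["skip"] += 1
                continue
            if doi and lookup_release_by_doi(doi):
                counts["exists"] += 1
                continue
            if find_fuzzy_matches(title):
                # any fuzzy match at all means skip; no merge/update is attempted
                counts["exists-fuzzy"] += 1
                continue
            insert_release({"title": title, "doi": doi})
            counts["insert"] += 1
        return counts

    if __name__ == "__main__":
        print(run_doaj_import(sys.stdin))

Piping a sample of the dump through a script like this would print a Counter in the
same shape as the importer output quoted above.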
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index bef25e84..5f25d769 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -13,6 +13,8 @@ This file should not turn into a TODO list!
Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
+Imported DOAJ article metadata from a 2020-11 dump.
+
## 2020-03
Started harvesting both Arxiv and Pubmed metadata daily and importing to