author     Bryan Newbold <bnewbold@robocracy.org>  2020-12-17 22:07:51 -0800
committer  Bryan Newbold <bnewbold@robocracy.org>  2020-12-17 22:08:28 -0800
commit     753407d23dcc507b4f933a0a062ef81f5ffc72da (patch)
tree       6c054222ccc23851d29d21107307abfe44b47205 /notes/bulk_edits
parent     60e022609cd3fbbf9634577149018592e680858d (diff)
DOAJ import notes
Diffstat (limited to 'notes/bulk_edits')
-rw-r--r--  notes/bulk_edits/2020-12-14_doaj.md  23
-rw-r--r--  notes/bulk_edits/CHANGELOG.md         2
2 files changed, 23 insertions, 2 deletions
diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
index 7e746082..64a80fda 100644
--- a/notes/bulk_edits/2020-12-14_doaj.md
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -100,6 +100,25 @@ Will shuffle the entire file, import in a single thread, and just skip
importing if there is any fuzzy match (not try to merge/update). Expecting
about 500k new releases after such filtering.
- # full run (TODO)
- zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+Ok, on 2020-12-17, back with patches to use fuzzycat in filtering. Trying
+another batch:
+
+ # git rev: 60e022609cd3fbbf9634577149018592e680858d
+ # DB before: Size: 678.47G
+
+ export FATCAT_AUTH_WORKER_DOAJ=...
+
+ zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n1000 | tail -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ => Counter({'total': 100, 'exists': 71, 'insert': 19, 'exists-fuzzy': 10, 'skip': 0, 'update': 0})
+
+ # https://fatcat.wiki/changelog/5033496
+
+Sampled 10x of these and they look much better: no obvious duplication. Going
+ahead with the full import; note that other ingest is happening in parallel
+(many crossref, datacite, and pubmed imports which had backed up).
+
+ # full run
+ # note the shuf command added, in an attempt to reduce duplicates within this corpus
+ zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ # started 2020-12-17 22:01 (Pacific)
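For context, here is a minimal, self-contained Python sketch of the "skip if there is
any fuzzy match" filtering strategy described in the notes above. It is not the actual
fatcat_import.py / fuzzycat code: the `lookup_release_by_doi`, `find_fuzzy_matches`,
and `insert_release` helpers are hypothetical stubs, and the record parsing assumes
DOAJ's bibjson layout (a title plus an identifier list).

    # Hedged sketch only; all three helpers are placeholders for the real
    # importer / fuzzycat plumbing.
    import json
    import sys
    from collections import Counter

    def lookup_release_by_doi(doi):
        # placeholder: the real importer does an exact identifier lookup first
        return None

    def find_fuzzy_matches(title):
        # placeholder: the real filtering uses fuzzycat to find similar existing releases
        return []

    def insert_release(release):
        # placeholder: the real importer creates the entity via the API within an editgroup
        pass

    def run_doaj_import(lines):
        counts = Counter()
        for line in lines:
            counts["total"] += 1
            bibjson = json.loads(line).get("bibjson", {})
            title = bibjson.get("title")
            doi = next((i.get("id") for i in bibjson.get("identifier", [])
                        if (i.get("type") or "").lower() == "doi"), None)
            if not title:
                counts["skip"] += 1
                continue
            if doi and lookup_release_by_doi(doi):
                counts["exists"] += 1
                continue
            if find_fuzzy_matches(title):
                # any fuzzy match at all means skip; no merge/update is attempted
                counts["exists-fuzzy"] += 1
                continue
            insert_release({"title": title, "doi": doi})
            counts["insert"] += 1
        return counts

    if __name__ == "__main__":
        print(run_doaj_import(sys.stdin))

Piping a sample of the dump through a script like this would print a Counter in the
same shape as the importer output quoted above.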
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index bef25e84..5f25d769 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -13,6 +13,8 @@ This file should not turn into a TODO list!
Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
+Imported DOAJ article metadata from a 2020-11 dump.
+
## 2020-03
Started harvesting both Arxiv and Pubmed metadata daily and importing to