aboutsummaryrefslogtreecommitdiffstats
path: root/notes/bulk_edits
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2020-12-23 11:07:41 -0800
committerBryan Newbold <bnewbold@robocracy.org>2020-12-23 11:07:41 -0800
commitd33a37eab50e95ceabadf7bbc20088ad62669564 (patch)
tree87f9e6d83808abebc7dd647dcff5fa3f7290f139 /notes/bulk_edits
parent3031aa414932b39f38a6456df2a6f55f0e72dfbe (diff)
downloadfatcat-d33a37eab50e95ceabadf7bbc20088ad62669564.tar.gz
fatcat-d33a37eab50e95ceabadf7bbc20088ad62669564.zip
DOAJ import notes, and SQL/stats update
Diffstat (limited to 'notes/bulk_edits')
-rw-r--r--notes/bulk_edits/2020-12-14_doaj.md15
1 files changed, 15 insertions, 0 deletions
diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
index 64a80fda..5e897183 100644
--- a/notes/bulk_edits/2020-12-14_doaj.md
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -122,3 +122,18 @@ ahead with the full import; note that other ingest is happening in parallel
zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
# started 2020-12-17 22:01 (Pacific)
+
+ => 5.45M 52:38:45 [28.8 /s]
+ => Counter({'total': 1366458, 'exists': 1020295, 'insert': 200249, 'exists-fuzzy': 144334, 'skip': 1563, 'skip-title': 1563, 'skip-doaj-id-mismatch': 17, 'update': 0})
+
+As total estimates:
+
+- total: 5,465,832
+- exists: 4,081,180
+- exists-fuzzy: 577,336
+- insert: 800,996
+
+Ending database size: Size: 684.08G
+
+(note that regular imports were running during same period)
+