author Bryan Newbold <bnewbold@robocracy.org> 2018-06-21 18:24:18 -0700
committer Bryan Newbold <bnewbold@robocracy.org> 2018-06-21 18:24:18 -0700
commit c7687e259cd003b3737a8bd9dd1ae51bf1f15a1e
tree 428ee45959a9e01fc3179b279c8be5b4790a02a2
parent 3075f0ab8853fd97c68d3f0b8086dfa5c863c7f2
update import numbers
-rw-r--r-- python/README_import.md | 12
1 files changed, 11 insertions, 1 deletions
diff --git a/python/README_import.md b/python/README_import.md
index 60f91cf2..7301d72e 100644
--- a/python/README_import.md
+++ b/python/README_import.md
@@ -25,9 +25,11 @@ From CSV file:
time ./client.py import-issn /srv/datasets/journal_extra_metadata.csv
+Pretty quick, a few minutes.
+
## ORCID
-Directly from compressed tarball:
+Directly from compressed tarball; takes about 2 hours in production:
tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | grep '"person":' | time parallel -j12 --pipe --round-robin ./client.py import-orcid -
@@ -66,6 +68,14 @@ for ~9k files:
=> 203/second
+For the full batch, on production machine with 12 threads, around 3.8 million records:
+
+ 3550.76 user
+ 190.16 system
+ 1:40:01 elapsed
+
+ => 644/second
+
## Crossref
From compressed:
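
As a check on the ORCID throughput figure added in this commit, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper names `elapsed_to_seconds` and `records_per_second` are hypothetical and not part of fatcat, and the record count is the rounded "around 3.8 million" from the README, so the result will not match the reported rate exactly.

```python
def elapsed_to_seconds(elapsed: str) -> float:
    """Convert a `time`-style elapsed string ([H:]M:S[.frac]) to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        # Each colon-separated field is worth 60x the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


def records_per_second(records: int, elapsed: str) -> float:
    """Average throughput over the whole run (hypothetical helper)."""
    return records / elapsed_to_seconds(elapsed)


if __name__ == "__main__":
    secs = elapsed_to_seconds("1:40:01")  # elapsed time from the README
    print(f"elapsed: {secs:.0f} s")
    # Assumption: the rounded record count quoted in the README.
    print(f"rate at 3.8M records: {records_per_second(3_800_000, '1:40:01'):.0f}/s")
    # Working backwards from the reported 644/second.
    print(f"records implied by 644/s: {644 * secs:,.0f}")
```

With the rounded count the rate comes out near 633/second; the reported 644/second corresponds to roughly 3.86 million records over the same 6001 seconds, which is consistent with "around 3.8 million" being an approximation.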