author     Bryan Newbold <bnewbold@robocracy.org>  2018-06-30 14:47:54 -0700
committer  Bryan Newbold <bnewbold@robocracy.org>  2018-06-30 14:47:54 -0700
commit     cdc8f987d16a91ac9d54a42c72d714fe8e4842d3 (patch)
tree       5a4c9f1205e9a6f1e67d5f8caec0666027a88346
parent     296efbdd615fe4f9b7ad22a71cc2812142c17aee (diff)
importer updates
-rw-r--r--   python/README_import.md   18
1 file changed, 16 insertions, 2 deletions
diff --git a/python/README_import.md b/python/README_import.md
index 7301d72e..f43d9424 100644
--- a/python/README_import.md
+++ b/python/README_import.md
@@ -23,8 +23,13 @@ the others:
 
 From CSV file:
 
+    export LC_ALL=C.UTF-8
     time ./client.py import-issn /srv/datasets/journal_extra_metadata.csv
 
+    real    2m42.148s
+    user    0m11.148s
+    sys     0m0.336s
+
 Pretty quick, a few minutes.
 
 ## ORCID
@@ -33,7 +38,8 @@ Directly from compressed tarball; takes about 2 hours in production:
 
     tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | grep '"person":' | time parallel -j12 --pipe --round-robin ./client.py import-orcid -
 
-Or, from pre-uncompressed tarball:
+After tuning database, `jq` CPU seems to be bottleneck, so, from pre-extracted
+tarball:
 
     tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | rg '"person":' > /srv/datasets/public_profiles_1_2_json.all.json
     time parallel --bar --pipepart -j8 -a /srv/datasets/public_profiles_1_2_json.all.json ./client.py import-orcid -
@@ -76,11 +82,19 @@ For the full batch, on production machine with 12 threads, around 3.8 million re
 
     => 644/second
 
+After some simple database tuning:
+
+    2177.86 user
+    145.60 system
+    56:41.26 elapsed
+
+    => 1117/second
+
 ## Crossref
 
 From compressed:
 
-    xzcat /srv/datasets/crossref-works.2018-01-21.json.xz | time parallel -j12 --round-robin --pipe ./client.py import-crossref - /srv/datasets/20180216.ISSN-to-ISSN-L.txt
+    xzcat /srv/datasets/crossref-works.2018-01-21.json.xz | time parallel -j20 --round-robin --pipe ./client.py import-crossref - /srv/datasets/20180216.ISSN-to-ISSN-L.txt
 
 ## Manifest
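As a rough sanity check of the records-per-second figures added in the diff above, the post-tuning elapsed time can be converted back into a throughput estimate. This is only a sketch: the ~3.8 million record count is the one quoted in the README's own text, and the arithmetic just converts the 56:41.26 elapsed time into seconds.

    # 56:41.26 elapsed is about 3401 seconds for ~3.8 million ORCID records
    echo "3800000 / (56*60 + 41)" | bc
    # => 1117, consistent with the "1117/second" figure in the diff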