Diffstat (limited to 'python/README_import.md')
-rw-r--r-- python/README_import.md | 20
1 files changed, 10 insertions, 10 deletions
diff --git a/python/README_import.md b/python/README_import.md
index ae9764e6..38c8406f 100644
--- a/python/README_import.md
+++ b/python/README_import.md
@@ -24,7 +24,7 @@ the others:
From CSV file:
export LC_ALL=C.UTF-8
- time ./client.py import-issn /srv/datasets/journal_extra_metadata.csv
+ time ./fatcat_import.py import-issn /srv/datasets/journal_extra_metadata.csv
real 2m42.148s
user 0m11.148s
@@ -36,38 +36,38 @@ Pretty quick, a few minutes.
Directly from compressed tarball; takes about 2 hours in production:
- tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | grep '"person":' | time parallel -j12 --pipe --round-robin ./client.py import-orcid -
+ tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | grep '"person":' | time parallel -j12 --pipe --round-robin ./fatcat_import.py import-orcid -
After tuning database, `jq` CPU seems to be bottleneck, so, from pre-extracted
tarball:
tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | rg '"person":' > /srv/datasets/public_profiles_1_2_json.all.json
- time parallel --bar --pipepart -j8 -a /srv/datasets/public_profiles_1_2_json.all.json ./client.py import-orcid -
+ time parallel --bar --pipepart -j8 -a /srv/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py import-orcid -
Does not work:
- ./client.py import-orcid /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3/0000-0001-5115-8623.json
+ ./fatcat_import.py import-orcid /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3/0000-0001-5115-8623.json
Instead:
- cat /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3/0000-0001-5115-8623.json | jq -c . | ./client.py import-orcid -
+ cat /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3/0000-0001-5115-8623.json | jq -c . | ./fatcat_import.py import-orcid -
Or for many files:
- find /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3 -iname '*.json' | parallel --bar jq -c . {} | rg '"person":' | ./client.py import-orcid -
+ find /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3 -iname '*.json' | parallel --bar jq -c . {} | rg '"person":' | ./fatcat_import.py import-orcid -
### ORCID Performance
for ~9k files:
- (python-B2RYrks8) bnewbold@orithena$ time parallel --pipepart -j4 -a /data/orcid/partial/public_profiles_API-2.0_2017_10_json/all.json ./client.py import-orcid -
+ (python-B2RYrks8) bnewbold@orithena$ time parallel --pipepart -j4 -a /data/orcid/partial/public_profiles_API-2.0_2017_10_json/all.json ./fatcat_import.py import-orcid -
real 0m15.294s
user 0m28.112s
sys 0m2.408s
=> 636/second
- (python-B2RYrks8) bnewbold@orithena$ time ./client.py import-orcid /data/orcid/partial/public_profiles_API-2.0_2017_10_json/all.json
+ (python-B2RYrks8) bnewbold@orithena$ time ./fatcat_import.py import-orcid /data/orcid/partial/public_profiles_API-2.0_2017_10_json/all.json
real 0m47.268s
user 0m2.616s
sys 0m0.104s
@@ -94,11 +94,11 @@ After some simple database tuning:
From compressed:
- xzcat /srv/datasets/crossref-works.2018-01-21.json.xz | time parallel -j20 --round-robin --pipe ./client.py import-crossref - /srv/datasets/20180216.ISSN-to-ISSN-L.txt
+ xzcat /srv/datasets/crossref-works.2018-01-21.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py import-crossref - /srv/datasets/20180216.ISSN-to-ISSN-L.txt
## Manifest
- time ./client.py import-manifest /srv/datasets/idents_files_urls.sqlite
+ time ./fatcat_import.py import-manifest /srv/datasets/idents_files_urls.sqlite
[...]
Finished a batch; row 284518671 of 9669646 (2942.39%). Total inserted: 6606900