importer readmes

author: Vinay Goel <vinay@archive.org> 2018-06-21 21:22:14 +0000
committer: Vinay Goel <vinay@archive.org> 2018-06-21 21:22:14 +0000
commit: afc44f15c993cf64d7cd4554ab9410a172f2e1fc (patch)
tree: 550b40a906f474978093d69bbe60ae2a3637df00 /python/README_import.md
parent: 256b297886352fd0e732183e00d476bb32bc663e (diff)
download: fatcat-afc44f15c993cf64d7cd4554ab9410a172f2e1fc.tar.gz
fatcat-afc44f15c993cf64d7cd4554ab9410a172f2e1fc.zip
1 files changed, 46 insertions, 0 deletions
diff --git a/python/README_import.md b/python/README_import.md
index 11cb0fd8..60f91cf2 100644
--- a/python/README_import.md
+++ b/python/README_import.md
@@ -1,6 +1,41 @@
 
+Run in order:
+
+- ISSN
+- ORCID
+- Crossref
+- Manifest
+
+Encoding problems are common; always run `export LC_ALL=C.UTF-8` before importing.
+
+## Data Sources
+
+Download the following; uncompress the sqlite file, but **do not** uncompress
+the others:
+
+    https://archive.org/download/crossref_doi_dump_201801/crossref-works.2018-01-21.json.xz
+    https://archive.org/download/ia_papers_manifest_2018-01-25/index/idents_files_urls.sqlite.gz
+    https://archive.org/download/ia_journal_metadata_explore_2018-04-05/journal_extra_metadata.csv
+    https://archive.org/download/issn_issnl_mappings/20180216.ISSN-to-ISSN-L.txt
+    https://archive.org/download/orcid-dump-2017/public_profiles_API-2.0_2017_10_json.tar.gz
+
+## ISSN
+
+From CSV file:
+
+    time ./client.py import-issn /srv/datasets/journal_extra_metadata.csv
+
 ## ORCID
 
+Directly from compressed tarball:
+
+    tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | grep '"person":' | time parallel -j12 --pipe --round-robin ./client.py import-orcid -
+
+Or, after first extracting the tarball to a single JSON file:
+
+    tar xf /srv/datasets/public_profiles_API-2.0_2017_10_json.tar.gz -O | jq -c . | rg '"person":' > /srv/datasets/public_profiles_1_2_json.all.json
+    time parallel --bar --pipepart -j8 -a /srv/datasets/public_profiles_1_2_json.all.json ./client.py import-orcid -
+
 Does not work:
 
     ./client.py import-orcid /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3/0000-0001-5115-8623.json
@@ -13,6 +48,7 @@ Or for many files:
 
     find /data/orcid/partial/public_profiles_API-2.0_2017_10_json/3 -iname '*.json' | parallel --bar jq -c . {} | rg '"person":' | ./client.py import-orcid -
 
+### ORCID Performance
 
 for ~9k files:
 
@@ -29,3 +65,13 @@ for ~9k files:
     sys     0m0.104s
 
     => 203/second
+
+## Crossref
+
+From compressed:
+
+    xzcat /srv/datasets/crossref-works.2018-01-21.json.xz | time parallel -j12 --round-robin --pipe ./client.py import-crossref - /srv/datasets/20180216.ISSN-to-ISSN-L.txt
+
+## Manifest
+
+    time ./client.py import-manifest /srv/datasets/idents_files_urls.sqlite
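
The README added in this commit assumes that `LC_ALL` is set and that all five downloads sit in one dataset directory, with only the sqlite file uncompressed. A minimal preflight sketch of those two checks follows. The helper names `missing_datasets` and `locale_ok` are hypothetical and not part of the fatcat repository; the file names are taken from the import commands above.

```shell
#!/bin/sh
# Hypothetical preflight helpers; NOT part of the fatcat repository.

# File names the import commands in the README read from the dataset
# directory (the sqlite file is listed without its .gz suffix because
# the README says to uncompress it).
EXPECTED_FILES="journal_extra_metadata.csv
public_profiles_API-2.0_2017_10_json.tar.gz
crossref-works.2018-01-21.json.xz
20180216.ISSN-to-ISSN-L.txt
idents_files_urls.sqlite"

# Print one "missing: NAME" line per expected file absent from the
# directory given as $1; return non-zero if anything is missing.
missing_datasets() {
    _dir="$1"
    _status=0
    for _f in $EXPECTED_FILES; do
        if [ ! -f "$_dir/$_f" ]; then
            echo "missing: $_f"
            _status=1
        fi
    done
    return "$_status"
}

# Succeed only when LC_ALL has the value the README recommends.
locale_ok() {
    [ "${LC_ALL:-}" = "C.UTF-8" ]
}
```

One possible use, before starting the ISSN import, is `locale_ok && missing_datasets /srv/datasets && echo "ready to import"`. The `/srv/datasets` path is the one used in the commands above; adjust it if the files live elsewhere.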