Diffstat (limited to 'python/README_import.md')
-rw-r--r-- python/README_import.md | 9
1 files changed, 6 insertions, 3 deletions
diff --git a/python/README_import.md b/python/README_import.md
index cc9a94e1..2465940b 100644
--- a/python/README_import.md
+++ b/python/README_import.md
@@ -26,11 +26,13 @@ the others:
wget https://archive.org/download/ia_papers_manifest_2018-01-25/index/idents_files_urls.sqlite.gz
wget https://archive.org/download/ia_journal_metadata_explore_2018-04-05/journal_extra_metadata.csv
wget https://archive.org/download/issn_issnl_mappings/20180216.ISSN-to-ISSN-L.txt
- wget https://archive.org/download/orcid-dump-2017/public_profiles_API-2.0_2017_10_json.tar.gz
+ wget https://archive.org/download/orcid-dump-2017/public_profiles_1_2_json.all.json.gz
wget https://archive.org/download/ia_journal_pid_map_munge_20180908/release_ids.ia_munge_20180908.sqlite3.gz
wget https://archive.org/download/ia_test_paper_matches/2018-08-27-2352.17-matchcrossref.insertable.json.gz
wget https://archive.org/download/ia_papers_manifest_2018-01-25_matched/ia_papers_manifest_2018-01-25.matched.json.gz
+ gunzip public_profiles_1_2_json.all.json.gz
+
## ISSN
From CSV file:
@@ -54,13 +56,14 @@ Usually 24 hours or so on fast production machine.
## Matched
-Unknown speed!
+These each take 2-4 hours:
# No file update for the first import...
- zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched --no-file-updates -
+ time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched --no-file-updates -
# ... but do on the second
zcat /srv/fatcat/datasets/2018-08-27-2352.17-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py matched -
# GROBID extracted (release+file)
time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py grobid-metadata -
+
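
Since each matched import takes 2-4 hours, it may be worth confirming that the downloaded archives are intact before starting one. The following is a minimal sketch, not part of this commit or of fatcat: `check_gz` is a hypothetical helper name, and it assumes only that `gzip` is installed. It reads each file on stdin so the check does not depend on the file suffix.

```shell
# Hypothetical helper (not part of fatcat): test each gzip archive for
# integrity and report one line per file. Returns non-zero if any file fails.
check_gz() {
    status=0
    for f in "$@"; do
        # gzip -t decompresses to nowhere and only checks the stream
        if gzip -t < "$f" 2>/dev/null; then
            echo "ok $f"
        else
            echo "BAD $f"
            status=1
        fi
    done
    return "$status"
}
```

One possible use, with the dataset directory taken from the commands above: `check_gz /srv/fatcat/datasets/*.gz && echo "all archives readable"`.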