aboutsummaryrefslogtreecommitdiffstats
path: root/python/README_import.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2021-11-09 18:10:35 -0800
committerBryan Newbold <bnewbold@robocracy.org>2021-11-09 18:49:46 -0800
commitba7f9214d2038882952eb50cd4dc5eff4eb0e6ff (patch)
tree2f3ff3ba4b70f0f7d4603a224bf68cbe3892376b /python/README_import.md
parenta6d994fbc18debcf3860e6deb12eb54234a42839 (diff)
downloadfatcat-ba7f9214d2038882952eb50cd4dc5eff4eb0e6ff.tar.gz
fatcat-ba7f9214d2038882952eb50cd4dc5eff4eb0e6ff.zip
remove deprecated extid sqlite3 lookup table feature from importers
This was used during initial bulk imports, but is no longer used and could create serious metadata problems if used accidentially. In retrospect, it also made metadata provenance less transparent, and may have done more harm than good overall.
Diffstat (limited to 'python/README_import.md')
-rw-r--r--python/README_import.md3
1 files changed, 3 insertions, 0 deletions
diff --git a/python/README_import.md b/python/README_import.md
index 6853a4d7..74e75e14 100644
--- a/python/README_import.md
+++ b/python/README_import.md
@@ -52,6 +52,7 @@ Usually tens of minutes on fast production machine.
Usually 24 hours or so on fast production machine.
+ # NOTE: `--extid-map-file` was used during initial import, but is now deprecated
time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
## JALC
@@ -59,6 +60,7 @@ Usually 24 hours or so on fast production machine.
First import a random subset single threaded to create (most) containers. On a
fast machine, this takes a couple minutes.
+ # NOTE: `--extid-map-file` was used during initial import, but is now deprecated
time ./fatcat_import.py jalc /srv/fatcat/datasets/JALC-LOD-20180907.sample10k.rdf /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
Then, in parallel:
@@ -116,6 +118,7 @@ Prep JSON files from sqlite (for parallel import):
Run import in parallel:
+ # NOTE: `--extid-map-file` was used during initial import, but is now deprecated
export FATCAT_AUTH_WORKER_CRAWL=...
zcat /srv/fatcat/datasets/s2_doi.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id DIRECT-OA-CRAWL-2019 --no-require-grobid