field | value | date |
---|---|---|
author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-09 18:10:35 -0800 |
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-09 18:49:46 -0800 |
commit | ba7f9214d2038882952eb50cd4dc5eff4eb0e6ff | |
tree | 2f3ff3ba4b70f0f7d4603a224bf68cbe3892376b /python/README_import.md | |
parent | a6d994fbc18debcf3860e6deb12eb54234a42839 | |
remove deprecated extid sqlite3 lookup table feature from importers
This was used during initial bulk imports, but is no longer used and
could create serious metadata problems if used accidentally.
In retrospect, it also made metadata provenance less transparent, and
may have done more harm than good overall.
Diffstat (limited to 'python/README_import.md')
-rw-r--r-- | python/README_import.md | 3 |
1 files changed, 3 insertions, 0 deletions
diff --git a/python/README_import.md b/python/README_import.md
index 6853a4d7..74e75e14 100644
--- a/python/README_import.md
+++ b/python/README_import.md
@@ -52,6 +52,7 @@ Usually tens of minutes on fast production machine.
 Usually 24 hours or so on fast production machine.
+ # NOTE: `--extid-map-file` was used during initial import, but is now deprecated
 time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
 ## JALC
@@ -59,6 +60,7 @@ Usually 24 hours or so on fast production machine.
 First import a random subset single threaded to create (most) containers. On a fast machine, this takes a couple minutes.
+ # NOTE: `--extid-map-file` was used during initial import, but is now deprecated
 time ./fatcat_import.py jalc /srv/fatcat/datasets/JALC-LOD-20180907.sample10k.rdf /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
 Then, in parallel:
@@ -116,6 +118,7 @@ Prep JSON files from sqlite (for parallel import):
 Run import in parallel:
+ # NOTE: `--extid-map-file` was used during initial import, but is now deprecated
 export FATCAT_AUTH_WORKER_CRAWL=...
 zcat /srv/fatcat/datasets/s2_doi.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py arabesque --json-file - --extid-type doi --crawl-id DIRECT-OA-CRAWL-2019 --no-require-grobid