## JALC

Update to eee39965eee92b5005df0d967be779c2f2bb15f8

    export FATCAT_AUTH_WORKER_JALC=blah

Extracted the file instead of piping it through zcat.

Start small; do a random bunch (10k) single-threaded to pre-create containers:

    head -n100 /srv/fatcat/datasets/JALC-LOD-20180907.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
    shuf -n100 /srv/fatcat/datasets/JALC-LOD-20180907.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
    shuf -n10000 /srv/fatcat/datasets/JALC-LOD-20180907.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3

    Counter({'total': 9971, 'insert': 7138, 'exists': 2826, 'inserted.container': 144, 'skip': 7, 'update': 0})

Then the bulk import command:

    cat /srv/fatcat/datasets/JALC-LOD-20180907.rdf | pv -l | time parallel -j20 --round-robin --pipe ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3

Hit an error:

    Traceback (most recent call last):
      File "./fatcat_import.py", line 365, in <module>
        main()
      File "./fatcat_import.py", line 362, in main
        args.func(args)
      File "./fatcat_import.py", line 23, in run_jalc
        Bs4XmlLinesPusher(ji, args.xml_file, "
    [...]
        main()
      File "./fatcat_import.py", line 362, in main
        args.func(args)
      File "./fatcat_import.py", line 43, in run_pubmed
        Bs4XmlLargeFilePusher(pi, args.xml_file, "PubmedArticle", record_list_tag="PubmedArticleSet").run()
      File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 666, in run
        self.importer.push_record(record)
      File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 302, in push_record
        entity = self.parse_record(raw_record)
      File "/srv/fatcat/src/python/fatcat_tools/importers/pubmed.py", line 494, in parse_record
        int(pub_date.Day.string))
    ValueError: day is out of range for month

Lesson here is to get the whole thing working end-to-end in QA, with no errors under `parallel`, before trying in prod. Was impatient!

TODO: re-run these with a patch. Going to do that after the dump/snapshot/etc., though.