author Bryan Newbold <bnewbold@robocracy.org> 2019-02-01 11:36:55 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2019-02-01 11:41:24 -0800
commit da504b5f393b7e97f59c458d74dce44ee7719557
tree 57d5907626f10dbbb35fa354d6359a3015890a57
parent 7e651a8d10d8f71b463dd7439b04909a42f0cd4c
'final' (first?) bootstraps in progress
-rw-r--r-- notes/bootstrap/import_timing_20190129.txt 125
-rw-r--r-- notes/bootstrap/import_timing_20190130.txt 150
2 files changed, 275 insertions, 0 deletions
diff --git a/notes/bootstrap/import_timing_20190129.txt b/notes/bootstrap/import_timing_20190129.txt
new file mode 100644
index 00000000..6d635f92
--- /dev/null
+++ b/notes/bootstrap/import_timing_20190129.txt
@@ -0,0 +1,125 @@
+
+This is the first attempt at a clean final production import. Running in QA; if
+all goes well would dump and import in prod.
+
+Made a number of changes since yesterday's import, so won't be surprised if we
+run into problems. Plan is to make any fixes and push through to the end to turn
+up any additional issues/bugs, then iterate yet again if needed.
+
+## Service up/down
+
+ sudo service fatcat-web stop
+ sudo service fatcat-api stop
+
+ # shutdown all the import/export/etc
+ # delete any snapshots and /tmp/fatcat*
+ sudo rm /srv/fatcat/snapshots/*
+ sudo rm /tmp/fatcat_*
+
+ # git pull
+ # ansible playbook push
+ # re-build fatcat-api to ensure that worked
+
+ sudo service fatcat-web stop
+ sudo service fatcat-api stop
+
+ # as postgres user:
+ DATABASE_URL=postgres://postgres@/fatcat_prod /opt/cargo/bin/diesel database reset
+ sudo service postgresql restart
+
+ http delete :9200/fatcat_release
+ http delete :9200/fatcat_container
+ http delete :9200/fatcat_changelog
+ http put :9200/fatcat_release < release_schema.json
+ http put :9200/fatcat_container < container_schema.json
+ http put :9200/fatcat_changelog < changelog_schema.json
+ sudo service elasticsearch stop
+ sudo service kibana stop
+
+ sudo service fatcat-api start
+
+ # ensure rust/.env -> /srv/fatcat/config/fatcat_api.env
+ wget https://archive.org/download/ia_journal_metadata/journal_metadata.2019-01-25.json
+
+ # create new auth keys via bootstrap (edit debug -> release first)
+ # update config/env/ansible/etc with new tokens
+ # delete existing entities
+
+ # run the imports!
+
+ # after running below imports
+ sudo service fatcat-web start
+ sudo service elasticsearch start
+ sudo service kibana start
+
+## Import commands
+
+ rust version (as webcrawl): 1.32.0
+ git commit: 586458cacabd1d2f4feb0d0f1a9558f229f48f5e
+
+ export LC_ALL=C.UTF-8
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA="..."
+ time ./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-01-25.json
+
+ hit a bug (see below) but safe to continue
+
+ Counter({'total': 107869, 'insert': 102623, 'exists': 5200, 'skip': 46, 'update': 0})
+ real 4m43.635s
+ user 1m55.904s
+ sys 0m5.376s
+
+ export FATCAT_AUTH_WORKER_ORCID="..."
+ time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -
+
+ hit another bug (see below), again safe to continue
+
+ Counter({'total': 48888, 'insert': 48727, 'skip': 161, 'exists': 0, 'update': 0}) (etc)
+ real 29m56.773s
+ user 89m2.532s
+ sys 5m11.104s
+
+ export FATCAT_AUTH_WORKER_CROSSREF="..."
+ time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 --bezerk-mode
+
+ running very slow; maybe batch size of 100 was too large? pushing 80+
+ MB/sec, but very little CPU utilization. some ISSN lookups taking up to
+ a second each (!). no vacuum in progress. at xzcat, only '2.1 MiB/s'
+
+ at current rate will take more than 48 hours. hrm.
+
+ after 3.5 hours or so, cancelled and restarted in non-bezerk mode, with batch size of 50.
+
+ xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py --batch-size 50 crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
+
+ at xzcat, about 5 MiB/s. just after citation efficiency, full import was around 20 hours.
+
+ if slows down again, may be due to some threads failing and not
+ dumping. if that's the case, should try with 'head -n200000' or so to
+ catch output errors.
+
+ ps aux | rg fatcat_import.py | rg -v perl | wc -l => 22
+
+ at 7.2% (beyond earlier progress), and now inserting (not just
+ lookups), pushing 5.6 MiB/sec, 17 hours (estimated) to go, seems to be
+ running fine.
+
+ at 12 hours in, at 20% and down to 1.9 MiB/sec again. Lots of disk I/O
+ (80 MB/sec write), seems to be bottleneck, not sure why.
+
+ would take... about an hour to restart, might save 20+ hours, might waste 14?
+
+ export FATCAT_AUTH_SANDCRAWLER="..."
+ export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_SANDCRAWLER
+ time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched --bezerk-mode -
+
+ time zcat /srv/fatcat/datasets/2018-12-18-2237.09-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched -
+
+ time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 grobid-metadata - --longtail-oa
+
+## Bugs encountered
+
+x broke a constraint or made an otherwise invalid request: name is required for all Container entities
+ => wasn't bezerk mode, so should be fine to continue
+x {"success":false,"error":"BadRequest","message":"broke a constraint or made an otherwise invalid request: display_name is required for all Creator entities"}
+ => wasn't bezerk mode, so should be fine to continue
+
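
The hours-remaining estimates in the crossref notes above can be roughly reproduced from the xzcat progress figures. The sketch below is only an approximation: it assumes a constant streaming rate and takes the 331.9 GiB uncompressed total from the `xz --verbose` line recorded in the 20190130 notes. The function name `hours_remaining` is made up for illustration and is not part of fatcat.

```python
def hours_remaining(total_gib: float, fraction_done: float, rate_mib_s: float) -> float:
    """Hours left to stream the rest of the input, assuming a constant rate."""
    remaining_mib = total_gib * 1024 * (1.0 - fraction_done)
    return remaining_mib / rate_mib_s / 3600


# Assumed total: 331.9 GiB uncompressed, as reported by xz in the 20190130 run.
TOTAL_GIB = 331.9

# Progress points taken from the notes: (fraction done, MiB/sec at xzcat)
for fraction_done, rate in [(0.072, 5.6), (0.20, 1.9)]:
    eta = hours_remaining(TOTAL_GIB, fraction_done, rate)
    print(f"{fraction_done:.1%} done at {rate} MiB/sec: about {eta:.1f} hours to go")
```

At 7.2% and 5.6 MiB/sec this lands in the same range as the 17-hour estimate in the notes; at 20% and 1.9 MiB/sec the projection is much longer, which is the trade-off behind considering a restart.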
diff --git a/notes/bootstrap/import_timing_20190130.txt b/notes/bootstrap/import_timing_20190130.txt
new file mode 100644
index 00000000..d102e39f
--- /dev/null
+++ b/notes/bootstrap/import_timing_20190130.txt
@@ -0,0 +1,150 @@
+
+The QA import is running really slow; this is a parallel attempt in case things
+are faster on the fatcat-prod2-vm machine, with 50 batch size and bezerk mode.
+
+## Service up/down
+
+ sudo service fatcat-web stop
+ sudo service fatcat-api stop
+
+ # shutdown all the import/export/etc
+ # delete any snapshots and /tmp/fatcat*
+ sudo rm /srv/fatcat/snapshots/*
+ sudo rm /tmp/fatcat_*
+
+ # git pull
+ # ansible playbook push
+ # re-build fatcat-api to ensure that worked
+
+ sudo service fatcat-web stop
+ sudo service fatcat-api stop
+
+ # as postgres user:
+ DATABASE_URL=postgres://postgres@/fatcat_prod /opt/cargo/bin/diesel database reset
+ sudo service postgresql restart
+
+ http delete :9200/fatcat_release
+ http delete :9200/fatcat_container
+ http delete :9200/fatcat_changelog
+ http put :9200/fatcat_release < release_schema.json
+ http put :9200/fatcat_container < container_schema.json
+ http put :9200/fatcat_changelog < changelog_schema.json
+ sudo service elasticsearch stop
+ sudo service kibana stop
+
+ sudo service fatcat-api start
+
+ # ensure rust/.env -> /srv/fatcat/config/fatcat_api.env
+ wget https://archive.org/download/ia_journal_metadata/journal_metadata.2019-01-25.json
+
+ # if necessary:
+ # ALTER USER fatcat WITH SUPERUSER;
+ # ALTER USER fatcat WITH PASSWORD '...';
+ # create new auth keys via bootstrap (edit debug -> release first)
+ # update config/env/ansible/etc with new tokens
+ # delete existing entities
+
+ # run the imports!
+
+ # after running below imports
+ sudo service fatcat-web start
+ sudo service elasticsearch start
+ sudo service kibana start
+
+## Import commands
+
+ rust version (as webcrawl): 1.32.0
+ git commit: 1fe371288daf417cdf44b94e372b485426b47134
+
+ export LC_ALL=C.UTF-8
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA="..."
+ time ./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-01-25.json
+
+ Counter({'total': 107869, 'insert': 107823, 'skip': 46, 'update': 0, 'exists': 0})
+ real 6m2.287s
+ user 2m4.612s
+ sys 0m5.664s
+
+ export FATCAT_AUTH_WORKER_ORCID="..."
+ time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -
+
+ 98% 79:1=22s
+ Counter({'total': 48097, 'insert': 47908, 'skip': 189, 'exists': 0, 'update': 0})
+ 100% 80:0=0s
+
+ real 33m9.211s
+ user 93m33.040s
+ sys 5m32.176s
+
+ export FATCAT_AUTH_WORKER_CROSSREF="..."
+ time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 --bezerk-mode
+
+ seems to be maintaining 9.1 MiB/sec and estimates 15 hours. 200 M/sec disk write. we'll see!
+
+ 100 % 33.2 GiB / 331.9 GiB = 0.100 3.6 MiB/s 26:16:57
+
+ Counter({'total': 5001477, 'insert': 4784708, 'skip': 216769, 'update': 0, 'exists': 0})
+ 395971.48user 8101.15system 26:17:07elapsed 427%CPU (0avgtext+0avgdata 431560maxresident)k
+ 232972688inputs+477055792outputs (334645major+39067735minor)pagefaults 0swaps
+
+ real 1577m7.908s
+ user 6681m58.948s
+ sys 141m25.560s
+
+ export FATCAT_AUTH_SANDCRAWLER="..."
+ export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_SANDCRAWLER
+ time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched - --bezerk-mode
+
+ (accidentally lost, but took about 3 hours)
+
+ time zcat /srv/fatcat/datasets/2018-12-18-2237.09-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched -
+
+ Counter({'total': 827944, 'insert': 555359, 'exists': 261441, 'update': 11129, 'skip': 15})
+ 32115.82user 1370.12system 4:30:25elapsed 206%CPU (0avgtext+0avgdata 37312maxresident)k
+ 28200inputs+3767112outputs (108major+471069minor)pagefaults 0swaps
+
+ real 270m25.288s
+ user 535m52.908s
+ sys 22m56.328s
+
+ time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 grobid-metadata - --longtail-oa
+
+## Bugs encountered
+
+- empty AUTH_ALT_KEYS should just be ignored (not try to parse)
+
+Missing DOIs (out of scope?):
+
+ DOI not found: 10.1023/a:1009888907797
+ DOI not found: 10.1186/1471-2148-4-49
+ DOI not found: 10.1023/a:1026471016927
+ DOI not found: 10.1090/s0002-9939-04-07569-0
+ DOI not found: 10.1186/1742-4682-1-11
+ DOI not found: 10.1186/1477-3163-2-5
+ DOI not found: 10.1186/gb-2003-4-4-210
+ DOI not found: 10.1186/gb-2004-5-9-r63
+ DOI not found: 10.13188/2330-2178.1000008
+ DOI not found: 10.4135/9781473960749
+ DOI not found: 10.1252/kakoronbunshu1953.36.479
+ DOI not found: 10.2320/materia.42.461
+ DOI not found: 10.1186/1742-4933-3-3
+ DOI not found: 10.14257/ijsh
+ DOI not found: 10.1023/a:1016008714781
+ DOI not found: 10.1023/a:1016648722322
+ DOI not found: 10.1787/5k990rjhvtlv-en
+ DOI not found: 10.4064/fm
+ DOI not found: 10.1090/s0002-9947-98-01992-8
+ DOI not found: 10.1186/1475-925x-2-16
+ DOI not found: 10.1186/1479-5868-3-9
+ DOI not found: 10.1090/s0002-9939-03-07205-8
+ DOI not found: 10.1023/a:1008111923880
+ DOI not found: 10.1090/s0002-9939-98-04322-6
+ DOI not found: 10.1186/gb-2005-6-11-r93
+ DOI not found: 10.5632/jila1925.2.236
+ DOI not found: 10.1023/a:1011359428672
+ DOI not found: 10.1090/s0002-9947-97-01844-8
+ DOI not found: 10.1155/4817
+ DOI not found: 10.1186/1472-6807-1-5
+ DOI not found: 10.1002/(issn)1542-0981
+ DOI not found: 10.1186/rr115
+
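
To compare the two runs, the `Counter` summaries and bash `time` output recorded in these notes can be turned into a records-per-second figure. This is a minimal sketch with hypothetical helper names (`parse_real`, `records_per_second`). It uses only the single-process journal-metadata import, because the `Counter` lines printed under `parallel` appear to come from individual workers rather than the whole job.

```python
import re
from collections import Counter


def parse_real(value: str) -> float:
    """Convert a bash `time` value such as '6m2.287s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def records_per_second(counts: Counter, real: str) -> float:
    """Records processed per wall-clock second."""
    return counts["total"] / parse_real(real)


# journal-metadata figures copied from the two sets of notes
qa_run = Counter({'total': 107869, 'insert': 102623, 'exists': 5200, 'skip': 46, 'update': 0})
prod2_run = Counter({'total': 107869, 'insert': 107823, 'skip': 46, 'update': 0, 'exists': 0})

for name, counts, real in [("20190129", qa_run, "4m43.635s"), ("20190130", prod2_run, "6m2.287s")]:
    # Sanity check: the per-outcome counts should add up to the total.
    outcomes = sum(v for k, v in counts.items() if k != "total")
    assert outcomes == counts["total"]
    print(f"{name}: {records_per_second(counts, real):.1f} records/sec")
```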