Changes since last time: - auto-accept flag (potentially no need to UPDATE/vacuum database) - large CRUD code refactor (but not complete removal of api_server CRUD yet) - removed "single query insert per entity", but added row-based inserts ## Laptop Rough Timing HTTP-VERB branch 7354899493f6448bed5698ad6ade1dbebcf39379: => note: in this branch, importer is sitll creating editgroups in separate request time ./fatcat_import.py import-issn /home/bnewbold/code/oa-journals-analysis/upload-2018-04-05/journal_extra_metadata.csv real 0m28.779s user 0m5.716s sys 0m0.208s cat /data/crossref/crossref-works.2018-01-21.badsample_100k.json | time parallel -j4 --round-robin --pipe ./fatcat_import.py import-crossref - /data/issn/20180216.ISSN-to-ISSN-L.txt 89.83user 4.58system 1:46.67elapsed 88%CPU (0avgtext+0avgdata 386184maxresident)k 78632inputs+385408outputs (1major+579693minor)pagefaults 0swaps # 1:46.67 elapsed # running again time ./fatcat_import.py import-issn /home/bnewbold/code/oa-journals-analysis/upload-2018-04-05/journal_extra_metadata.csv real 0m29.714s user 0m6.048s sys 0m0.212s cat /data/crossref/crossref-works.2018-01-21.badsample_100k.json | time parallel -j4 --round-robin --pipe ./fatcat_import.py import-crossref - /data/issn/20180216.ISSN-to-ISSN-L.txt 89.07user 4.52system 1:48.11elapsed 86%CPU (0avgtext+0avgdata 385940maxresident)k 0inputs+377456outputs (0major+583532minor)pagefaults 0swaps # 1:48.11 elapsed MASTER branch 0053d133f8ff96aa4dedc1ff7e2754812ddfc79a time ./fatcat_import.py import-issn /home/bnewbold/code/oa-journals-analysis/upload-2018-04-05/journal_extra_metadata.csv real 1m8.061s user 0m7.384s sys 0m0.280s cat /data/crossref/crossref-works.2018-01-21.badsample_100k.json | time parallel -j4 --round-robin --pipe ./fatcat_import.py import-crossref - /data/issn/20180216.ISSN-to-ISSN-L.txt 96.68user 5.25system 3:23.05elapsed 50%CPU (0avgtext+0avgdata 385516maxresident)k 0inputs+389168outputs (0major+586810minor)pagefaults 0swaps # 3:23.05 elapsed AUTO-ACCEPT branch 8cccbcdef11e7ddc761ec494cb894a8d49a0d510 time ./fatcat_import.py import-issn /home/bnewbold/code/oa-journals-analysis/upload-2018-04-05/journal_extra_metadata.csv real 1m7.757s user 0m6.240s sys 0m0.228s cat /data/crossref/crossref-works.2018-01-21.badsample_100k.json | time parallel -j4 --round-robin --pipe ./fatcat_import.py import-crossref - /data/issn/20180216.ISSN-to-ISSN-L.txt 91.73user 4.65system 3:26.61elapsed 46%CPU (0avgtext+0avgdata 385512maxresident)k 0inputs+384976outputs (0major+580544minor)pagefaults 0swaps So, annecdotally, seems like autoaccept didn't do much here (not vacuum-limited?), but the row inserts were more than a factor of 2x performance improvement. Great! Could try experimenting with even larger batch sizes... ## Production Import http-verb branch 7354899493f6448bed5698ad6ade1dbebcf39379 time ./fatcat_import.py import-issn /srv/datasets/journal_extra_metadata.csv real 0m30.845s user 0m8.884s sys 0m0.312s time parallel --bar --pipepart -j8 -a /srv/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py import-orcid - # TPS: 1181 real 6m54.019s => down from 22m! user 25m33.980s sys 1m30.480s xzcat /srv/datasets/crossref-works.2018-01-21.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py import-crossref - /srv/datasets/20180216.ISSN-to-ISSN-L.txt # fatcatd at 200% CPU (2 full cores); many PIDs/workers, but only one so busy (must be diesel/db driver?) # parallel at ~80% # postgres pretty busy; looks like doing PARSE on ever request? some idle in transaction # postgres still does a vacuum to analyze; good! # ~600-1000 TPS near start (large variance) # left to run overnight... # slowed down to ~50-80 TPS about 10 hours later # lots of IOWAIT # only at 40309165 rows in the morning; this could take a long time # PostgreSQL 10.4 - wbgrp-svc500.us.archive.org - postgres@localhost:5432/postgres - Ref.: 2s # Size: 153.71G - 1.90M/s | TPS: 77 # Mem.: 68.70% - 22.60G/49.14G | IO Max: 20448/s # Swap: 3.80% - 1.88G/50.00G | Read : 10.17M/s - 2603/s # Load: 11.71 11.80 12.01 | Write: 4.00M/s - 1024/s 92530.17user 2891.76system 35:45:15elapsed 74%CPU (0avgtext+0avgdata 463520maxresident)k 1093736inputs+302588448outputs (52568major+36405225minor)pagefaults 0swaps # 35:45:15 elapsed time ./fatcat_import.py import-manifest /srv/datasets/idents_files_urls.sqlite ## Perf Tweaks SELECT relname, seq_scan - idx_scan AS too_much_seq, CASE WHEN seq_scan - coalesce(idx_scan, 0) > 0 THEN 'Missing Index?' ELSE 'OK' END, pg_relation_size(relname::regclass) AS rel_size, seq_scan, idx_scan FROM pg_stat_all_tables WHERE schemaname = 'public' AND pg_relation_size(relname::regclass) > 80000 ORDER BY too_much_seq DESC; relname | too_much_seq | case | rel_size | seq_scan | idx_scan ----------------------+--------------+------+--------------+----------+----------- file_rev_url | -1 | OK | 163323904 | 2 | 3 file_release | -3 | OK | 24461312 | 2 | 5 release_edit | -10265 | OK | 6080495616 | 2 | 10267 container_edit | -10265 | OK | 31170560 | 2 | 10267 work_edit | -10265 | OK | 6080364544 | 2 | 10267 file_edit | -10265 | OK | 49111040 | 2 | 10267 creator_edit | -10265 | OK | 330924032 | 2 | 10267 changelog | -11692 | OK | 106668032 | 2 | 11694 release_ref | -374197 | OK | 125437673472 | 3 | 374200 release_contrib | -374202 | OK | 16835354624 | 3 | 374205 release_rev_abstract | -374919 | OK | 162250752 | 3 | 374922 file_ident | -1047172 | OK | 41926656 | 2 | 1047174 container_rev | -1378943 | OK | 50356224 | 724612 | 2103555 <= file_rev | -2407647 | OK | 68493312 | 4 | 2407651 abstracts | -2765610 | OK | 1450901504 | 1 | 2765611 creator_ident | -7127467 | OK | 242688000 | 2 | 7127469 creator_rev | -7943007 | OK | 353370112 | 839155 | 8782162 <= container_ident | -57488245 | OK | 23060480 | 2 | 57488247 release_ident | -66085389 | OK | 4459159552 | 14 | 66085403 work_rev | -130613391 | OK | 2892775424 | 1 | 130613392 work_ident | -130633923 | OK | 4459192320 | 2 | 130633925 editgroup | -136775970 | OK | 120561664 | 1 | 136775971 release_rev | -718337215 | OK | 36850507776 | 8 | 718337223 Slowest queries: SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = $1 AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = $2 AND "creator_ident"."redirect_id" IS NULL LIMIT $3 SELECT "container_ident"."id", "container_ident"."is_live", "container_ident"."rev_id", "container_ident"."redirect_id", "container_rev"."id", "container_rev"."extra_json", "container_rev"."name", "container_rev"."publisher", "container_rev"."issnl", "container_rev"."wikidata_qid", "container_rev"."abbrev", "container_rev"."coden" FROM ("container_ident" INNER JOIN "container_rev" ON "container_ident"."rev_id" = "container_rev"."id") WHERE "container_rev"."issnl" = $1 AND "container_rev"."issnl" IS NOT NULL AND "container_ident"."is_live" = $2 AND "container_ident"."redirect_id" IS NULL LIMIT $3 SELECT $2 FROM ONLY "public"."release_rev" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x:...skipping... SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = $1 AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = $2 AND "creator_ident"."redirect_id" IS NULL LIMIT $3 DEBUG: (while a file import is running) Creator lookup where row exists: time curl localhost:9411/v0/creator/lookup?orcid=0000-0002-8867-1663 real 0m0.020s user 0m0.004s sys 0m0.004s SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = '0000-0002-8867-1663' AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = 't' AND "creator_ident"."redirect_id" IS NULL LIMIT 1; => (1 row) Time: 0.988 ms Creator lookup where row doesn't exist: bnewbold@wbgrp-svc500$ time curl localhost:9411/v0/creator/lookup?orcid=0000-0002-8867-166X {"message":"Not found: 0000-0002-8867-166X"} real 0m1.282s user 0m0.008s sys 0m0.000s Sep 10 21:50:49 wbgrp-svc500.us.archive.org fatcat-api[25327]: Sep 10 21:50:49.231 INFO GET http://localhost:9411/v0/creator/lookup?orcid=0000-0002-8867-166X 404 Not Found (1282 ms) SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = '0000-0002-8867-166X' AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = 't' AND "creator_ident"."redirect_id" IS NULL LIMIT 1; => (0 rows) Time: 0.810 ms ------- SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = '0000-0002-8867-1663' AND "creator_ident"."is_live" = 't' AND "creator_ident"."redirect_id" IS NULL LIMIT 1; SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = '0000-0002-8867-1663' AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = 't' AND "creator_ident"."redirect_id" IS NULL LIMIT 1; SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = '0000-0002-8867-166X' AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = 't' AND "creator_ident"."redirect_id" IS NULL LIMIT 1; from logs: execute __diesel_stmt_5: SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = $1 AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = $2 AND "creator_ident"."redirect_id" IS NULL LIMIT $3 NOTE: have been doing *full* postgres logs this whole time! probably a ton of disk churn. :( exact logs: SET TIME ZONE 'UTC' SET CLIENT_ENCODING TO 'UTF8' SELECT 1 SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = $1 AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = $2 AND "creator_ident"."redirect_id" IS NULL LIMIT $3 parameters: $1 = '0000-0002-8867-166X', $2 = 't', $3 = '1' SELECT "creator_ident"."id", "creator_ident"."is_live", "creator_ident"."rev_id", "creator_ident"."redirect_id", "creator_rev"."id", "creator_rev"."extra_json", "creator_rev"."display_name", "creator_rev"."given_name", "creator_rev"."surname", "creator_rev"."orcid", "creator_rev"."wikidata_qid" FROM ("creator_ident" INNER JOIN "creator_rev" ON "creator_ident"."rev_id" = "creator_rev"."id") WHERE "creator_rev"."orcid" = '0000-0002-8867-166X' AND "creator_rev"."orcid" IS NOT NULL AND "creator_ident"."is_live" = 't' AND "creator_ident"."redirect_id" IS NULL LIMIT '1'; CHANGES: - restart fatcat-api - restart postgresql - select pg_stat_reset(); seems like the fetch involves a index scans on creator_rev and creator_ident, *and* a scan on creator_rev. table_name | table_size | indexes_size | total_size --------------------------------------------------------------+------------+--------------+------------ "public"."release_ref" | 117 GB | 34 GB | 151 GB "public"."release_rev" | 34 GB | 9202 MB | 43 GB "public"."release_contrib" | 16 GB | 17 GB | 33 GB "public"."work_edit" | 5800 MB | 4033 MB | 9833 MB "public"."release_edit" | 5800 MB | 4032 MB | 9833 MB "public"."work_ident" | 4254 MB | 5102 MB | 9356 MB "public"."release_ident" | 4254 MB | 5095 MB | 9349 MB "public"."work_rev" | 2759 MB | 2545 MB | 5304 MB "public"."abstracts" | 1417 MB | 114 MB | 1530 MB "public"."creator_rev" | 337 MB | 277 MB | 614 MB "public"."creator_edit" | 316 MB | 221 MB | 536 MB "public"."creator_ident" | 232 MB | 279 MB | 511 MB "public"."file_rev_url" | 260 MB | 101 MB | 361 MB "public"."release_rev_abstract" | 155 MB | 200 MB | 355 MB "public"."file_rev" | 109 MB | 124 MB | 233 MB "public"."changelog" | 102 MB | 106 MB | 208 MB "public"."editgroup" | 116 MB | 69 MB | 184 MB "public"."file_ident" | 66 MB | 70 MB | 136 MB "public"."file_edit" | 78 MB | 53 MB | 131 MB "public"."file_release" | 38 MB | 64 MB | 103 MB "public"."container_rev" | 48 MB | 26 MB | 74 MB "public"."container_edit" | 30 MB | 20 MB | 50 MB "public"."container_ident" | 22 MB | 26 MB | 48 MB CHANGES: - remove "IS NOT NULL" from creator_rev, that seemed to be a significant speedup for the "row not found" case.