summaryrefslogtreecommitdiffstats
path: root/notes/import_timing_20180920.txt
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2018-09-25 15:44:39 -0700
committerBryan Newbold <bnewbold@robocracy.org>2018-09-25 15:44:39 -0700
commit186c03c2b9e1b08e4298383fc6593d1b2c3a5dd8 (patch)
treeb116678710d9a69df023f7dc90525d7563506891 /notes/import_timing_20180920.txt
parentc1391060cc4575365f81fc674e35d8c182ccde83 (diff)
downloadfatcat-186c03c2b9e1b08e4298383fc6593d1b2c3a5dd8.tar.gz
fatcat-186c03c2b9e1b08e4298383fc6593d1b2c3a5dd8.zip
start organizing notes into subdirectories
Diffstat (limited to 'notes/import_timing_20180920.txt')
-rw-r--r--notes/import_timing_20180920.txt96
1 files changed, 0 insertions, 96 deletions
diff --git a/notes/import_timing_20180920.txt b/notes/import_timing_20180920.txt
deleted file mode 100644
index a57ffd77..00000000
--- a/notes/import_timing_20180920.txt
+++ /dev/null
@@ -1,96 +0,0 @@
-
-This was on fatcat-prod-vm (2TB disk).
-
- time ./fatcat_import.py import-issn /srv/fatcat/datasets/journal_extra_metadata.csv
-
- Processed 53300 lines, inserted 53283, updated 0.
- real 0m32.463s
- user 0m8.716s
- sys 0m0.284s
-
- time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py import-orcid -
-
- Processed 48900 lines, inserted 48731, updated 0. <= these numbers times 80x
- 100% 80:0=0s
-
- real 10m20.598s
- user 26m16.544s
- sys 1m40.284s
-
- time xzcat /srv/fatcat/datasets/crossref-works.2018-01-21.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py import-crossref - /srv/fatcat/datasets/20180216.ISSN-to-ISSN-L.txt /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
-
- Processed 4679900 lines, inserted 3755867, updated 0.
- 107730.08user 4110.22system 16:31:25elapsed 188%CPU (0avgtext+0avgdata 447496maxresident)k
- 77644160inputs+361948352outputs (105major+49094767minor)pagefaults 0swaps
-
- => 16.5 hours, faster!
-
- select count(id) from release_ident; => 75106713
-
- kernel/system crashed after first file import (!), so don't have numbers from that.
-
-Table sizes at this point:
-
- select count(id) from file_ident; => 6334606
-
- Size: 389.25G
-
- table_name | table_size | indexes_size | total_size
---------------------------------------------------------------+------------+--------------+------------
- "public"."release_ref" | 170 GB | 47 GB | 217 GB
- "public"."release_rev" | 44 GB | 21 GB | 65 GB
- "public"."release_contrib" | 19 GB | 20 GB | 39 GB
- "public"."release_edit" | 6671 MB | 6505 MB | 13 GB
- "public"."work_edit" | 6671 MB | 6505 MB | 13 GB
- "public"."release_ident" | 4892 MB | 5875 MB | 11 GB
- "public"."work_ident" | 4892 MB | 5874 MB | 11 GB
- "public"."work_rev" | 3174 MB | 2936 MB | 6109 MB
- "public"."file_rev_url" | 3634 MB | 1456 MB | 5090 MB
- "public"."file_rev" | 792 MB | 1281 MB | 2073 MB
- "public"."abstracts" | 1665 MB | 135 MB | 1800 MB
- "public"."file_edit" | 565 MB | 561 MB | 1126 MB
- "public"."file_release" | 380 MB | 666 MB | 1045 MB
- "public"."file_ident" | 415 MB | 496 MB | 911 MB
- "public"."creator_rev" | 371 MB | 457 MB | 828 MB
- "public"."creator_edit" | 347 MB | 353 MB | 700 MB
- "public"."creator_ident" | 255 MB | 305 MB | 559 MB
- "public"."release_rev_abstract" | 183 MB | 237 MB | 421 MB
- "public"."changelog" | 122 MB | 126 MB | 247 MB
- "public"."editgroup" | 138 MB | 81 MB | 219 MB
- "public"."container_rev" | 52 MB | 38 MB | 89 MB
- "public"."container_edit" | 32 MB | 30 MB | 62 MB
- "public"."container_ident" | 24 MB | 28 MB | 52 MB
-
-
- relname | too_much_seq | case | rel_size | seq_scan | idx_scan
-----------------------+--------------+------+--------------+----------+----------
- release_edit | 0 | OK | 6993084416 | 0 | 0
- container_rev | 0 | OK | 54124544 | 0 | 0
- creator_ident | 0 | OK | 266919936 | 0 | 0
- creator_edit | 0 | OK | 363921408 | 0 | 0
- work_rev | 0 | OK | 3327262720 | 0 | 0
- creator_rev | 0 | OK | 388726784 | 0 | 0
- work_ident | 0 | OK | 5128560640 | 0 | 0
- work_edit | 0 | OK | 6993108992 | 0 | 0
- container_ident | 0 | OK | 25092096 | 0 | 0
- file_edit | 0 | OK | 598278144 | 0 | 0
- container_edit | 0 | OK | 33857536 | 0 | 0
- changelog | 0 | OK | 127549440 | 0 | 0
- abstracts | -4714 | OK | 1706713088 | 0 | 4714
- file_release | -13583 | OK | 401752064 | 0 | 13583
- file_rev_url | -13583 | OK | 3832389632 | 0 | 13583
- editgroup | -74109 | OK | 144277504 | 0 | 74109
- release_contrib | -76699 | OK | 20357849088 | 0 | 76699
- release_ref | -76700 | OK | 183009157120 | 3 | 76703
- release_rev_abstract | -76939 | OK | 192102400 | 0 | 76939
- release_rev | -77965 | OK | 47602647040 | 0 | 77965
- file_ident | -100089 | OK | 438255616 | 3 | 100092
- release_ident | -152809 | OK | 5128617984 | 0 | 152809
- file_rev | -440780 | OK | 837705728 | 0 | 440780
-(23 rows)
-
-Continuing imports:
-
- zcat /srv/fatcat/datasets/2018-08-27-2352.17-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py import-matched -
- => HTTP response body: {"message":"duplicate key value violates unique constraint \"file_edit_editgroup_id_ident_id_key\""}
-