The QA import is running really slowly; this is a parallel attempt in case
things are faster on the fatcat-prod2-vm machine, with a batch size of 50 and
bezerk mode.

NOTE: this ended up being the successful/"final" bootstrap import.

## Service up/down

    sudo service fatcat-web stop
    sudo service fatcat-api stop

    # shut down all the import/export/etc processes
    # delete any snapshots and /tmp/fatcat*
    sudo rm /srv/fatcat/snapshots/*
    sudo rm /tmp/fatcat_*

    # git pull
    # ansible playbook push
    # re-build fatcat-api to ensure that worked

    sudo service fatcat-web stop
    sudo service fatcat-api stop

    # as postgres user:
    DATABASE_URL=postgres://postgres@/fatcat_prod /opt/cargo/bin/diesel database reset
    sudo service postgresql restart

    http delete :9200/fatcat_release
    http delete :9200/fatcat_container
    http delete :9200/fatcat_changelog
    http put :9200/fatcat_release < release_schema.json
    http put :9200/fatcat_container < container_schema.json
    http put :9200/fatcat_changelog < changelog_schema.json
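    # (optional sanity check, not part of the original run) confirm the three
    # freshly-created indices exist and are empty:
    http get :9200/_cat/indices?v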
    sudo service elasticsearch stop
    sudo service kibana stop

    sudo service fatcat-api start

    # ensure rust/.env -> /srv/fatcat/config/fatcat_api.env
    wget https://archive.org/download/ia_journal_metadata/journal_metadata.2019-01-25.json

    # if necessary:
    #  ALTER USER fatcat WITH SUPERUSER;
    #  ALTER USER fatcat WITH PASSWORD '...';
    # create new auth keys via bootstrap (edit debug -> release first)
    # update config/env/ansible/etc with new tokens
    # delete existing entities

    # run the imports!

    # after running below imports
    sudo service fatcat-web start
    sudo service elasticsearch start
    sudo service kibana start

## Import commands

    git commit (as webcrawl): 1fe371288daf417cdf44b94e372b485426b47134
    rust version: 1.32.0

    export LC_ALL=C.UTF-8
    export FATCAT_AUTH_WORKER_JOURNAL_METADATA="..."
    time ./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-01-25.json

        Counter({'total': 107869, 'insert': 107823, 'skip': 46, 'update': 0, 'exists': 0})
        real    6m2.287s
        user    2m4.612s
        sys     0m5.664s

    export FATCAT_AUTH_WORKER_ORCID="..."
    time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -

        98% 79:1=22s
        Counter({'total': 48097, 'insert': 47908, 'skip': 189, 'exists': 0, 'update': 0})
        100% 80:0=0s                                                                                                 

        real    33m9.211s
        user    93m33.040s
        sys     5m32.176s

    export FATCAT_AUTH_WORKER_CROSSREF="..."
    time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 --bezerk-mode

        seems to be maintaining 9.1 MiB/sec and estimating about 15 hours; ~200 MB/sec disk write. we'll see!

        100 %        33.2 GiB / 331.9 GiB = 0.100   3.6 MiB/s   26:16:57

        Counter({'total': 5001477, 'insert': 4784708, 'skip': 216769, 'update': 0, 'exists': 0})
        395971.48user 8101.15system 26:17:07elapsed 427%CPU (0avgtext+0avgdata 431560maxresident)k
        232972688inputs+477055792outputs (334645major+39067735minor)pagefaults 0swaps

        real    1577m7.908s
        user    6681m58.948s
        sys     141m25.560s
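
        # sanity check (a sketch, not part of the original run): average decompressed
        # throughput over the 26:17:07 wall time, using the numbers above
        echo "scale=2; 331.9 * 1024 / (26*3600 + 17*60 + 7)" | bc
        # => 3.59 MiB/s, matching the ~3.6 MiB/s xz reported at the end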

    export FATCAT_AUTH_SANDCRAWLER="..."
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_SANDCRAWLER
    time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched - --bezerk-mode

        (output counters accidentally lost, but the run took about 3 hours)

    time zcat /srv/fatcat/datasets/2018-12-18-2237.09-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched -

        Counter({'total': 827944, 'insert': 555359, 'exists': 261441, 'update': 11129, 'skip': 15})
        32115.82user 1370.12system 4:30:25elapsed 206%CPU (0avgtext+0avgdata 37312maxresident)k
        28200inputs+3767112outputs (108major+471069minor)pagefaults 0swaps

        real    270m25.288s
        user    535m52.908s
        sys     22m56.328s

    time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 grobid-metadata - --longtail-oa

        1.6M 2:02:05 [ 218 /s]
        Counter({'total': 133095, 'insert': 120176, 'inserted.release': 120176, 'exists': 12919, 'skip': 0, 'update': 0})
        20854.82user 422.09system 2:02:12elapsed 290%CPU (0avgtext+0avgdata 63816maxresident)k
        29688inputs+21057912outputs (118major+809972minor)pagefaults 0swaps

        real    122m12.533s
        user    350m14.824s
        sys     7m29.820s

## After Import Stats

    bnewbold@wbgrp-svc503$ df -h .
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vda1       1.8T  591G  1.1T  36% /

    Database size:  294.82G

    select count(*) from changelog => 2,306,900
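
The database size above and the per-table listing below were presumably pulled
with standard catalog queries; a rough equivalent (a sketch, not necessarily the
exact queries used):

    sudo -u postgres psql fatcat_prod -c "SELECT pg_size_pretty(pg_database_size('fatcat_prod'));"
    sudo -u postgres psql fatcat_prod -c "
        SELECT n.nspname || '.' || c.relname AS table_name,
               pg_size_pretty(pg_table_size(c.oid))          AS table_size,
               pg_size_pretty(pg_indexes_size(c.oid))        AS indexes_size,
               pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
        FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE n.nspname = 'public' AND c.relkind = 'r'
        ORDER BY pg_total_relation_size(c.oid) DESC;"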


                          table_name                          | table_size | indexes_size | total_size 
--------------------------------------------------------------+------------+--------------+------------
 "public"."refs_blob"                                         | 70 GB      | 1896 MB      | 72 GB
 "public"."release_rev"                                       | 36 GB      | 32 GB        | 68 GB
 "public"."release_contrib"                                   | 25 GB      | 23 GB        | 48 GB
 "public"."release_edit"                                      | 9342 MB    | 10 GB        | 19 GB
 "public"."work_edit"                                         | 9342 MB    | 10 GB        | 19 GB
 "public"."release_ident"                                     | 6334 MB    | 10235 MB     | 16 GB
 "public"."work_ident"                                        | 6333 MB    | 10235 MB     | 16 GB
 "public"."file_rev_url"                                      | 6085 MB    | 2251 MB      | 8337 MB
 "public"."work_rev"                                          | 4092 MB    | 3795 MB      | 7887 MB
 "public"."file_rev"                                          | 1706 MB    | 2883 MB      | 4589 MB
 "public"."abstracts"                                         | 4089 MB    | 300 MB       | 4390 MB
 "public"."file_edit"                                         | 1403 MB    | 1560 MB      | 2963 MB
 "public"."file_ident"                                        | 944 MB     | 1529 MB      | 2473 MB
 "public"."file_rev_release"                                  | 889 MB     | 1558 MB      | 2447 MB
 "public"."release_rev_abstract"                              | 404 MB     | 536 MB       | 941 MB
 "public"."creator_rev"                                       | 371 MB     | 457 MB       | 827 MB
 "public"."creator_edit"                                      | 377 MB     | 420 MB       | 797 MB
 "public"."editgroup"                                         | 480 MB     | 285 MB       | 766 MB
 "public"."creator_ident"                                     | 255 MB     | 412 MB       | 667 MB
 "public"."changelog"                                         | 135 MB     | 139 MB       | 274 MB
 "public"."container_rev"                                     | 31 MB      | 11 MB        | 42 MB
 "public"."container_edit"                                    | 10 MB      | 12 MB        | 22 MB
 "public"."container_ident"                                   | 7216 kB    | 12 MB        | 19 MB

       relname        | too_much_seq | case |  rel_size   | seq_scan | idx_scan  
----------------------+--------------+------+-------------+----------+-----------
 creator_edit         |       -94655 | OK   |   395558912 |        2 |     94657
 container_edit       |       -94655 | OK   |    10911744 |        2 |     94657
 file_edit            |       -94655 | OK   |  1470627840 |        2 |     94657
 work_edit            |       -94655 | OK   |  9793445888 |        2 |     94657
 release_edit         |       -94655 | OK   |  9793241088 |        2 |     94657
 container_rev        |     -1168077 | OK   |    32546816 |        3 |   1168080
 file_rev_release     |     -3405015 | OK   |   931627008 |        2 |   3405017
 file_rev_url         |     -3405015 | OK   |  6379298816 |        2 |   3405017
 changelog            |     -3883131 | OK   |   141934592 |      382 |   3883513
 abstracts            |     -8367919 | OK   |  4011868160 |        1 |   8367920
 creator_ident        |     -9066121 | OK   |   267124736 |        5 |   9066126
 creator_rev          |    -14129509 | OK   |   388431872 |        3 |  14129512
 release_contrib      |    -17121962 | OK   | 26559053824 |        3 |  17121965
 release_rev_abstract |    -17123930 | OK   |   423878656 |        3 |  17123933
 file_ident           |    -18428366 | OK   |   989888512 |        5 |  18428371
 refs_blob            |    -50251199 | OK   | 15969484800 |        1 |  50251200
 container_ident      |    -74332007 | OK   |     7364608 |        5 |  74332012
 file_rev             |    -99555196 | OK   |  1788166144 |        4 |  99555200
 release_ident        |   -132347345 | OK   |  6639624192 |        5 | 132347350
 work_rev             |   -193625747 | OK   |  4289314816 |        1 | 193625748
 work_ident           |   -196604815 | OK   |  6639476736 |        5 | 196604820
 editgroup            |   -214491911 | OK   |   503414784 |        3 | 214491914
 release_rev          |   -482813156 | OK   | 38609838080 |       11 | 482813167
(23 rows)
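
The seq-scan vs. index-scan check above looks like the usual pg_stat_user_tables
query; roughly (a sketch, not necessarily the exact query or size cutoff used):

    sudo -u postgres psql fatcat_prod -c "
        SELECT relname,
               seq_scan - idx_scan AS too_much_seq,
               CASE WHEN seq_scan - idx_scan > 0 THEN 'Missing Index?' ELSE 'OK' END,
               pg_relation_size(relid) AS rel_size,
               seq_scan, idx_scan
        FROM pg_stat_user_tables
        WHERE pg_relation_size(relid) > 80000
        ORDER BY too_much_seq DESC;"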

## Dump Stats / Process

    DATABASE_URL=fatcat_prod ./ident_table_snapshot.sh /tmp

        postgres@wbgrp-svc503:/srv/fatcat/src/extra/sql_dumps$ DATABASE_URL=fatcat_prod ./ident_table_snapshot.sh /tmp
        Will move output to '/tmp'
        Running SQL (from 'fatcat_prod')...
        BEGIN
        COPY 1
        COPY 3906704 -> creators
        COPY 107826 -> containers
        COPY 14378465 -> files
        COPY 3 -> filesets
        COPY 3 -> webcaptures
        COPY 96812903 -> releases
        COPY 96812903 -> works
        COPY 2306900 -> changelog
        ROLLBACK

        Done: /tmp/fatcat_idents.2019-02-01.214959.r2306900.tar.gz

    fatcat-export:
        x files
        x containers
        - releases_extended (TODO: estimate time to dump based on file timestamps; see the rough sketch below)

        cat /tmp/fatcat_ident_releases.tsv | ./target/release/fatcat-export release --expand files,filesets,webcaptures,container -j8 | pv -l | gzip > /srv/fatcat/snapshots/release_export_expanded.json.gz

        96.8M 7:37:51 [3.52k/s]

        -rw-rw-r-- 1 webcrawl webcrawl  64G Feb  2 05:45 release_export_expanded.json.gz
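
        # rough ETA sketch for the TODO above (not in the original notes; uses the
        # pv rate rather than file timestamps): the release ident count is known
        # from the snapshot step, so divide it by the ~3.52k docs/sec pv shows
        echo "scale=1; 96812903 / 3520 / 3600" | bc
        # => 7.6 hours; the actual run came in at 7:37:51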

    sql dumps:

        time sudo -u postgres pg_dump --verbose --format=tar fatcat_prod | pigz > /srv/fatcat/snapshots/fatcat_private_dbdump_${DATESLUG}.tar.gz

        real    112m34.310s
        user    296m46.112s
        sys     22m35.004s

        -rw-rw-r-- 1 bnewbold bnewbold  81G Feb  2 04:15 fatcat_private_dbdump_2019-02-02.022209.tar.gz

Looking for repeated SHA-1 and DOI:

    zcat file_hashes.tsv.gz | cut -f 3 | sort -S 8G | uniq -cd | sort -n > repeated_sha1.tsv
    => none

    zcat release_extid.tsv.gz | cut -f 3 | sort -S 8G | uniq -cd | sort -n > repeated_doi.tsv
    => a few million repeated *blank* lines... could filter out?
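
Filtering the blank lines out first would make the DOI check cleaner (a sketch
of the same pipeline; output filename is arbitrary):

    zcat release_extid.tsv.gz | cut -f 3 | grep -v '^$' | sort -S 8G | uniq -cd | sort -n > repeated_doi.noblank.tsv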

## Load Stats / Progress

    export LC_ALL=C.UTF-8
    time zcat /srv/fatcat/snapshots/release_export_expanded.json.gz | pv -l | ./fatcat_export.py transform-releases - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_release -type release
    time zcat /srv/fatcat/snapshots/container_export.json.gz | pv -l | ./fatcat_export.py transform-containers - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_container -type container


    time zcat /srv/fatcat/snapshots/2019-01-30/container_export.json.gz | pv -l | ./fatcat_export.py transform-containers - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_container -type container

        real    0m58.528s
        user    1m0.396s
        sys     0m2.412s

    # very python-CPU-limited, so crank it up to -j20
    # hadn't used '--linebuffer' with parallel before; without it, parallel
    # buffers each job's entire output before passing it on to the next program
    # in the pipe
    time zcat /srv/fatcat/snapshots/2019-01-30/release_export_expanded.json.gz | pv -l | parallel -j20 --linebuffer --round-robin --pipe ./fatcat_export.py transform-releases - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_release -type release

        165k 0:00:10 [18.4k/s]

        2019/02/02 09:30:49 96812900 docs in 2h27m32.835681602s at 10935.807 docs/s with 8 workers
        2019/02/02 09:30:49 applied setting: {"index": {"refresh_interval": "1s"}} with status 200 OK
        2019/02/02 09:30:49 applied setting: {"index": {"number_of_replicas": "1"}} with status 200 OK
        2019/02/02 09:31:03 index flushed: 200 OK

        real    147m46.387s
        user    2621m40.420s
        sys     56m11.456s

    sudo su postgres
    dropdb fatcat_prod
    #zcat fatcat_private_dbdump_2019-02-02.022209.tar.gz | pg_restore --clean --if-exists --create --exit-on-error -d fatcat_prod
    createdb fatcat_prod
    time zcat fatcat_private_dbdump_2019-02-02.022209.tar.gz  | pg_restore --exit-on-error --clean --if-exists --dbname fatcat_prod

        seems to go pretty fast, so multiple jobs probably not needed (and
        pg_restore can't do a parallel restore when reading the archive from
        stdin anyway)

        real    284m40.448s
        user    58m45.240s
        sys     7m33.600s
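
A quick post-restore sanity check (a sketch; not recorded in the original run)
is to compare the changelog count against the pre-dump number:

    sudo -u postgres psql fatcat_prod -c 'SELECT COUNT(*) FROM changelog;'
    # expect 2,306,900, matching the count from before the dump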

DONE: delete old elastic index

## Bugs/Issues encountered

x in_ia_sim is broken; not passing through
x elastic port (9200) was not open to cluster
    => but should close; should be over HTTP
x elasticsearch host was wrong; should be search.fatcat.wiki
x postgres config wasn't actually getting installed in the right place by
  ansible (!!!), which probably had crazy effects on performance, etc
x postgres version confusion was because both versions (server and client) can
  be installed in parallel, and the older version "wins". wiping the VM would solve this.
x should try pigz for things like ident_table_snapshot and exports? these seem
  to be gzip-limited (see the pigz sketch after this list)
- fatcat-export and pg_dump seem to mutually lock (transaction-wise), which is
  unexpected. fatcat-export should have very loose (low-priority) transaction
  scope, because it already has the full release_rev id, and pg_dump should
  also be in background/non-linear mode (except for "public" dumps?)
    => this was somewhat subtle; didn't completely lock
- this machine is postgres 10, not postgres 11. same with fatcat-prod1-vm.
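
On the pigz item above: the easy swap is the gzip stage of the release export
pipeline; same command as the export run earlier, only the compressor changed
(a sketch):

    cat /tmp/fatcat_ident_releases.tsv | ./target/release/fatcat-export release --expand files,filesets,webcaptures,container -j8 | pv -l | pigz > /srv/fatcat/snapshots/release_export_expanded.json.gz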

Added to TODO:
- want a better "write lock" flag (on database) other than clearing auth key
- KBART CLOCKSS reports (and maybe LOCKSS?) have repeated lines, need to be merged
- empty AUTH_ALT_KEYS should just be ignored (not try to parse)

## Metadata Quality Notes

- crossref references look great!
- extra/crossref/alternative-id often includes exact full DOI
        10.1158/1538-7445.AM10-3529
        10.1158/1538-7445.am10-3529
    => but not always? publisher-specific
- contribs[]/extra/seq often has "first" from crossref
    => is this helpful?
- abstracts content is fine, but should probably check for "jats:" when setting
  mimetype
x BUG: `license_slug` not set when the license URL is https://creativecommons.org/licenses/by-nc-sa/4.0
    => https://api.qa.fatcat.wiki/v0/release/55y37c3dtfcw3nw5owugwwhave
       10.26891/jik.v10i2.2016.92-97
- original title works, yay!
    https://api.qa.fatcat.wiki/v0/release/nlmnplhrgbdalcy472hfb2z3im
    10.2504/kds.26.358
- new license: https://www.karger.com/Services/SiteLicenses
- not copying ISBNs: 10.1016/b978-0-08-037302-7.50022-7
    "9780080373027"
    could at least put in alternative-id?
- BUG: subtitle coming through as an array, not string
- `license_slug` does get set
    eg for PLOS ONE http://creativecommons.org/licenses/by/4.0/
- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
- BUG (?): file missing size:
    https://fatcat.wiki/file/wpvkiqx2w5celc3ajyfsh3cfsa
- webface BUG: file-to-release links missing
- webface meh: still need to collapse links by domain better, and also treat
  www.example.com and example.com as the same

I think this is good (enough)!

Possible other KBART sources: HathiTrust, PKP Preservation Network (open, OJS), Scholars Portal (?), British Library

Nature magazine KBART comes up empty (?)
    ISSN-L: 0028-0836
    https://fatcat.wiki/container/drfdii35rzaibj3aml5uhvr5xm

Missing DOIs (out of scope?):

    DOI not found: 10.1023/a:1009888907797
    DOI not found: 10.1186/1471-2148-4-49
    DOI not found: 10.1023/a:1026471016927
    DOI not found: 10.1090/s0002-9939-04-07569-0
    DOI not found: 10.1186/1742-4682-1-11
    DOI not found: 10.1186/1477-3163-2-5
    DOI not found: 10.1186/gb-2003-4-4-210
    DOI not found: 10.1186/gb-2004-5-9-r63
    DOI not found: 10.13188/2330-2178.1000008
    DOI not found: 10.4135/9781473960749
    DOI not found: 10.1252/kakoronbunshu1953.36.479
    DOI not found: 10.2320/materia.42.461
    DOI not found: 10.1186/1742-4933-3-3
    DOI not found: 10.14257/ijsh
    DOI not found: 10.1023/a:1016008714781
    DOI not found: 10.1023/a:1016648722322
    DOI not found: 10.1787/5k990rjhvtlv-en
    DOI not found: 10.4064/fm
    DOI not found: 10.1090/s0002-9947-98-01992-8
    DOI not found: 10.1186/1475-925x-2-16
    DOI not found: 10.1186/1479-5868-3-9
    DOI not found: 10.1090/s0002-9939-03-07205-8
    DOI not found: 10.1023/a:1008111923880
    DOI not found: 10.1090/s0002-9939-98-04322-6
    DOI not found: 10.1186/gb-2005-6-11-r93
    DOI not found: 10.5632/jila1925.2.236
    DOI not found: 10.1023/a:1011359428672
    DOI not found: 10.1090/s0002-9947-97-01844-8
    DOI not found: 10.1155/4817
    DOI not found: 10.1186/1472-6807-1-5
    DOI not found: 10.1002/(issn)1542-0981
    DOI not found: 10.1186/rr115
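
A way to spot-check whether these DOIs resolve at all (as opposed to just being
absent from the Crossref dump) is to hit doi.org directly (a sketch, using
httpie as elsewhere in these notes):

    http --headers get https://doi.org/10.1023/a:1009888907797
    # a 302 redirect means the DOI is registered; a 404 means it is not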