Run a partial ~5 million paper batch through:

    zcat /srv/fatcat_scholar/release_export.2019-07-07.5mil_fulltext.json.gz \
        | parallel -j8 --line-buffer --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > data/work_intermediate.5mil.json.gz
    => 5M 21:36:14 [64.3 /s]

    # runs about 70 works/sec with this parallelism => 1mil in 4hr, 5mil in 20hr
    # looks like seaweedfs is the bottleneck?
    # tried stopping persist workers on seaweedfs and saw basically no change

Indexing to ES seems to take about an hour per million, or so; can check index monitoring to get a better number.

## 2020-07-23 First Full Release Batch

Patched to skip fetching `pdftext`.

Run the full batch through (on aitio), expecting this to take on the order of a week:

    zcat /fast/download/release_export_expanded.json.gz \
        | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > /grande/snapshots/fatcat_scholar_work_fulltext.20200723.json.gz

Ah, this was running really slowly because `MINIO_SECRET_KEY` was not set. Really should replace the `minio` python client library, as we are now using seaweedfs!

Got an error:

    36.1M 15:29:38 [ 664 /s]
    parallel: Error: Output is incomplete. Cannot append to buffer file in /fast/tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
    Warning: unable to close filehandle properly: No space left on device during global destruction.

Might have been due to `/` filling up (not `/fast/tmp`)? Had gotten pretty far into processing. Restarted; will keep an eye on it.

To index, run from the ES machine, as bnewbold:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.partial.20200723.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc

Hrm, again:

    99.9M 56:04:41 [ 308 /s]
    parallel: Error: Output is incomplete. Cannot append to buffer file in /fast/tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

Confirmed that the disk was full at that moment; frustrating, as disk usage had been low enough when last checked, and data was flowing to /grande (large spinning disk). Should be sufficient to move the release dump to `/bigger` and clear more space on `/fast` to do the full indexing.

    /dev/sdg1       917G  871G     0 100% /fast
    vs.
    /dev/sdg1       917G  442G  430G  51% /fast

    -rw-rw-r-- 1 bnewbold bnewbold 418G Jul 27 05:55 fatcat_scholar_work_fulltext.20200723.json.gz

Got to about 2/3 of the full release dump. Current rough estimates for total processing times:

- enrich 150 million releases: 80hr (3-4 days), 650 GByte on disk (gzip)
- transform and index 150 million releases: 55hr (2-3 days), 1.5 TByte on disk (?)

Failed again, due to a null `release.extra` field.
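The `release.extra` failure is presumably just an unguarded dict access somewhere in the transform path; the general shape of the fix (a hedged sketch with a hypothetical helper name, not the actual patch) is to treat `extra` as optional:

    from typing import Any, Dict, Optional

    def get_release_extra(release: Dict[str, Any], *keys: str) -> Optional[Any]:
        """Hypothetical helper: walk release['extra'] defensively, since `extra`
        can be null in release exports (the cause of the failure above)."""
        obj: Any = release.get("extra") or {}
        for k in keys:
            if not isinstance(obj, dict):
                return None
            obj = obj.get(k)
            if obj is None:
                return None
        return obj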
Resume indexing, skipping past the records that were already indexed before the failure:

    # 14919639 + 83111800 = 98031439
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723.json.gz | gunzip | tail -n +98031439 | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc

SIM remote indexing command:

    # size before (approx): 743.4 GByte, 98031407 docs; 546G disk free
    ssh aitio.us.archive.org cat /bigger/scholar_old/sim_intermediate.2020-07-23.json.gz | gunzip | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc
    => 1967593 docs in 2h8m32.549646403s at 255.116 docs/s with 4 workers
    # size after: 753.8gb, 99926090 docs, 533G disk free

Trying the dump again on aitio, with an alternative tmpdir:

    git log | head -n1
    commit 2f0874c84e71a02a10e21b03688593a4aa5ef426

    df -h /sandcrawler-db/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdf1       1.8T  684G  1.1T  40% /sandcrawler-db

    export TMPDIR=/sandcrawler-db/tmp
    zcat /fast/download/release_export_expanded.json.gz \
        | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz

## ES Performance Iteration (2020-07-27)

- schema: switch abstracts from nested to a simple array
- query: include fewer fields: just biblio (with boost; and maybe title) and "everything"
- query: use date-level granularity for time queries (may already do this?)
- set replica=0 (for now)
- set shards=12, to optimize *individual query* performance
  => if estimating 800 GByte index size, this is 60-70 GByte per shard
- set `index.codec=best_compression` to trade CPU for disk I/O
- ensure transform output is sorted by key =>
- ensure number of cores is large
- return fewer results (15 vs. 25) => less highlighting => fewer thumbnails to fetch

## Work Grouping

Plan for work-grouped expanded release dumps: have the release identifier dump script include, and sort by, `work_id`. This will definitely slow down that stage; unclear if by too much (`work_id` is indexed). The bulk dump script then iterates, makes per-work batches of releases to dump, and passes a Vec of each batch to worker threads. Worker threads pass back a Vec of entities, which then get printed (same work) sequentially.

## ES Performance Profiling (2020-08-05)

Index size:

    green open scholar_fulltext_v01 uthJZJvSS-mlLIhZxrlVnA 12 0 102039078 578722 748.9gb 748.9gb

Unless otherwise mentioned, these are with default filters in place.
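The timings below were taken by hand against the live cluster. For reproducing them, a minimal harness might look like the following hedged sketch (assumes direct HTTP access on port 9200 and the `requests` library; the endpoint matches the index above but is otherwise illustrative):

    import requests

    # assumed endpoint; illustrative only
    ES_SEARCH = "http://localhost:9200/scholar_fulltext_v01/_search"

    def timed_search(query_body: dict) -> None:
        """POST one of the query bodies below and report the hit count plus the
        server-side `took` time (milliseconds)."""
        resp = requests.post(ES_SEARCH, json=query_body, timeout=30)
        resp.raise_for_status()
        result = resp.json()
        total = result["hits"]["total"]
        if isinstance(total, dict):  # ES 7.x wraps the count as {"value": ..., "relation": ...}
            total = total["value"]
        print(f"{total} hits, took {result['took']} ms")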
Baseline:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"boosting": {"positive": {"bool": {"must": [{"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["title^5", "biblio_all^3", "abstracts.body^2", "fulltext.body", "everything"]}}], "should": [{"terms": {"access_type": ["ia_sim", "ia_file", "wayback"]}}]}}, "negative": {"bool": {"should": [{"bool": {"must_not": [{"exists": {"field": "title"}}]}}, {"bool": {"must_not": [{"exists": {"field": "year"}}]}}, {"bool": {"must_not": [{"exists": {"field": "type"}}]}}, {"bool": {"must_not": [{"exists": {"field": "stage"}}]}}, {"bool": {"must_not": [{"exists": {"field": "biblio.container_ident"}}]}}]}}, "negative_boost": 0.5}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15, "highlight": {"fields": {"abstracts.body": {"number_of_fragments": 2, "fragment_size": 300}, "fulltext.body": {"number_of_fragments": 2, "fragment_size": 300}, "fulltext.acknowledgment": {"number_of_fragments": 2, "fragment_size": 300}, "fulltext.annex": {"number_of_fragments": 2, "fragment_size": 300}}}}

    jenny durkin => 60 Hits in 1.3sec
    "looking at you kid" => 83 Hits in 6.6sec
    LIGO black hole => 2,440 Hits in 1.6sec
    "configuration that formed when the core of a rapidly rotating massive star collapsed" => 1 Hits in 8.0sec
        => requery: in 0.3sec

Disable everything, query only `biblio_all`:

    {"query": {"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}, "from": 0, "size": 15}

    newbold => 2,930 Hits in 0.12sec
    guardian galaxy => 15 Hits in 0.19sec
    * => 102,039,078 Hits in 0.86sec (same on repeat)

Query only `everything`:

    guardian galaxy => 1,456 Hits in 0.26sec
    avocado mexico => 3,407 Hits in 0.3sec, repeat in 0.017sec
    * => 102,039,078 Hits in 0.9sec (same on repeat)

Query all the fields with boosting:

    {"query": {"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["title^5", "biblio_all^3", "abstracts.body^2", "fulltext.body", "everything"]}}, "from": 0, "size": 15}

    berlin population => 168,690 Hits in 0.93sec, repeat in 0.11sec
    internet archive => 115,139 Hits in 1.1sec
    * => 102,039,078 Hits in 4.1sec (same on repeat)

Query only "everything", add highlighting (default config):

    indiana human => 86,556 Hits in 0.34sec, repeat in 0.04sec
        => scholar-qa: 86,358 Hits in 2.4sec, repeat in 0.47sec
    wikipedia => 73,806 Hits in 0.13sec

Query only "everything", no highlighting, basic filters:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"query_string": {"query": "reddit", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["everything"]}}]}}, "from": 0, "size": 15}

    reddit => 5,608 Hits in 0.12sec
    "participate in this collaborative editorial process" => 1 Hits in 7.9sec, repeat in 0.4sec
        scholar-qa: timeout (>10sec)
    "one if by land, two if by sea" => 20 Hits in 4.5sec
Query only "title", no highlighting, basic filters:

    "discontinuities and noise due to crosstalk" => 0 Hits in 0.24sec
        scholar-qa: 1 Hits in 4.7sec

Query only "everything", no highlighting, collapse key:

    greed => 35,941 Hits in 0.47sec
    bjog child => 6,616 Hits in 0.4sec

Query only "everything", no highlighting, collapse key, boosting:

    blue => 2,407,966 Hits in 3.1sec
        scholar-qa: 2,407,967 Hits in 1.6sec
    distal fin tuna => 390 Hits in 0.61sec
    "greater speed made possible by the warm muscle" => 1 Hits in 1.2sec

Query "everything", highlight "everything", collapse key, boosting (default, but only "everything" match). NOTE: highlighting didn't work:

    green => 2,742,004 Hits in 3.1sec, repeat in 2.8sec
    "comprehensive framework for the influences" => 1 Hits in 3.1sec
    bivalve extinct => 6,631 Hits in 0.47sec
    redwood "big basin" => 69 Hits in 0.5sec

Default, except only search+highlight "fulltext.body":

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"boosting": {"positive": {"bool": {"must": [{"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["fulltext.body"]}}], "should": [{"terms": {"access_type": ["ia_sim", "ia_file", "wayback"]}}]}}, "negative": {"bool": {"should": [{"bool": {"must_not": [{"exists": {"field": "title"}}]}}, {"bool": {"must_not": [{"exists": {"field": "year"}}]}}, {"bool": {"must_not": [{"exists": {"field": "type"}}]}}, {"bool": {"must_not": [{"exists": {"field": "stage"}}]}}, {"bool": {"must_not": [{"exists": {"field": "biblio.container_ident"}}]}}]}}, "negative_boost": 0.5}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15, "highlight": {"fields": {"fulltext.body": {"number_of_fragments": 2, "fragment_size": 300}}}}

    radioactive fish eye yellow => 1,401 Hits in 0.84sec
    "Ground color yellowish pale, snout and mouth pale gray" => 1 Hits in 1.1sec

Back to baseline:

    "palace of the fine arts" => 26 Hits in 7.4sec
    john => 1,812,894 Hits in 3.1sec

Everything disabled, but fulltext query all the default fields:

    {"query": {"query_string": {"query": "john", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["title^5", "biblio_all^3", "abstracts.body^2", "fulltext.body", "everything"]}}, "from": 0, "size": 15}

    jane => 318,757 Hits in 0.29sec
    distress dolphin plant echo => 355 Hits in 1.5sec
    "Michael Longley's most recent collection of poems" => 1 Hits in 1.2sec
    aqua => 95,628 Hits in 0.27sec

Defaults, but query only "biblio_all":

    "global warming" => 2,712 Hits in 0.29sec
    pink => 1,805 Hits in 0.24sec
    * => 20,426,310 Hits in 7.5sec
    review => 795,060 Hits in 1.5sec
    "to be or not" => 319 Hits in 0.81sec

Simple filters, `biblio_all`, boosting disabled:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15}

    open => 155,337 Hits in 0.31sec
    all => 40,880 Hits in 0.24sec
    the => 7,369,084 Hits in 0.75sec
Boosting disabled, query only `biblio_all`:

    "triangulations among all simple spherical ones can be seen to be" => 0 Hits in 0.6sec, again in 0.028sec
    "di Terminal Agribisnis (Holding Ground) Rancamaya Bogor" => 1 Hits in 0.21sec
    "to be or not" => 319 Hits in 0.042sec

Same as above, add boosting back in:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"query_string": {"query": "the", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}, {"boosting": {"positive": {"bool": {"must": [{"query_string": {"query": "the", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}], "should": [{"terms": {"access_type": ["ia_sim", "ia_file", "wayback"]}}]}}, "negative": {"bool": {"should": [{"bool": {"must_not": [{"exists": {"field": "title"}}]}}, {"bool": {"must_not": [{"exists": {"field": "year"}}]}}, {"bool": {"must_not": [{"exists": {"field": "type"}}]}}, {"bool": {"must_not": [{"exists": {"field": "stage"}}]}}, {"bool": {"must_not": [{"exists": {"field": "biblio.container_ident"}}]}}]}}, "negative_boost": 0.5}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15}

    the => 7,369,084 Hits in 5.3sec, repeat in 5.1sec

Removing `poor_metadata` fields:

    tree => 1,521,663 Hits in 2.3sec, again in 2.2sec
    all but one removed... tree => 1,521,663 Hits in 1.0sec, again in 0.84sec
    3/5 negative... tree => 1,521,663 Hits in 3.5sec
    no boosting... tree => 1,521,663 Hits in 0.2sec

Testing "rescore" (with collapse disabled; `window_size`=50):

    search = search.query(basic_fulltext)
    search = search.extra(
        rescore={
            "window_size": 100,
            "query": {
                "rescore_query": Q(
                    "boosting",
                    positive=Q("bool", must=basic_fulltext, should=[has_fulltext]),
                    negative=poor_metadata,
                    negative_boost=0.5,
                ).to_dict(),
            },
        }
    )

    green; access:everything (rescoring) => 331,653 Hits in 0.05sec, again in 0.053sec
    *; access:everything (rescoring) => 93,043,404 Hits in 1.2sec, again in 1.2sec
    green; access:everything (rescoring) => 331,653 Hits in 0.041sec, again in 0.038sec
    *; access:everything (no boost) => 93,043,404 Hits in 1.1sec, again in 1.2sec
    green; access:everything (boost query) => 331,653 Hits in 0.96sec, again in 0.95sec
    *; access:everything (boost query) => 93,043,404 Hits in 13sec

Other notes:

    counting all records, default filters ("*")
        scholar-qa: 20,426,296 Hits in 7.4sec
        svc097: 20,426,310 Hits in 8.6sec
    "to be or not to be" hamlet
        scholar-qa: timeout, then 768 Hits in 0.73sec
        svc097: 768 Hits in 2.5sec, then 0.86 sec
    "to be or not to be"
        svc98: 16sec

Speculative notes:

- querying more fields definitely seems heavy
- should try `require_field_match` with the highlighter, to allow query and highlight fields to be separate? or perhaps even a separate highlighter query: query "everything", highlight specific fields (sketched below)
- scoring/boosting large responses (more than a few hundred thousand hits) seems expensive; this includes the trivial "*" query
- some fulltext phrase queries seem to always be expensive; look into phrase indexing, eg term n-grams?
  => looks like the simple `index_phrases` parameter is sufficient for the basic case
- not a performance thing, but should revisit schema and field storage to reduce size. eg, are we storing "exact" separately from stemming? does that increase size? are fulltext.body and everything redundant?
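To make the query/highlight split concrete: ES highlighting accepts `require_field_match: false` (and per-field `highlight_query`), so the main query can match only `everything` while snippets still come from specific fields. A hedged sketch of such a request body, not the current production query:

    # hedged sketch: query only "everything", but highlight fulltext.body/abstracts.body
    query_body = {
        "query": {
            "query_string": {
                "query": "coffee",
                "default_operator": "AND",
                "fields": ["everything"],
            }
        },
        "highlight": {
            "require_field_match": False,  # allow highlight fields to differ from query fields
            "fields": {
                "fulltext.body": {"number_of_fragments": 2, "fragment_size": 300},
                "abstracts.body": {"number_of_fragments": 2, "fragment_size": 300},
            },
        },
        "size": 15,
    }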
TL;DR:

- scoring a large result set (with boost) is slow (eg, "*"), but not bad for smaller result sets
  => confirmed this makes a difference, but can't do collapse at the same time
- phrase queries are expensive, especially against fulltext
- querying/matching multiple fields is also proportionately expensive

TODO:

x index tweaks: smaller number types (eg, for year; also volume, issue, pages, contrib counts): https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html
x also sort and remove null keys when sending docs to ES
  => already done
x experiment with rescore for things like `has_fulltext` and metadata quality boost. with a large window?
x query on fewer fields and separate highlight fields from query fields (title, `biblio_all`, everything)
x consider not having `biblio_all.exact`
x enable `index_phrases` on at least `everything`, then reindex
  => start with ~1mil test batch
x consider not storing `everything` on disk at all, and maybe not `biblio_all` either (only use these for querying). some way to not make fulltext.body queryable?
- PROBLEM: can't do `collapse` and `rescore` together
  => try only a boolean query instead of boosting
  => at least superficially, no large difference
x special case the "*" query and do no scoring, maybe even sort by `_doc`
  => huge difference for this specific query
  => could query twice: once with regular scoring + collapse, but "halt after" a short number of hits to reduce rescoring (?), and a second time with no responses to get the total count
  => could manually rescore in client code, just from the returned hits?

future questions:

- consider deserializing hit _source documents to pydantic objects (to avoid null field errors)
- how much of current disk usage is terms? will `index_phrases` make this worse?
- do we need to store term offsets in indexes to make phrase queries faster/better, especially if the field is not stored?

Performance seems to have diverged between the two instances, not sure why. Maybe some query terms just randomly are faster on one instance or the other? Eg, "wood".

## 2020-08-07 Test Phrase Indexing

Indexing 1 million papers twice, with the old and new schema, to check the impact of phrase indexing, in ES 7.x.
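The schema difference under test is (as far as these notes go) the standard `index_phrases` option on text fields. A hedged, illustrative mapping fragment, not the actual contents of `schema/scholar_fulltext.v01.json`:

    # illustrative only: the general shape of enabling index_phrases on a text field
    mappings_fragment = {
        "properties": {
            "everything": {
                "type": "text",
                "index_phrases": True,  # also index two-term shingles, which speeds up exact phrase queries
            },
        },
    }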
    release_export.2019-07-07.5mil_fulltext.json.gz

    # older commit, no phrase indexing
    git checkout 0c7a2ace5d7c5b357dd4afa708a07e3fa85849fd
    http put ":9200/qa_scholar_fulltext_0c7a2ace?include_type_name=true" < schema/scholar_fulltext.v01.json

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz \
        | gunzip \
        | head -n1000000 \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_0c7a2ace -type _doc

    # master branch, phrase indexing
    git checkout 2c681e32756538c84b292cc95b623ee9758846a6
    http put ":9200/qa_scholar_fulltext_2c681e327?include_type_name=true" < schema/scholar_fulltext.v01.json

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz \
        | gunzip \
        | head -n1000000 \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_2c681e327 -type _doc

    http get :9200/_cat/indices
    [...]
    green open qa_scholar_fulltext_0c7a2ace  BQ9tH5OZT0evFCXiIJMdUQ 12 0 1000000 0 6.7gb 6.7gb
    green open qa_scholar_fulltext_2c681e327 PgRMn5v-ReWzGlCTiP7b6g 12 0 1000000 0 9.5gb 9.5gb
    [...]

So phrase indexing costs about a 42% larger index on disk (9.5 vs. 6.7 GByte per million docs), even with other changes to reduce size. Extrapolated to ~150 million works that is roughly 1.4 TByte for works alone, so we will probably approach 2 TByte total index size.

    "to be or not to be"
    => qa_scholar_fulltext_0c7a2ace:  65 Hits in 0.2sec (after repetitions)
    => qa_scholar_fulltext_2c681e327: 65 Hits in 0.065sec

    to be or not to be
    => qa_scholar_fulltext_0c7a2ace:  87,586 Hits in 0.16sec
    => qa_scholar_fulltext_2c681e327: 87,590 Hits in 0.16sec

    "Besides all beneficial properties studied for various LAB, a special attention need to be pay on the possible cytotoxicity levels of the expressed bacteriocins"
    => qa_scholar_fulltext_0c7a2ace:  1 Hits in 0.076sec
    => qa_scholar_fulltext_2c681e327: 1 Hits in 0.055sec

    "insect swarm"
    => qa_scholar_fulltext_0c7a2ace:  4 Hits in 0.032sec
    => qa_scholar_fulltext_2c681e327: 4 Hits in 0.024sec

    "how to"
    => qa_scholar_fulltext_0c7a2ace:  15,761 Hits in 0.11sec
    => qa_scholar_fulltext_2c681e327: 15,763 Hits in 0.054sec

Sort of splitting hairs at this scale, but it does seem like phrase indexing helps with some queries. Seems worth at least trying with the large/full index.

## 2020-08-07 Iterated Release Batch

Sharded indexing:

    zcat /fast/download/release_export_expanded.2020-08-05.json.gz | split --lines 25000000 - release_export_expanded.split_ -d --additional-suffix .json

    export TMPDIR=/sandcrawler-db/tmp
    for SHARD in {00..06}; do
        cat /bigger/scholar/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz > /grande/scholar/2020-12-30/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done

Record counts:

    24.7M 15:09:08 [ 452 /s]
    24.7M 16:11:22 [ 423 /s]
    24.7M 16:38:19 [ 412 /s]
    24.7M 17:29:46 [ 392 /s]
    24.7M 14:55:53 [ 459 /s]
    24.7M 15:02:49 [ 456 /s]
       2M  1:10:36 [ 472 /s]

Have made transform code changes; now at git rev 7603dd0ade23e22197acd1fd1d35962c314cf797.
Transform and index, on the svc097 machine:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
        | gunzip \
        | head -n2000000 \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc

Derp, got a batch-size error. But maybe it was just a single huge doc? Added a hack to try to skip transform of very large docs to start. In the future, should truncate specific fields (probably fulltext).

Ahah, the actual error was:

    2020/08/12 23:19:15 {"mapper_parsing_exception" "failed to parse field [biblio.issue_int] of type [short] in document with id 'work_aezuqrgnnfcezkkeoyonr6ll54'. Preview of field's value: '48844'" "" "" ""}

(48844 overflows the `short` type, which tops out at 32,767.)

Full indexing:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | pv -l \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
        2> /tmp/error.txt 1> /tmp/output.txt

    Started: 2020-08-12 14:24
    6.71M 2:46:56 [ 590 /s]

Yikes, is this going to take 60 hours to index? CPU and disk seem to be basically maxed out, so I don't think tweaking batch size or parallelism would help much.

    NOTE: tail -n +700000
    NOTE: could filter line size: awk 'length($0) < 16384'

Had some hardware (?) issue and had to restart:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_{00..06}.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | pv -l \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
        2> /tmp/error.txt 1> /tmp/output.txt
    => 150M 69:00:35 [ 604 /s]
    => green open scholar_fulltext_v01 2KrkdhuhRDa6SdNC36XR0A 12 0 150232272 130 1.3tb 1.3tb
    => Filesystem      Size  Used Avail Use% Mounted on
    => /dev/vda1       3.5T  1.4T  2.0T  42% /

    ssh aitio.us.archive.org cat /bigger/scholar_old/sim_intermediate.2020-07-23.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
        2> /tmp/error.txt 1> /tmp/output.txt
    => 2020/08/16 21:51:14 1895778 docs in 2h22m55.61416094s at 221.066 docs/s with 4 workers
    => green open scholar_fulltext_v01 2KrkdhuhRDa6SdNC36XR0A 12 0 152090351 26071 1.3tb 1.3tb
    => Filesystem      Size  Used Avail Use% Mounted on
    => /dev/vda1       3.5T  1.4T  2.0T  42% /

Stopped elasticsearch, ran `sync`, and restarted, to ensure the index is fully flushed to disk. Some warm-up queries: "*", "blood", "to be or not to be".
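For the record, the two guards this run needed (skipping enormous docs, and not emitting values that overflow `short`-typed fields like `biblio.issue_int`) amount to something like the following hedged sketch; names and thresholds are illustrative, not the actual patch:

    from typing import Optional

    SHORT_MAX = 32767        # ES `short` fields (eg, biblio.issue_int) reject larger values
    MAX_DOC_CHARS = 16384    # illustrative line-length cutoff, in the spirit of the awk note above

    def clean_small_int(value: Optional[int]) -> Optional[int]:
        """Drop values that would fail a `short` mapping (eg, issue_int=48844)."""
        if value is None or not (0 <= value <= SHORT_MAX):
            return None
        return value

    def should_skip_line(json_line: str) -> bool:
        """Skip transform/indexing of very large documents instead of crashing the bulk indexer."""
        return len(json_line) > MAX_DOC_CHARS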
## 2020-12-30 Simple Release Batch

Hopefully no special cases in this iteration!

    mkdir -p /grande/scholar/2020-12-30/
    cd /grande/scholar/2020-12-30/
    zcat /fast/download/release_export_expanded.2020-12-30.json.gz | split --lines 25000000 - release_export_expanded.split_ -d --additional-suffix .json

    export TMPDIR=/sandcrawler-db/tmp
    for SHARD in {00..06}; do
        cat /grande/scholar_index/2020-12-30/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz > /grande/scholar_index/2020-12-30/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done

Continuing 2021-01-16, on the new focal elasticsearch 7.10 cluster:

    # commit: e5a5318829e1f3a08a2e0dbc252d839cc6f5e8f0
    http put ":9200/scholar_fulltext_v01?include_type_name=true" < schema/scholar_fulltext.v01.json
    http put ":9200/scholar_fulltext_v01/_settings" index.routing.allocation.include._name=wbgrp-svc500

    # start with a single shard (00)
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_00.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | pv -l \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
        2> /tmp/error.txt 1> /tmp/output.txt

Got an error:

    parallel: Error: Output is incomplete. Cannot append to buffer file in /tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
    Warning: unable to close filehandle properly: No space left on device during global destruction.

So added `--compress` and the `--tmpdir` (which needed to be created):

    # run other shards
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_{01..06}.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | pv -l \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
        2> /tmp/error.txt 1> /tmp/output.txt

## 2021-06-06 Simple Iteration

Some new paths, more parallelism, and more conservative file naming/handling, but otherwise not much changed from the 2020-12-30 run above.

    export JOBDIR=/kubwa/scholar/2021-06-03
    mkdir -p $JOBDIR
    cd $JOBDIR
    zcat /fast/release_export_expanded.json.gz | split --lines 8000000 - release_export_expanded.split_ -d --additional-suffix .json

    cd /fast/fatcat-scholar
    pipenv shell
    export TMPDIR=/sandcrawler-db/tmp

    # transform
    set -u -o pipefail
    for SHARD in {00..20}; do
        cat $JOBDIR/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz \
            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP \
            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done

    # dump refs
    set -u -o pipefail
    for SHARD in {00..20}; do
        zcat $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz \
            | pv -l \
            | parallel -j8 --linebuffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.transform run_refs \
            | pigz \
            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP \
            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz
    done

Ran into a problem with a single (!) bad TEI-XML document, due to bad text encoding:

    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 40, column 1122

The root cause was an issue in GROBID, which seems to have been fixed in more recent versions of GROBID. Patched to continue, and separately committed the patch to the fatcat-scholar code base. Ran several retries, manually.
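The "patched to continue" change is presumably just catching the parse error and skipping (or stubbing out) that document's TEI-XML rather than aborting the whole shard; a hedged sketch with a hypothetical helper name:

    import xml.etree.ElementTree as ET
    from typing import Optional

    def parse_tei_safely(tei_xml: str) -> Optional[ET.Element]:
        """Return parsed TEI-XML, or None if GROBID produced a malformed document
        (eg, the bad-text-encoding case above), so the pipeline can continue."""
        try:
            return ET.fromstring(tei_xml)
        except ET.ParseError:
            return None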
Upload to petabox:

    export BASENAME=scholar_corpus_bundle_2021-06-03
    for SHARD in {00..20}; do
        ia upload ${BASENAME}_split-${SHARD} $JOBDIR/README.md $JOBDIR/fatcat_scholar_work_fulltext.split_${SHARD}.json.gz -m collection:"scholarly-tdm" --checksum
    done

    ia upload scholar_corpus_refs_2021-06-03 fatcat_scholar_work_fulltext.split_*.refs.json.gz -m collection:"scholarly-tdm" --checksum

### Performance Notes (on 2021-06-06 run)

Recently added crossref refs via a sandcrawler-db postgrest lookup. Still seeing around 40 works per second with a single thread, similar to previous performance, so this is not a significant slowdown.
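The crossref refs lookup goes through PostgREST's HTTP interface; a hedged sketch of what a per-work lookup could look like (host, port, table, and column names here are illustrative, not the actual sandcrawler-db configuration):

    import requests

    # illustrative endpoint; PostgREST filters use the `column=eq.value` convention
    POSTGREST_URL = "http://sandcrawler-db.example.internal:3030"

    def lookup_crossref_refs(doi: str) -> list:
        """Fetch stored crossref reference records for one DOI via PostgREST."""
        resp = requests.get(
            f"{POSTGREST_URL}/crossref",
            params={"doi": f"eq.{doi.lower()}", "select": "record"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()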