SIM digitization is mostly complete; now looking at starting to (re)index.
Around 2021-12-01, re-ran the build of the issue DB, with filters like
`(pub_type:"Scholarly Journals" OR pub_type:"Historical Journals" OR pub_type:"Law Journals")`:

    ia search 'collection:periodicals collection:sim_microfilm mediatype:collection (pub_type:"Scholarly Journals" OR pub_type:"Historical Journals" OR pub_type:"Law Journals")' -n
    # 7947

    ia search 'collection:periodicals collection:sim_microfilm mediatype:texts !noindex:true (pub_type:"Scholarly Journals" OR pub_type:"Historical Journals" OR pub_type:"Law Journals")' -n
    # 849593

If the filters are relaxed, and all microfilm included, it could be:

    ia search 'collection:periodicals collection:sim_microfilm mediatype:collection' -n
    # 13590

    ia search 'collection:periodicals collection:sim_microfilm mediatype:texts' -n
    # 1870482

    ia search 'collection:periodicals collection:sim_microfilm mediatype:texts (pub_type:"Scholarly Journals" OR pub_type:"Historical Journals" OR pub_type:"Law Journals")' -n
    # 849599

With the tighter filters (initially), looking at:

    select count(*) from sim_pub;
    # 7902

    select count(*) from sim_pub where container_ident is not null;
    # 5973

    select count(*) from sim_pub where issn is not null;
    # 6107

    select pub_type, count(*) from sim_pub where issn is null group by pub_type;
    # pub_type             count(*)
    # Historical Journals  1568
    # Law Journals         11
    # Scholarly Journals   216

    select count(*) from sim_issue;
    # 667073

    select count(*) from sim_issue where release_count is null;
    # 373921

    select count(*) from sim_issue where release_count = 0;
    # 262125

    select sum(release_count) from sim_issue;
    # 534609

    select sum(release_count) from release_counts;
    # 8,231,201
    # (previously: 397,602)

What container types do we have in fatcat?

    zcat container_export.json.gz | rg "sim_pubid" | jq .extra.ia.sim.pub_type | sort | uniq -c | sort -nr

    7135 "Scholarly Journals"
    2572 "Trade Journals"
    1325 "Magazines"
     187 "Law Journals"
     171 "Government Documents"
      21 "Historical Journals"

How many releases are we expecting to match?

    fatcat-cli search containers any_ia_sim:true --count
    # 11965

    fatcat-cli search release in_ia_sim:true --count
    # 22,470,053

    fatcat-cli search release in_ia_sim:true container_id:* --count
    # 22,470,053 (100%)

    fatcat-cli search release in_ia_sim:true container_id:* year:* --count
    # 22,470,053

    fatcat-cli search release in_ia_sim:true container_id:* volume:* --count
    # 20,498,018

    fatcat-cli search release in_ia_sim:true container_id:* year:* volume:* --count
    # 20,498,018

    fatcat-cli search release in_ia_sim:true container_id:* volume:* issue:* --count
    # 7,311,684

    fatcat-cli search release in_ia_sim:true container_id:* volume:* issue:* pages:* --count
    # 7,112,117

    fatcat-cli search release in_ia_sim:true container_id:* volume:* pages:* --count
    # 20,017,423

    fatcat-cli search release 'in_ia_sim:true container_id:* !issue:*' --count
    # 14,737,140

    fatcat-cli search release 'in_ia_sim:true container_id:* !issue:* doi:* doi_registrar:crossref' --count
    # 14,620,485

    fatcat-cli search release 'in_ia_sim:true container_id:* !issue:* doi:* doi_registrar:crossref in_ia:false' --count
    # 12,320,127

    fatcat-cli search scholar doc_type:work access_type:ia_sim --count
    # 66162

    fatcat-cli search scholar doc_type:sim_page --count
    # 10,448,586

The large majority of releases which *might* get included are not. Missing an
issue number is the single largest category; almost all of those records have
Crossref DOIs, and the large majority are not otherwise preserved in IA.

One conclusion from this is that updating fatcat with additional
volume/issue/page metadata (where available) could be valuable. Or, in the
short term, copying this information over from Crossref metadata in the
fatcat-scholar pipeline would make sense.
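A minimal sketch of that short-term backfill idea, assuming simplified record
shapes: `release` here uses fatcat-style biblio field names, `crossref` is a
raw Crossref work message, and the function name is hypothetical (not existing
pipeline code):

    def backfill_biblio_from_crossref(release: dict, crossref: dict) -> dict:
        """Copy volume/issue/pages from a Crossref work record onto a
        release, but only where the release is missing that field."""
        # release field name -> Crossref work message field name
        field_map = {
            "volume": "volume",
            "issue": "issue",
            "pages": "page",  # Crossref calls the page range "page"
        }
        for release_field, crossref_field in field_map.items():
            if not release.get(release_field) and crossref.get(crossref_field):
                release[release_field] = crossref[crossref_field]
        return release

Run during document building, something like this would let volume/issue-based
SIM lookups succeed for releases whose fatcat records lack those fields.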
There is still an open question of why so many of the ~7 million fatcat
releases which *should* match to SIM are failing to. Is it because they have
not been processed? Or are the issues getting filtered out?

----

Queries against the chocula DB, to gauge a possible breakdown by SIM `pub_type`:

    SELECT * FROM directory WHERE slug = 'sim' AND extra LIKE '%"Scholarly Journals"%' LIMIT 5;

    SELECT journal.issnl, journal.release_count
    FROM directory
    LEFT JOIN journal ON journal.issnl = directory.issnl
    WHERE slug = 'sim'
      AND extra LIKE '%"Scholarly Journals"%'
      AND journal.issnl IS NOT NULL
    LIMIT 5;

    SELECT SUM(journal.release_count)
    FROM directory
    LEFT JOIN journal ON journal.issnl = directory.issnl
    WHERE slug = 'sim' AND journal.issnl IS NOT NULL;
    # 40,579,513

    SELECT SUM(journal.release_count - journal.preserved_count)
    FROM directory
    LEFT JOIN journal ON journal.issnl = directory.issnl
    WHERE slug = 'sim' AND journal.issnl IS NOT NULL;
    # 2,692,023

    SELECT SUM(journal.release_count) FROM directory LEFT JOIN journal ON journal.issnl = directory.issnl WHERE slug = 'sim' AND extra LIKE '%"Scholarly Journals"%' AND journal.issnl IS NOT NULL;
    # 39,020,833

    SELECT SUM(journal.release_count) FROM directory LEFT JOIN journal ON journal.issnl = directory.issnl WHERE slug = 'sim' AND extra LIKE '%"Trade Journals"%' AND journal.issnl IS NOT NULL;
    # 755,367

    SELECT SUM(journal.release_count) FROM directory LEFT JOIN journal ON journal.issnl = directory.issnl WHERE slug = 'sim' AND extra LIKE '%"Magazines"%' AND journal.issnl IS NOT NULL;
    # 487,197

    SELECT SUM(journal.release_count) FROM directory LEFT JOIN journal ON journal.issnl = directory.issnl WHERE slug = 'sim' AND extra LIKE '%"Law Journals"%' AND journal.issnl IS NOT NULL;
    # 78,519

    SELECT SUM(journal.release_count) FROM directory LEFT JOIN journal ON journal.issnl = directory.issnl WHERE slug = 'sim' AND extra LIKE '%"Government Documents"%' AND journal.issnl IS NOT NULL;
    # 30,786

    SELECT SUM(journal.release_count) FROM directory LEFT JOIN journal ON journal.issnl = directory.issnl WHERE slug = 'sim' AND extra LIKE '%"Historical Journals"%' AND journal.issnl IS NOT NULL;
    # 206,811

To summarize the counts, remembering that these apply to the entire runs of
journals with any coverage in the SIM collection (not matched by
year/volume/issue):

    Scholarly:      39,020,833
    Trade:             755,367
    Magazines:         487,197
    Law:                78,519
    Government:         30,786
    Historical:        206,811

    Total:          40,579,513
    "Unpreserved":   2,692,023

The TL;DR is that almost everything is "Scholarly Journals", with very little
"Law" or "Historical" coverage. The six per-type queries above differ only in
their LIKE pattern; a scripted version of the breakdown is sketched below.
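A minimal sketch of that scripted breakdown, assuming a local `chocula.sqlite`
file (the path is hypothetical; table and column names are taken from the
queries above):

    import sqlite3

    # hypothetical path to a local copy of the chocula database
    db = sqlite3.connect("chocula.sqlite")

    PUB_TYPES = [
        "Scholarly Journals", "Trade Journals", "Magazines",
        "Law Journals", "Government Documents", "Historical Journals",
    ]

    for pub_type in PUB_TYPES:
        (total,) = db.execute(
            """
            SELECT SUM(journal.release_count)
            FROM directory
            LEFT JOIN journal ON journal.issnl = directory.issnl
            WHERE directory.slug = 'sim'
              AND directory.extra LIKE ?
              AND journal.issnl IS NOT NULL
            """,
            (f'%"{pub_type}"%',),
        ).fetchone()
        print(f"{pub_type}: {total or 0:,}")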
----

Experimented with a handful of examples, and it seems like newer-processed
(tesseract) SIM items do work with the existing pipeline (eg, the djvu files
still work).

----

How many pages exist now, and how many do we expect to index?

    SELECT SUM(last_page - first_page) FROM sim_issue;
    # 83,504,593

    SELECT SUM(last_page - first_page) FROM sim_issue WHERE (release_count IS NULL OR release_count < 5);
    # 75,907,903

    fatcat-cli search scholar doc_type:sim_page --count
    # 10,448,586

A large increase over current coverage, but not too wild.

----

Generate issues metadata:

    # in pipenv shell
    python -m fatcat_scholar.sim_pipeline run_print_issues \
        | shuf \
        | parallel -j16 --colsep "\t" python -m fatcat_scholar.sim_pipeline run_fetch_issue {1} {2} \
        | pv -l \
        | pigz \
        > /kubwa/scholar/2021-12-01/sim_intermediate.2021-12-01.json.gz
    # 20.2M 13:44:18 [ 408 /s]

If this runs at ~300 pages/sec (aggregate), extracting all ~75.9 million
candidate pages would take about 253,000 seconds, or roughly 3 days.

Huh, that got only a fraction of the pages expected. There were lots of errors
on individual issues, but that should be fine/expected. Let's try a sub-sample
of 1000 issues, after also adding some print statements:

    python -m fatcat_scholar.sim_pipeline run_print_issues \
        | shuf -n1000 \
        | parallel -j16 --colsep "\t" python -m fatcat_scholar.sim_pipeline run_fetch_issue {1} {2} \
        | pv -l \
        | pigz \
        > /kubwa/scholar/2021-12-01/sim_intermediate.2021-12-01.1k_issues.json.gz
    # 76.1k 0:03:34 [ 354 /s]

    # issue without leaf numbers: sim_journal-of-organizational-and-end-user-computing_1989-1992_1-4_cumulative-index
    # issue without leaf numbers: sim_review-of-english-studies_1965_16_contents_0

How many issues are attempted?

    python -m fatcat_scholar.sim_pipeline run_print_issues | wc -l
    # 268,707

    SELECT COUNT(*) FROM sim_issue
    LEFT JOIN sim_pub ON sim_issue.sim_pubid = sim_pub.sim_pubid
    WHERE sim_issue.release_count < 5;
    # 268,707

    SELECT COUNT(*) FROM sim_issue
    LEFT JOIN sim_pub ON sim_issue.sim_pubid = sim_pub.sim_pubid
    WHERE sim_issue.release_count < 5 OR sim_issue.release_count IS NULL;
    # 642,015

    SELECT COUNT(*) FROM sim_issue
    LEFT JOIN sim_pub ON sim_issue.sim_pubid = sim_pub.sim_pubid
    WHERE sim_issue.release_count < 100 OR sim_issue.release_count IS NULL;
    # 667,023

Not including the `OR sim_issue.release_count IS NULL` clause was a bug. Also,
issues with these additional item-name suffixes should be skipped:

- `_contents_0`
- `_cumulative-index`
- `_index-contents`

With those changes:

    python -m fatcat_scholar.sim_pipeline run_print_issues | wc -l
    # 627,304

Ok, start the dump again:

    mv /kubwa/scholar/2021-12-01/sim_intermediate.2021-12-01.json.gz \
        /kubwa/scholar/2021-12-01/sim_intermediate.2021-12-01.partial.json.gz

    python -m fatcat_scholar.sim_pipeline run_print_issues \
        | shuf \
        | parallel -j16 --colsep "\t" python -m fatcat_scholar.sim_pipeline run_fetch_issue {1} {2} \
        | pv -l \
        | pigz \
        > /kubwa/scholar/2021-12-01/sim_intermediate.2021-12-01.json.gz
    # 43.5M 34:20:45 [ 351 /s]

Huh. Why is this still only 43.5 million pages out of the ~76 million
expected? Because of blank pages, or something else? Should add counters to
the indexing process, and write out a per-issue log of counts and status (a
sketch follows below). But good progress for now, I guess.
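A minimal sketch of what those per-issue counters and status log might look
like (function and field names here are hypothetical, not the current
`sim_pipeline` code):

    import json
    import sys
    from collections import Counter

    def fetch_issue_pages(issue_item: str):
        """Placeholder for the actual per-issue page extraction."""
        return iter(())

    def run_fetch_issue_with_counts(sim_pubid: str, issue_item: str) -> None:
        """Emit page JSON on stdout, plus a one-line per-issue status log
        on stderr, so stdout stays clean for the pigz pipeline."""
        counts: Counter = Counter()
        status = "success"
        try:
            for page in fetch_issue_pages(issue_item):
                if not page.get("text"):
                    counts["skip-blank-page"] += 1
                    continue
                counts["pages-output"] += 1
                print(json.dumps(page))
        except Exception as e:
            status = f"error: {e}"
        log = {
            "sim_pubid": sim_pubid,
            "issue_item": issue_item,
            "status": status,
            "counts": dict(counts),
        }
        print(json.dumps(log), file=sys.stderr)

Tallying the stderr lines by status and counter would show whether the missing
~32 million pages are blank pages, skipped issues, or hard errors.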