www.degruyter.com

    "/view/books/" didn't have citation_pdf_url, so added custom URL rule.

    Not sure why redirect-loop is happening; it doesn't seem to happen with
    the current live ingest tool?

          domain       |         status          | count 
    -------------------+-------------------------+-------
     www.degruyter.com | redirect-loop           | 22023
     www.degruyter.com | no-pdf-link             |  8773
     www.degruyter.com | no-capture              |  8617
     www.degruyter.com | success                 |   840
     www.degruyter.com | link-loop               |    59
     www.degruyter.com | terminal-bad-status     |    23
     www.degruyter.com | wrong-mimetype          |    12
     www.degruyter.com | spn-error               |     4
     www.degruyter.com | spn2-cdx-lookup-failure |     4
     www.degruyter.com | spn2-error:proxy-error  |     1
     www.degruyter.com | spn-remote-error        |     1
     www.degruyter.com | gateway-timeout         |     1
     www.degruyter.com | petabox-error           |     1
    (13 rows)

www.frontiersin.org

    no-pdf-link

    Seems to ingest fine live? Files are served from "*.blob.core.windows.net".
    No fix, just re-ingest.

           domain        |         status          | count 
    ---------------------+-------------------------+-------
     www.frontiersin.org | no-pdf-link             | 17503
     www.frontiersin.org | terminal-bad-status     |  6696
     www.frontiersin.org | wayback-error           |   203
     www.frontiersin.org | no-capture              |    20
     www.frontiersin.org | spn-error               |     6
     www.frontiersin.org | gateway-timeout         |     3
     www.frontiersin.org | wrong-mimetype          |     3
     www.frontiersin.org | spn2-cdx-lookup-failure |     2
     www.frontiersin.org | spn2-error:job-failed   |     2
     www.frontiersin.org | spn-remote-error        |     1
     www.frontiersin.org | cdx-error               |     1
    (11 rows)

www.mdpi.com

    terminal-bad-status

    Seems to ingest fine live? No fix, just re-ingest.

        domain    |         status          | count 
    --------------+-------------------------+-------
     www.mdpi.com | terminal-bad-status     | 13866
     www.mdpi.com | wrong-mimetype          |  2693
     www.mdpi.com | wayback-error           |   513
     www.mdpi.com | redirect-loop           |   505
     www.mdpi.com | success                 |   436
     www.mdpi.com | no-capture              |   214
     www.mdpi.com | no-pdf-link             |    43
     www.mdpi.com | spn2-cdx-lookup-failure |    34
     www.mdpi.com | gateway-timeout         |     3
     www.mdpi.com | petabox-error           |     2
    (10 rows)

www.ahajournals.org         | no-pdf-link         |   5727

    SELECT domain, status, COUNT((domain, status))
        FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
        WHERE t1.domain = 'www.ahajournals.org'
        GROUP BY domain, status
        ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
        WHERE terminal_url LIKE '%www.ahajournals.org%'
            AND status = 'no-pdf-link'
        ORDER BY updated DESC
        LIMIT 10;

           domain        |     status     | count 
    ---------------------+----------------+-------
     www.ahajournals.org | no-pdf-link    |  5738
     www.ahajournals.org | wrong-mimetype |    84
    (2 rows)


     pdf         | https://doi.org/10.1161/circ.110.19.2977     | 2020-02-23 00:28:55.256296+00 | f   | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 |                  200 | 
     pdf         | https://doi.org/10.1161/str.49.suppl_1.tp403 | 2020-02-23 00:27:34.950059+00 | f   | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 |                  200 | 
     pdf         | https://doi.org/10.1161/str.49.suppl_1.tp168 | 2020-02-23 00:25:54.611271+00 | f   | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 |                  200 | 
     pdf         | https://doi.org/10.1161/jaha.119.012131      | 2020-02-23 00:24:44.244511+00 | f   | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 |                  200 | 

    Ah, the ol' annoying 'cookieAbsent'. Works with live SPNv2 via soft-404
    detection, but that status wasn't coming through, and needed custom
    pdf-link detection.

    FIXED: added pdf-link detection
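
    Rough sketch of spotting these rows, assuming the only signal is the
    terminal URL path seen above ("/action/cookieAbsent"). The function name
    is hypothetical and this is not the actual fix, just a way to flag
    crawls that ended on the cookie interstitial:

```python
from urllib.parse import urlsplit


def is_cookie_absent(terminal_url: str) -> bool:
    """Return True if the crawl ended on the 'cookieAbsent' interstitial.

    Only the URL path is checked; the HTTP status is no help here, since
    these pages come back as 200.
    """
    path = urlsplit(terminal_url).path
    return path.rstrip("/").endswith("/action/cookieAbsent")


if __name__ == "__main__":
    # First URL is from the rows above; the second is a made-up landing page.
    print(is_cookie_absent("https://www.ahajournals.org/action/cookieAbsent"))
    print(is_cookie_absent("https://www.ahajournals.org/doi/10.1161/example"))
```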

ehp.niehs.nih.gov           | no-pdf-link         |   5772

    Simple custom URL format, but are they also blocking?

    SELECT domain, status, COUNT((domain, status))
        FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
        WHERE t1.domain = 'ehp.niehs.nih.gov'
        GROUP BY domain, status
        ORDER BY COUNT DESC;

          domain       |     status     | count 
    -------------------+----------------+-------
     ehp.niehs.nih.gov | no-pdf-link    |  5791
     ehp.niehs.nih.gov | wrong-mimetype |    11
    (2 rows)

    FIXED: mostly just slow, custom URL seems to work
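
    For reference, the per-domain status query used throughout these notes
    can be mirrored offline. A minimal sketch, assuming rows have already
    been exported as (status, terminal_url) pairs; the helper names and the
    sample rows are made up, only the regex is taken from the SQL:

```python
import re
from collections import Counter

# Same pattern as the SQL substring(): a scheme, "://", then capture the host.
DOMAIN_RE = re.compile(r"[^/]+://([^/]*)")


def domain_of(terminal_url):
    """Extract the host part of a URL, or None if there is no match."""
    match = DOMAIN_RE.search(terminal_url or "")
    return match.group(1) if match else None


def status_counts(rows, domain):
    """Count statuses for one domain, most frequent first.

    rows: iterable of (status, terminal_url) pairs.
    """
    counts = Counter(status for status, url in rows if domain_of(url) == domain)
    return counts.most_common()


if __name__ == "__main__":
    # Illustrative rows only; URLs borrowed from elsewhere in these notes.
    rows = [
        ("no-pdf-link", "https://www.ahajournals.org/action/cookieAbsent"),
        ("no-pdf-link", "https://www.ahajournals.org/action/cookieAbsent"),
        ("wrong-mimetype", "https://www.ahajournals.org/action/cookieAbsent"),
        ("no-pdf-link", "https://www.cogentoa.com/article/10.1080/23311932.2015.1022632"),
        ("no-capture", None),
    ]
    for status, count in status_counts(rows, "www.ahajournals.org"):
        print(f"{status:20} {count}")
```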

journals.tsu.ru             | no-pdf-link         |   4404

    SELECT domain, status, COUNT((domain, status))
        FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
        WHERE t1.domain = 'journals.tsu.ru'
        GROUP BY domain, status
        ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
        WHERE terminal_url LIKE '%journals.tsu.ru%'
            AND status = 'no-pdf-link'
        ORDER BY updated DESC
        LIMIT 10;

         domain      |     status     | count 
    -----------------+----------------+-------
     journals.tsu.ru | no-pdf-link    |  4409
     journals.tsu.ru | success        |     1
     journals.tsu.ru | wrong-mimetype |     1
    (3 rows)


    pdf         | https://doi.org/10.17223/18572685/57/3   | 2020-02-23 00:45:49.003593+00 | f   | no-pdf-link | http://journals.tsu.ru/rusin/&journal_page=archive&id=1907&article_id=42847      | 20200213132322 |                  200 | 
    pdf         | https://doi.org/10.17223/17267080/71/4   | 2020-02-23 00:31:25.715416+00 | f   | no-pdf-link | http://journals.tsu.ru/psychology/&journal_page=archive&id=1815&article_id=40405 | 20200211151825 |                  200 | 
    pdf         | https://doi.org/10.17223/15617793/399/33 | 2020-02-23 00:29:45.414865+00 | f   | no-pdf-link | http://journals.tsu.ru/vestnik/&journal_page=archive&id=1322&article_id=24619    | 20200208152715 |                  200 | 
    pdf         | https://doi.org/10.17223/19988613/58/15  | 2020-02-23 00:25:24.402838+00 | f   | no-pdf-link | http://journals.tsu.ru//history/&journal_page=archive&id=1827&article_id=40501   | 20200212200320 |                  200 | 

    FIXED: simple new custom PDF link pattern

www.cogentoa.com            | no-pdf-link         |   4282

    SELECT domain, status, COUNT((domain, status))
        FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
        WHERE t1.domain = 'www.cogentoa.com'
        GROUP BY domain, status
        ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
        WHERE terminal_url LIKE '%www.cogentoa.com%'
            AND status = 'no-pdf-link'
        ORDER BY updated DESC
        LIMIT 10;

          domain      |   status    | count 
    ------------------+-------------+-------
     www.cogentoa.com | no-pdf-link |  4296
    (1 row)

     pdf         | https://doi.org/10.1080/23311932.2015.1022632 | 2020-02-23 01:06:14.040013+00 | f   | no-pdf-link | https://www.cogentoa.com/article/10.1080/23311932.2015.1022632 | 20200208054228 |                  200 |
     pdf         | https://doi.org/10.1080/23322039.2020.1730079 | 2020-02-23 01:04:53.754117+00 | f   | no-pdf-link | https://www.cogentoa.com/article/10.1080/23322039.2020.1730079 | 20200223010431 |                  200 |
     pdf         | https://doi.org/10.1080/2331186x.2018.1460901 | 2020-02-23 01:04:03.47563+00  | f   | no-pdf-link | https://www.cogentoa.com/article/10.1080/2331186X.2018.1460901 | 20200207200958 |                  200 |
     pdf         | https://doi.org/10.1080/23311975.2017.1412873 | 2020-02-23 01:03:08.063545+00 | f   | no-pdf-link | https://www.cogentoa.com/article/10.1080/23311975.2017.1412873 | 20200209034602 |                  200 |
     pdf         | https://doi.org/10.1080/23311916.2017.1293481 | 2020-02-23 01:02:42.868424+00 | f   | no-pdf-link | https://www.cogentoa.com/article/10.1080/23311916.2017.1293481 | 20200208101623 |                  200 |

    FIXED: simple custom URL-based pattern
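
    A URL-based pattern of this kind could look roughly like the sketch
    below. The ".pdf" suffix is an assumption about the site's URL scheme
    (not confirmed in these notes), and the function name is hypothetical:

```python
def cogentoa_pdf_url(html_url: str):
    """Guess a PDF URL from a cogentoa.com article landing-page URL.

    ASSUMPTION: the PDF lives at the landing-page URL plus ".pdf".
    Returns None for URLs that don't look like article pages.
    """
    if "://www.cogentoa.com/article/" not in html_url:
        return None
    if html_url.endswith(".pdf"):
        # Already a PDF-style URL; nothing to rewrite.
        return None
    return html_url + ".pdf"


if __name__ == "__main__":
    # Landing-page URL taken from the rows above.
    print(cogentoa_pdf_url(
        "https://www.cogentoa.com/article/10.1080/23311932.2015.1022632"
    ))
```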

chemrxiv.org                | no-pdf-link         |   4186

    SELECT domain, status, COUNT((domain, status))
        FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
        WHERE t1.domain = 'chemrxiv.org'
        GROUP BY domain, status
        ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
        WHERE terminal_url LIKE '%chemrxiv.org%'
            AND status = 'no-pdf-link'
        ORDER BY updated DESC
        LIMIT 10;

        domain    |         status          | count
    --------------+-------------------------+-------
     chemrxiv.org | no-pdf-link             |  4202
     chemrxiv.org | wrong-mimetype          |    64
     chemrxiv.org | wayback-error           |    14
     chemrxiv.org | success                 |    12
     chemrxiv.org | terminal-bad-status     |     4
     chemrxiv.org | spn2-cdx-lookup-failure |     1

    pdf         | https://doi.org/10.26434/chemrxiv.9912812.v1  | 2020-02-23 01:08:34.585084+00 | f   | no-pdf-link | https://chemrxiv.org/articles/Proximity_Effect_in_Crystalline_Framework_Materials_Stacking-Induced_Functionality_in_MOFs_and_COFs/9912812/1                                                                     | 20200215072929 |                  200 | 
    pdf         | https://doi.org/10.26434/chemrxiv.7150097     | 2020-02-23 01:05:48.957624+00 | f   | no-pdf-link | https://chemrxiv.org/articles/Systematic_Engineering_of_a_Protein_Nanocage_for_High-Yield_Site-Specific_Modification/7150097                                                                                    | 20200213002430 |                  200 | 
    pdf         | https://doi.org/10.26434/chemrxiv.7833500.v1  | 2020-02-23 00:55:41.013109+00 | f   | no-pdf-link | https://chemrxiv.org/articles/Formation_of_Neutral_Peptide_Aggregates_Studied_by_Mass_Selective_IR_Action_Spectroscopy/7833500/1                                                                                | 20200210131343 |                  200 | 
    pdf         | https://doi.org/10.26434/chemrxiv.8146103     | 2020-02-23 00:52:00.193328+00 | f   | no-pdf-link | https://chemrxiv.org/articles/On-Demand_Guest_Release_from_MOF-5_Sealed_with_Nitrophenylacetic_Acid_Photocapping_Groups/8146103                                                                                 | 20200207215449 |                  200 | 
    pdf         | https://doi.org/10.26434/chemrxiv.10101419    | 2020-02-23 00:46:14.086913+00 | f   | no-pdf-link | https://chemrxiv.org/articles/Biradical_Formation_by_Deprotonation_in_Thiazole-Derivatives_The_Hidden_Nature_of_Dasatinib/10101419                                                                              | 20200214044153 |                  200 | 

    FIXED: complex JSON PDF url extraction; maybe for all figshare?
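
    Sketch of what JSON-based extraction might look like, assuming the
    landing page embeds a JSON blob in a <script> tag. The tag id
    ("app-data") and the key names ("article", "exportPdfDownloadUrl") are
    assumptions for illustration and would need checking against a real
    page, as would whether other figshare sites share the layout:

```python
import json
import re

# ASSUMPTION: the page embeds its state as JSON in <script id="app-data">.
SCRIPT_RE = re.compile(
    r'<script[^>]*\bid="app-data"[^>]*>(.*?)</script>', re.DOTALL
)


def figshare_pdf_url(html: str):
    """Pull a PDF download URL out of embedded JSON, or return None."""
    match = SCRIPT_RE.search(html)
    if not match:
        return None
    try:
        app_data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(app_data, dict):
        return None
    article = app_data.get("article")
    if not isinstance(article, dict):
        return None
    # ASSUMPTION: key name chosen for illustration.
    return article.get("exportPdfDownloadUrl")


if __name__ == "__main__":
    # Synthetic page; the URL is a placeholder, not a real download link.
    page = (
        '<html><script id="app-data" type="text/json">'
        '{"article": {"exportPdfDownloadUrl": "https://example.org/x.pdf"}}'
        "</script></html>"
    )
    print(figshare_pdf_url(page))
```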

TODO:
x many datacite prefixes go to institutional repositories (IRs), but have is_oa:false. We should probably crawl by default based on release_type
    => fatcat branch bnewbold-more-ingest
- re-ingest all degruyter (doi_prefix:10.1515)
    1456169 doi:10.1515\/*
    89942   doi:10.1515\/* is_oa:true
    36350   doi:10.1515\/* in_ia:false is_oa:true
    1290830 publisher:Gruyter
    88944   publisher:Gruyter is_oa:true
    40034   publisher:Gruyter is_oa:true in_ia:false
- re-ingest all frontiersin
    248165  publisher:frontiers
    161996  publisher:frontiers is_oa:true
    36093   publisher:frontiers is_oa:true in_ia:false
    121001  publisher:frontiers in_ia:false
- re-ingest all mdpi
    43114   publisher:mdpi is_oa:true in_ia:false
- re-ingest all ahajournals.org
    132000  doi:10.1161\/*
    6606    doi:10.1161\/* in_ia:false is_oa:true
    81349   publisher:"American Heart Association"
    5986    publisher:"American Heart Association" is_oa:true in_ia:false
- re-ingest all ehp.niehs.nih.gov
    25522   doi:10.1289\/*
    15315   publisher:"Environmental Health Perspectives"
     8779   publisher:"Environmental Health Perspectives" in_ia:false
    12707   container_id:3w6amv3ecja7fa3ext35ndpiky in_ia:false is_oa:true
- re-ingest all journals.tsu.ru
    12232   publisher:"Tomsk State University"
    11668   doi:10.17223\/*
     4861   publisher:"Tomsk State University" in_ia:false is_oa:true
- re-ingest all www.cogentoa.com
    3421898 doi:10.1080\/*
    4602    journal:cogent is_oa:true in_ia:false
    5631    journal:cogent is_oa:true (let's recrawl all from publisher domain)
- re-ingest chemrxiv
    8281    doi:10.26434\/chemrxiv*
    6918    doi:10.26434\/chemrxiv* in_ia:false

Submit all of the above with a limit of 1000 each, then follow up later to
check that they succeeded?