author     Bryan Newbold <bnewbold@archive.org>   2020-03-03 10:24:43 -0800
committer  Bryan Newbold <bnewbold@archive.org>   2020-03-03 10:24:43 -0800
commit     720a45a1d9eea673e0f10d3a7dac0ca85fb913d3 (patch)
tree       8b974774d7d8efeb85446911db73099fecbb667d /notes
parent     46cd3516637fccd388bac6e0357d9ce7e3c7d8f1 (diff)
update (and move) ingest notes
Diffstat (limited to 'notes')
 -rw-r--r--  notes/ingest/2020-02-04_ingest_backfills.md (renamed from notes/tasks/2020-02-04_ingest_backfills.md)  |   0
 -rw-r--r--  notes/ingest/2020-02-14_unpaywall_ingest.md                                                            |  60
 -rw-r--r--  notes/ingest/2020-02-18_ingest_backfills.md (renamed from notes/tasks/2020-02-18_ingest_backfills.md)  |   0
 -rw-r--r--  notes/ingest/2020-02-21_ingest_backfills.md (renamed from notes/tasks/2020-02-21_ingest_backfills.md)  |   0
 -rw-r--r--  notes/ingest/2020-02-22_fixed_domain.txt                                                               | 246
 -rw-r--r--  notes/ingest/2020-03-02_ingests.txt                                                                    | 174
 6 files changed, 480 insertions(+), 0 deletions(-)
diff --git a/notes/tasks/2020-02-04_ingest_backfills.md b/notes/ingest/2020-02-04_ingest_backfills.md
index 73a42ef..73a42ef 100644
--- a/notes/tasks/2020-02-04_ingest_backfills.md
+++ b/notes/ingest/2020-02-04_ingest_backfills.md
diff --git a/notes/ingest/2020-02-14_unpaywall_ingest.md b/notes/ingest/2020-02-14_unpaywall_ingest.md
new file mode 100644
index 0000000..df4795b
--- /dev/null
+++ b/notes/ingest/2020-02-14_unpaywall_ingest.md
@@ -0,0 +1,60 @@
+
+## Stats and Things
+
+    zcat unpaywall_snapshot_2019-11-22T074546.jsonl.gz | jq .oa_locations[].url_for_pdf -r | rg -v ^null | cut -f3 -d/ | sort | uniq -c | sort -nr > top_domains.txt
+
+## Transform
+
+    zcat unpaywall_snapshot_2019-11-22T074546.jsonl.gz | ./unpaywall2ingestrequest.py - | pv -l > /dev/null
+    => 22M 1:31:25 [ 4k/s]
+
+Shard it into batches of roughly 1 million (all are 1098096 +/- 1):
+
+    zcat unpaywall_snapshot_2019-11-22.ingest_request.shuf.json.gz | split -n r/20 -d - unpaywall_snapshot_2019-11-22.ingest_request.split_ --additional-suffix=.json
+
+Test ingest:
+
+    head -n200 unpaywall_snapshot_2019-11-22.ingest_request.split_00.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Add a single batch like:
+
+    cat unpaywall_snapshot_2019-11-22.ingest_request.split_00.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+## Progress/Status
+
+There are 21,961,928 lines total, in batches of 1,098,097.
+
+    unpaywall_snapshot_2019-11-22.ingest_request.split_00.json
+    => 2020-02-24 21:05 local: 1,097,523   ~22 results/sec (combined)
+    => 2020-02-25 10:35 local: 0
+    unpaywall_snapshot_2019-11-22.ingest_request.split_01.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_02.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_03.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_04.json
+    => 2020-02-25 11:26 local: 4,388,997
+    => 2020-02-25 10:14 local: 1,115,821
+    => 2020-02-26 16:00 local: 265,116
+    unpaywall_snapshot_2019-11-22.ingest_request.split_05.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_06.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_07.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_08.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_09.json
+    => 2020-02-26 16:01 local: 6,843,708
+    => 2020-02-26 16:31 local: 4,839,618
+    => 2020-02-28 10:30 local: 2,619,319
+    unpaywall_snapshot_2019-11-22.ingest_request.split_10.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_11.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_12.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_13.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_14.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_15.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_16.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_17.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_18.json
+    unpaywall_snapshot_2019-11-22.ingest_request.split_19.json
+    => 2020-02-28 10:50 local: 13,551,887
+    => 2020-03-01 23:38 local: 4,521,076
+    => 2020-03-02 10:45 local: 2,827,071
+    => 2020-03-02 21:06 local: 1,257,176
+    added about 500k bulk re-ingest to try and work around cdx errors
+    => 2020-03-02 21:30 local: 1,733,654
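The "Transform" step in the notes above pipes each unpaywall snapshot line through ./unpaywall2ingestrequest.py. As a rough illustration of that step (this is not the actual script; the ingest-request field names used here are assumptions), each OA location with a url_for_pdf maps to one request line:

    #!/usr/bin/env python3
    # Hypothetical sketch of an unpaywall -> ingest-request transform. Field
    # names (ingest_type, base_url, link_source, ext_ids) are assumptions,
    # not the real sandcrawler schema.
    import json
    import sys

    def requests_from_unpaywall(f):
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            doi = record.get("doi")
            for loc in record.get("oa_locations") or []:
                pdf_url = loc.get("url_for_pdf")
                if not pdf_url:
                    continue
                yield {
                    "ingest_type": "pdf",
                    "base_url": pdf_url,
                    "link_source": "unpaywall",
                    "link_source_id": doi,
                    "ext_ids": {"doi": doi},
                }

    if __name__ == "__main__":
        for req in requests_from_unpaywall(sys.stdin):
            print(json.dumps(req, sort_keys=True))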
diff --git a/notes/tasks/2020-02-18_ingest_backfills.md b/notes/ingest/2020-02-18_ingest_backfills.md
index 1ab18f4..1ab18f4 100644
--- a/notes/tasks/2020-02-18_ingest_backfills.md
+++ b/notes/ingest/2020-02-18_ingest_backfills.md
diff --git a/notes/tasks/2020-02-21_ingest_backfills.md b/notes/ingest/2020-02-21_ingest_backfills.md
index 48df910..48df910 100644
--- a/notes/tasks/2020-02-21_ingest_backfills.md
+++ b/notes/ingest/2020-02-21_ingest_backfills.md
diff --git a/notes/ingest/2020-02-22_fixed_domain.txt b/notes/ingest/2020-02-22_fixed_domain.txt
new file mode 100644
index 0000000..a60de42
--- /dev/null
+++ b/notes/ingest/2020-02-22_fixed_domain.txt
@@ -0,0 +1,246 @@
+
+www.degruyter.com
+
+    "/view/books/" didn't have citation_pdf_url, so added a custom URL rule.
+
+    Not sure why the redirect-loop is happening, but it doesn't happen with
+    the current live ingest tool?
+
+          domain       |         status          | count
+    -------------------+-------------------------+-------
+     www.degruyter.com | redirect-loop           | 22023
+     www.degruyter.com | no-pdf-link             |  8773
+     www.degruyter.com | no-capture              |  8617
+     www.degruyter.com | success                 |   840
+     www.degruyter.com | link-loop               |    59
+     www.degruyter.com | terminal-bad-status     |    23
+     www.degruyter.com | wrong-mimetype          |    12
+     www.degruyter.com | spn-error               |     4
+     www.degruyter.com | spn2-cdx-lookup-failure |     4
+     www.degruyter.com | spn2-error:proxy-error  |     1
+     www.degruyter.com | spn-remote-error        |     1
+     www.degruyter.com | gateway-timeout         |     1
+     www.degruyter.com | petabox-error           |     1
+    (13 rows)
+
+www.frontiersin.org
+
+    no pdf link
+
+    Seems to ingest fine with live ingest? Files are served from
+    "*.blob.core.windows.net". No fix, just re-ingest.
+
+           domain         |         status          | count
+    ---------------------+-------------------------+-------
+     www.frontiersin.org | no-pdf-link             | 17503
+     www.frontiersin.org | terminal-bad-status     |  6696
+     www.frontiersin.org | wayback-error           |   203
+     www.frontiersin.org | no-capture              |    20
+     www.frontiersin.org | spn-error               |     6
+     www.frontiersin.org | gateway-timeout         |     3
+     www.frontiersin.org | wrong-mimetype          |     3
+     www.frontiersin.org | spn2-cdx-lookup-failure |     2
+     www.frontiersin.org | spn2-error:job-failed   |     2
+     www.frontiersin.org | spn-remote-error        |     1
+     www.frontiersin.org | cdx-error               |     1
+    (11 rows)
+
+www.mdpi.com
+
+    terminal-bad-status
+
+    Seems to ingest fine live? No fix, just re-ingest.
+
+        domain    |         status          | count
+    --------------+-------------------------+-------
+     www.mdpi.com | terminal-bad-status     | 13866
+     www.mdpi.com | wrong-mimetype          |  2693
+     www.mdpi.com | wayback-error           |   513
+     www.mdpi.com | redirect-loop           |   505
+     www.mdpi.com | success                 |   436
+     www.mdpi.com | no-capture              |   214
+     www.mdpi.com | no-pdf-link             |    43
+     www.mdpi.com | spn2-cdx-lookup-failure |    34
+     www.mdpi.com | gateway-timeout         |     3
+     www.mdpi.com | petabox-error           |     2
+    (10 rows)
+
+www.ahajournals.org | no-pdf-link | 5727
+
+    SELECT domain, status, COUNT((domain, status))
+    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
+    WHERE t1.domain = 'www.ahajournals.org'
+    GROUP BY domain, status
+    ORDER BY COUNT DESC;
+
+    SELECT * FROM ingest_file_result
+    WHERE terminal_url LIKE '%www.ahajournals.org%'
+        AND status = 'no-pdf-link'
+    ORDER BY updated DESC
+    LIMIT 10;
+
+           domain         |     status     | count
+    ---------------------+----------------+-------
+     www.ahajournals.org | no-pdf-link    |  5738
+     www.ahajournals.org | wrong-mimetype |    84
+    (2 rows)
+
+    pdf | https://doi.org/10.1161/circ.110.19.2977     | 2020-02-23 00:28:55.256296+00 | f | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 | 200 |
+    pdf | https://doi.org/10.1161/str.49.suppl_1.tp403 | 2020-02-23 00:27:34.950059+00 | f | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 | 200 |
+    pdf | https://doi.org/10.1161/str.49.suppl_1.tp168 | 2020-02-23 00:25:54.611271+00 | f | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 | 200 |
+    pdf | https://doi.org/10.1161/jaha.119.012131      | 2020-02-23 00:24:44.244511+00 | f | no-pdf-link | https://www.ahajournals.org/action/cookieAbsent | 20200217122952 | 200 |
+
+    Ah, the ol' annoying 'cookieAbsent'. Works with live SPNv2 via soft-404
+    detection, but that status wasn't coming through, and needed custom
+    pdf-link detection.
+
+    FIXED: added pdf-link detection
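For the 'cookieAbsent' pattern above, the fix presumably amounts to treating a terminal URL ending in /action/cookieAbsent as a blocked/soft-404 page rather than a real landing page. A minimal sketch of that kind of check, as an illustration only (the function and status names are assumptions, not the actual sandcrawler code):

    # Illustrative sketch: flag Atypon-style "cookieAbsent" terminal pages as
    # blocked instead of reporting no-pdf-link. Names here are assumptions.
    from urllib.parse import urlparse

    BLOCKED_PATH_SUFFIXES = (
        "/action/cookieAbsent",   # ahajournals.org and similar Atypon platforms
    )

    def classify_terminal_url(terminal_url: str) -> str:
        """Return a coarse status hint based only on the terminal URL."""
        path = urlparse(terminal_url).path
        if any(path.endswith(suffix) for suffix in BLOCKED_PATH_SUFFIXES):
            return "blocked-cookie"
        return "unknown"

    assert classify_terminal_url(
        "https://www.ahajournals.org/action/cookieAbsent") == "blocked-cookie"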
+
+ehp.niehs.nih.gov | no-pdf-link | 5772
+
+    simple custom URL format. but are they also blocking?
+
+    SELECT domain, status, COUNT((domain, status))
+    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
+    WHERE t1.domain = 'ehp.niehs.nih.gov'
+    GROUP BY domain, status
+    ORDER BY COUNT DESC;
+
+         domain        |     status     | count
+    -------------------+----------------+-------
+     ehp.niehs.nih.gov | no-pdf-link    |  5791
+     ehp.niehs.nih.gov | wrong-mimetype |    11
+    (2 rows)
+
+    FIXED: mostly just slow, custom URL seems to work
+
+journals.tsu.ru | no-pdf-link | 4404
+
+    SELECT domain, status, COUNT((domain, status))
+    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
+    WHERE t1.domain = 'journals.tsu.ru'
+    GROUP BY domain, status
+    ORDER BY COUNT DESC;
+
+    SELECT * FROM ingest_file_result
+    WHERE terminal_url LIKE '%journals.tsu.ru%'
+        AND status = 'no-pdf-link'
+    ORDER BY updated DESC
+    LIMIT 10;
+
+         domain      |     status     | count
+    -----------------+----------------+-------
+     journals.tsu.ru | no-pdf-link    |  4409
+     journals.tsu.ru | success        |     1
+     journals.tsu.ru | wrong-mimetype |     1
+    (3 rows)
+
+    pdf | https://doi.org/10.17223/18572685/57/3   | 2020-02-23 00:45:49.003593+00 | f | no-pdf-link | http://journals.tsu.ru/rusin/&journal_page=archive&id=1907&article_id=42847      | 20200213132322 | 200 |
+    pdf | https://doi.org/10.17223/17267080/71/4   | 2020-02-23 00:31:25.715416+00 | f | no-pdf-link | http://journals.tsu.ru/psychology/&journal_page=archive&id=1815&article_id=40405 | 20200211151825 | 200 |
+    pdf | https://doi.org/10.17223/15617793/399/33 | 2020-02-23 00:29:45.414865+00 | f | no-pdf-link | http://journals.tsu.ru/vestnik/&journal_page=archive&id=1322&article_id=24619    | 20200208152715 | 200 |
+    pdf | https://doi.org/10.17223/19988613/58/15  | 2020-02-23 00:25:24.402838+00 | f | no-pdf-link | http://journals.tsu.ru//history/&journal_page=archive&id=1827&article_id=40501   | 20200212200320 | 200 |
+
+    FIXED: simple new custom PDF link pattern
+
+www.cogentoa.com | no-pdf-link | 4282
+
+    SELECT domain, status, COUNT((domain, status))
+    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
+    WHERE t1.domain = 'www.cogentoa.com'
+    GROUP BY domain, status
+    ORDER BY COUNT DESC;
+
+    SELECT * FROM ingest_file_result
+    WHERE terminal_url LIKE '%www.cogentoa.com%'
+        AND status = 'no-pdf-link'
+    ORDER BY updated DESC
+    LIMIT 10;
+
+         domain       |   status    | count
+    ------------------+-------------+-------
+     www.cogentoa.com | no-pdf-link |  4296
+    (1 row)
+
+    pdf | https://doi.org/10.1080/23311932.2015.1022632 | 2020-02-23 01:06:14.040013+00 | f | no-pdf-link | https://www.cogentoa.com/article/10.1080/23311932.2015.1022632 | 20200208054228 | 200 |
+    pdf | https://doi.org/10.1080/23322039.2020.1730079 | 2020-02-23 01:04:53.754117+00 | f | no-pdf-link | https://www.cogentoa.com/article/10.1080/23322039.2020.1730079 | 20200223010431 | 200 |
+    pdf | https://doi.org/10.1080/2331186x.2018.1460901 | 2020-02-23 01:04:03.47563+00  | f | no-pdf-link | https://www.cogentoa.com/article/10.1080/2331186X.2018.1460901 | 20200207200958 | 200 |
+    pdf | https://doi.org/10.1080/23311975.2017.1412873 | 2020-02-23 01:03:08.063545+00 | f | no-pdf-link | https://www.cogentoa.com/article/10.1080/23311975.2017.1412873 | 20200209034602 | 200 |
+    pdf | https://doi.org/10.1080/23311916.2017.1293481 | 2020-02-23 01:02:42.868424+00 | f | no-pdf-link | https://www.cogentoa.com/article/10.1080/23311916.2017.1293481 | 20200208101623 | 200 |
+
+    FIXED: simple custom URL-based pattern
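Several of the fixes above ("custom URL rule" / "custom PDF link pattern" for degruyter, ehp.niehs.nih.gov, journals.tsu.ru, and cogentoa) share the same shape: when a landing page on a known domain doesn't expose citation_pdf_url, derive a candidate PDF URL from the landing-page URL itself. A sketch of that mechanism follows; the rewrite targets shown are made-up placeholders, since the real per-site PDF URL formats are not recorded in these notes:

    # Sketch of per-domain landing-page-URL -> PDF-URL rewrite rules. The PDF
    # URL templates below are hypothetical placeholders, NOT the real patterns
    # used in sandcrawler; only the overall mechanism is illustrated.
    import re
    from typing import Optional

    REWRITE_RULES = [
        # (domain substring, landing-page regex, PDF URL template)
        ("www.degruyter.com", re.compile(r"/view/books/(.+)$"),
         "https://www.degruyter.com/view/books/{0}.pdf"),           # placeholder
        ("journals.tsu.ru", re.compile(r"article_id=(\d+)"),
         "http://journals.tsu.ru/article/download/{0}.pdf"),        # placeholder
        ("www.cogentoa.com", re.compile(r"/article/(.+)$"),
         "https://www.cogentoa.com/article/{0}.pdf"),                # placeholder
    ]

    def guess_pdf_url(terminal_url: str) -> Optional[str]:
        """Return a candidate PDF URL for known domains, or None."""
        for domain, pattern, template in REWRITE_RULES:
            if domain in terminal_url:
                m = pattern.search(terminal_url)
                if m:
                    return template.format(m.group(1))
        return None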
+
+chemrxiv.org | no-pdf-link | 4186
+
+    SELECT domain, status, COUNT((domain, status))
+    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
+    WHERE t1.domain = 'chemrxiv.org'
+    GROUP BY domain, status
+    ORDER BY COUNT DESC;
+
+    SELECT * FROM ingest_file_result
+    WHERE terminal_url LIKE '%chemrxiv.org%'
+        AND status = 'no-pdf-link'
+    ORDER BY updated DESC
+    LIMIT 10;
+
+        domain    |         status          | count
+    --------------+-------------------------+-------
+     chemrxiv.org | no-pdf-link             |  4202
+     chemrxiv.org | wrong-mimetype          |    64
+     chemrxiv.org | wayback-error           |    14
+     chemrxiv.org | success                 |    12
+     chemrxiv.org | terminal-bad-status     |     4
+     chemrxiv.org | spn2-cdx-lookup-failure |     1
+
+    pdf | https://doi.org/10.26434/chemrxiv.9912812.v1 | 2020-02-23 01:08:34.585084+00 | f | no-pdf-link | https://chemrxiv.org/articles/Proximity_Effect_in_Crystalline_Framework_Materials_Stacking-Induced_Functionality_in_MOFs_and_COFs/9912812/1 | 20200215072929 | 200 |
+    pdf | https://doi.org/10.26434/chemrxiv.7150097    | 2020-02-23 01:05:48.957624+00 | f | no-pdf-link | https://chemrxiv.org/articles/Systematic_Engineering_of_a_Protein_Nanocage_for_High-Yield_Site-Specific_Modification/7150097 | 20200213002430 | 200 |
+    pdf | https://doi.org/10.26434/chemrxiv.7833500.v1 | 2020-02-23 00:55:41.013109+00 | f | no-pdf-link | https://chemrxiv.org/articles/Formation_of_Neutral_Peptide_Aggregates_Studied_by_Mass_Selective_IR_Action_Spectroscopy/7833500/1 | 20200210131343 | 200 |
+    pdf | https://doi.org/10.26434/chemrxiv.8146103    | 2020-02-23 00:52:00.193328+00 | f | no-pdf-link | https://chemrxiv.org/articles/On-Demand_Guest_Release_from_MOF-5_Sealed_with_Nitrophenylacetic_Acid_Photocapping_Groups/8146103 | 20200207215449 | 200 |
+    pdf | https://doi.org/10.26434/chemrxiv.10101419   | 2020-02-23 00:46:14.086913+00 | f | no-pdf-link | https://chemrxiv.org/articles/Biradical_Formation_by_Deprotonation_in_Thiazole-Derivatives_The_Hidden_Nature_of_Dasatinib/10101419 | 20200214044153 | 200 |
+
+    FIXED: complex JSON PDF url extraction; maybe for all figshare?
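The figshare/chemrxiv fix ("complex JSON PDF url extraction") presumably means pulling the download URL out of JSON embedded in the article HTML rather than from a meta tag or anchor link. A rough sketch of that approach, with an assumed embedded-JSON structure (the real figshare page layout may well differ):

    # Rough sketch: extract a PDF download URL from JSON embedded in a
    # figshare-style article page. The "var app = {...}" structure and the
    # article/files/downloadUrl keys are assumptions for illustration.
    import json
    import re
    from typing import Optional

    EMBEDDED_JSON = re.compile(
        r"<script[^>]*>\s*var app = (\{.*?\});\s*</script>", re.DOTALL)

    def extract_figshare_pdf_url(html: str) -> Optional[str]:
        m = EMBEDDED_JSON.search(html)
        if not m:
            return None
        try:
            app = json.loads(m.group(1))
        except ValueError:
            return None
        for f in app.get("article", {}).get("files", []):
            if f.get("name", "").lower().endswith(".pdf"):
                return f.get("downloadUrl")
        return None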
+
+TODO:
+x many datacite prefixes go to IRs, but have is_oa:false. we should probably
+  crawl by default based on release_type
+  => fatcat branch bnewbold-more-ingest
+- re-ingest all degruyter (doi_prefix:10.1515)
+    1456169 doi:10.1515\/*
+      89942 doi:10.1515\/* is_oa:true
+      36350 doi:10.1515\/* in_ia:false is_oa:true
+    1290830 publisher:Gruyter
+      88944 publisher:Gruyter is_oa:true
+      40034 publisher:Gruyter is_oa:true in_ia:false
+- re-ingest all frontiersin
+     248165 publisher:frontiers
+     161996 publisher:frontiers is_oa:true
+      36093 publisher:frontiers is_oa:true in_ia:false
+     121001 publisher:frontiers in_ia:false
+- re-ingest all mdpi
+      43114 publisher:mdpi is_oa:true in_ia:false
+- re-ingest all ahajournals.org
+     132000 doi:10.1161\/*
+       6606 doi:10.1161\/* in_ia:false is_oa:true
+      81349 publisher:"American Heart Association"
+       5986 publisher:"American Heart Association" is_oa:true in_ia:false
+- re-ingest all ehp.niehs.nih.gov
+      25522 doi:10.1289\/*
+      15315 publisher:"Environmental Health Perspectives"
+       8779 publisher:"Environmental Health Perspectives" in_ia:false
+      12707 container_id:3w6amv3ecja7fa3ext35ndpiky in_ia:false is_oa:true
+- re-ingest all journals.tsu.ru
+      12232 publisher:"Tomsk State University"
+      11668 doi:10.17223\/*
+       4861 publisher:"Tomsk State University" in_ia:false is_oa:true
+- re-ingest all www.cogentoa.com
+    3421898 doi:10.1080\/*
+       4602 journal:cogent is_oa:true in_ia:false
+       5631 journal:cogent is_oa:true (let's recrawl all from publisher domain)
+- re-ingest chemrxiv
+       8281 doi:10.26434\/chemrxiv*
+       6918 doi:10.26434\/chemrxiv* in_ia:false
+
+Submit all the above with limits of 1000, then follow up later to check that
+there was success?
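The "limits of 1000" submission mentioned at the end of that file could be scripted as a simple loop over the queries, reusing the fatcat_ingest.py flags that appear elsewhere in these notes. This is just a sketch; the query list below is the subset spelled out in the TODO items, and whether you would actually batch them this way is a judgment call:

    # Sketch: enqueue each backfill query with a cap of 1000 requests, using
    # the fatcat_ingest.py invocation style shown elsewhere in these notes.
    import subprocess

    QUERIES = [
        'doi:10.1515\\/* in_ia:false is_oa:true',
        'publisher:frontiers is_oa:true in_ia:false',
        'publisher:mdpi is_oa:true in_ia:false',
        'doi:10.1161\\/* in_ia:false is_oa:true',
        'container_id:3w6amv3ecja7fa3ext35ndpiky in_ia:false is_oa:true',
        'publisher:"Tomsk State University" in_ia:false is_oa:true',
        'journal:cogent is_oa:true in_ia:false',
        'doi:10.26434\\/chemrxiv* in_ia:false',
    ]

    for query in QUERIES:
        subprocess.run([
            "./fatcat_ingest.py", "--limit", "1000",
            "--env", "prod",
            "--enqueue-kafka", "--kafka-hosts", "wbgrp-svc263.us.archive.org",
            "query", query,
        ], check=True)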
diff --git a/notes/ingest/2020-03-02_ingests.txt b/notes/ingest/2020-03-02_ingests.txt
new file mode 100644
index 0000000..e98ef33
--- /dev/null
+++ b/notes/ingest/2020-03-02_ingests.txt
@@ -0,0 +1,174 @@
+
+## protocols.io
+
+Tested that single ingest is working, and they fixed the PDF format on their
+end recently.
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa container --name protocols.io
+    => Expecting 8448 release objects in search queries
+    => Counter({'estimate': 8448, 'kafka': 8448, 'ingest_request': 8448, 'elasticsearch_release': 8448})
+
+## backfill follow-ups
+
+- re-ingest all degruyter (doi_prefix:10.1515)
+      89942 doi:10.1515\/* is_oa:true
+      36350 doi:10.1515\/* in_ia:false is_oa:true
+      40034 publisher:Gruyter is_oa:true in_ia:false
+  => update:
+     135926 doi:10.1515\/* is_oa:true
+      50544 doi:10.1515\/* in_ia:false is_oa:true
+      54880 publisher:Gruyter is_oa:true in_ia:false
+- re-ingest all frontiersin
+      36093 publisher:frontiers is_oa:true in_ia:false
+  => update
+      22444 publisher:frontiers is_oa:true in_ia:false
+      22029 doi_prefix:10.3389 is_oa:true in_ia:false
+
+    select status, count(*) from ingest_file_result where base_url like 'https://doi.org/10.3389/%' group by status order by count(*) desc;
+
+                   status                | count
+    -------------------------------------+-------
+     success                             | 34721
+     no-pdf-link                         | 18157
+     terminal-bad-status                 |  6799
+     cdx-error                           |  1805
+     wayback-error                       |   333
+     no-capture                          |   301
+     [...]
+
+    select * from ingest_file_result where base_url like 'https://doi.org/10.17723/aarc%' and status = 'no-pdf-link' order by updated desc limit 100;
+
+- re-ingest all mdpi
+      43114 publisher:mdpi is_oa:true in_ia:false
+  => update
+       8548 publisher:mdpi is_oa:true in_ia:false
+
+    select status, count(*) from ingest_file_result where base_url like 'https://doi.org/10.3390/%' group by status order by count(*) desc;
+
+                   status                | count
+    -------------------------------------+--------
+     success                             | 108971
+     cdx-error                           |   6655
+     wrong-mimetype                      |   3359
+     terminal-bad-status                 |   1299
+     wayback-error                       |    151
+     spn2-cdx-lookup-failure             |     87
+
+  => added hack for gzip content-encoding coming through pdf fetch (see the
+     sketch after this list)
+  => will re-ingest all after pushing fix
+
+- re-ingest all ahajournals.org
+     132000 doi:10.1161\/*
+       6606 doi:10.1161\/* in_ia:false is_oa:true
+      81349 publisher:"American Heart Association"
+       5986 publisher:"American Heart Association" is_oa:true in_ia:false
+  => update
+       1337 publisher:"American Heart Association" is_oa:true in_ia:false
+
+                   status                | count
+    -------------------------------------+-------
+     success                             |  1480
+     cdx-error                           |  1176
+     spn2-cdx-lookup-failure             |   514
+     no-pdf-link                         |    85
+     wayback-error                       |    25
+     spn2-error:job-failed               |    18
+
+  => will re-run errors
+- re-ingest all ehp.niehs.nih.gov
+      25522 doi:10.1289\/*
+      15315 publisher:"Environmental Health Perspectives"
+       8779 publisher:"Environmental Health Perspectives" in_ia:false
+      12707 container_id:3w6amv3ecja7fa3ext35ndpiky in_ia:false is_oa:true
+  => update
+       7547 container_id:3w6amv3ecja7fa3ext35ndpiky in_ia:false is_oa:true
+- re-ingest all journals.tsu.ru
+      12232 publisher:"Tomsk State University"
+      11668 doi:10.17223\/*
+       4861 publisher:"Tomsk State University" in_ia:false is_oa:true
+  => update
+       2605 publisher:"Tomsk State University" in_ia:false is_oa:true
+  => just need to retry these? seem fine
+- re-ingest all www.cogentoa.com
+    3421898 doi:10.1080\/*
+       4602 journal:cogent is_oa:true in_ia:false
+       5631 journal:cogent is_oa:true (let's recrawl all from publisher domain)
+  => update
+        254 journal:cogent is_oa:true in_ia:false
+- re-ingest chemrxiv
+       8281 doi:10.26434\/chemrxiv*
+       6918 doi:10.26434\/chemrxiv* in_ia:false
+  => update
+       4890 doi:10.26434\/chemrxiv* in_ia:false
+  => re-ingest
+  => allow non-OA
+
+    # american archivist
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa container --container-id zpobyv4vbranllc7oob56tgci4
+    Counter({'estimate': 2920, 'elasticsearch_release': 2920, 'kafka': 2911, 'ingest_request': 2911})
+    => 2020-02-04: 85 / 3,005
+    => 2020-03-02: 2,182 / 3,005 preserved. some no-pdf-link, otherwise just a
+       bunch of spn2-error
+    => looks like the no-pdf-url is due to a pinnacle-secure.allenpress.com
+       soft-blocking loop
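The mdpi "hack for gzip content-encoding coming through pdf fetch" referenced in the list above presumably handles bodies that arrive gzip-compressed even though the fetcher expected raw PDF bytes. A sketch of that kind of guard (an assumption about the shape of the fix, not the actual sandcrawler code):

    # Sketch: if a fetched "PDF" body is actually gzip-compressed (as with some
    # mdpi responses where Content-Encoding: gzip leaks through), transparently
    # decompress it before mimetype sniffing and storage.
    import gzip

    GZIP_MAGIC = b"\x1f\x8b"
    PDF_MAGIC = b"%PDF"

    def normalize_pdf_body(body: bytes) -> bytes:
        """Return raw PDF bytes, decompressing a stray gzip layer if present."""
        if body.startswith(GZIP_MAGIC):
            try:
                body = gzip.decompress(body)
            except OSError:
                pass  # leave as-is; downstream mimetype check will flag it
        return body

    # example: a gzip-wrapped PDF header round-trips back to raw PDF bytes
    assert normalize_pdf_body(gzip.compress(b"%PDF-1.5 ...")).startswith(PDF_MAGIC)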
+
+## backfill re-ingests
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa --force-recrawl container --container-id zpobyv4vbranllc7oob56tgci4
+    => Counter({'elasticsearch_release': 823, 'estimate': 823, 'ingest_request': 814, 'kafka': 814})
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org container --publisher Gruyter
+    => Counter({'elasticsearch_release': 54880, 'estimate': 54880, 'kafka': 51497, 'ingest_request': 51497})
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org query 'publisher:"Tomsk State University"'
+    => Counter({'ingest_request': 2605, 'kafka': 2605, 'elasticsearch_release': 2605, 'estimate': 2605})
+
+    ./fatcat_ingest.py --limit 25 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org query "doi:10.26434\/chemrxiv*"
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org container --publisher mdpi
+    => Counter({'estimate': 8548, 'elasticsearch_release': 8548, 'ingest_request': 6693, 'kafka': 6693})
+    => NOTE: about 2k not enqueued
+
+## re-ingest all broken
+
+    COPY (
+        SELECT row_to_json(ingest_request.*) FROM ingest_request
+        LEFT JOIN ingest_file_result ON ingest_file_result.base_url = ingest_request.base_url
+        WHERE ingest_request.ingest_type = 'pdf'
+            AND ingest_file_result.ingest_type = 'pdf'
+            AND ingest_file_result.updated < NOW() - '1 day'::INTERVAL
+            AND ingest_file_result.hit = false
+            AND ingest_file_result.status like 'spn2-%'
+    ) TO '/grande/snapshots/reingest_spn2_20200302.rows.json';
+    => COPY 14849
+
+    COPY (
+        SELECT row_to_json(ingest_request.*) FROM ingest_request
+        LEFT JOIN ingest_file_result ON ingest_file_result.base_url = ingest_request.base_url
+        WHERE ingest_request.ingest_type = 'pdf'
+            AND ingest_file_result.ingest_type = 'pdf'
+            AND ingest_file_result.hit = false
+            AND ingest_file_result.status like 'cdx-error'
+    ) TO '/grande/snapshots/reingest_cdxerr_20200302.rows.json';
+    => COPY 507610
+
+    This is a huge number! Re-ingest via bulk?
+
+Transform:
+
+    ./scripts/ingestrequest_row2json.py /grande/snapshots/reingest_spn2_20200302.rows.json > reingest_spn2_20200302.json
+    ./scripts/ingestrequest_row2json.py /grande/snapshots/reingest_cdxerr_20200302.rows.json > reingest_cdxerr_20200302.json
+
+Push to kafka:
+
+    cat reingest_spn2err_20200218.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p -1
+    # accidentally also piped the above through ingest-file-requests-bulk...
+    # which could actually be bad
+    cat reingest_cdxerr_20200302.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+## biorxiv/medrxiv
+
+    8026 doi:10.1101\/20*
+    2159 doi:10.1101\/20* in_ia:false
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa query 'doi:10.1101\/20* in_ia:false'
+    => Counter({'estimate': 2159, 'ingest_request': 2159, 'elasticsearch_release': 2159, 'kafka': 2159})
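Referring back to the "Transform" step above: ./scripts/ingestrequest_row2json.py takes the row_to_json output written by the psql COPY and turns it back into clean, one-per-line ingest request JSON for kafkacat. A guess at what that involves (COPY's text format doubles backslashes, so lines typically need unescaping before they parse); this is an illustrative stand-in, not the real script:

    #!/usr/bin/env python3
    # Illustrative stand-in for scripts/ingestrequest_row2json.py: undo the
    # backslash escaping that `COPY ... TO` applies to row_to_json() output
    # and re-emit compact JSON, one request per line. Details are assumptions.
    import json
    import sys

    def main():
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            # COPY text format escapes backslashes (\\ for \); undo that
            unescaped = line.replace("\\\\", "\\")
            try:
                req = json.loads(unescaped)
            except ValueError:
                continue  # skip unparseable rows rather than aborting
            print(json.dumps(req, sort_keys=True))

    if __name__ == "__main__":
        main()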