ingests: normalize file names; commit updates

author: Bryan Newbold <bnewbold@archive.org> 2020-05-26 14:47:17 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-05-26 14:47:17 -0700
commit: 5dd8785d710cf7d067afdc691069bfa74406e06a (patch)
tree: 8ff16b25cee10f38127caf7fdb266d41fea12d83 /notes/ingest/2020-03-04_mag.md
parent: 4598ea9242d1001e473e6340342afea854868577 (diff)
download: sandcrawler-5dd8785d710cf7d067afdc691069bfa74406e06a.tar.gz
sandcrawler-5dd8785d710cf7d067afdc691069bfa74406e06a.zip
1 files changed, 0 insertions, 576 deletions
diff --git a/notes/ingest/2020-03-04_mag.md b/notes/ingest/2020-03-04_mag.md
deleted file mode 100644
index 428ce05..0000000
--- a/notes/ingest/2020-03-04_mag.md
+++ /dev/null
@@ -1,576 +0,0 @@
-
-Rough plan:
-
-- run bulk and/or regular ingest requests for just those of AIT partners (200k?)
-- persist ingest requests (22 million or so)
-- run bulk ingest over 'no status' / 'no match' requests (aka, those not in unpaywall)
-- crawl those which are no-capture
-
-
-## Generate Requests
-
-Newer version of `mag_ingest_request.sh` script requires venv with urlcanon
-installed.
-
-Starting with the 2020-01-23 MAG dump, will generate a full ingest request set
-(including DOI `ext_id` when available), with any dominant domains removed (eg,
-arxiv.org):
-
-    export LC_ALL=C
-    cat PaperUrls_mag_url_doi.all.txt | rg -a -v arxiv.org | rg -a "://" | ./mag_ingest_request.py - --created-date 2020-01-23 | pv -l > ingest_requests_mag-2020-01-23.doi.json
-    => previously 25.6M
-    => 25.6M 2:29:43 [2.85k/s]
-
-    export LC_ALL=C
-    zcat PaperUrls_mag_url_pmid.txt.gz | rg -a -v arxiv.org | rg -a "://" | ./mag_ingest_request.py - --created-date 2020-01-23 --pmid | pv -l > ingest_requests_mag-2020-01-23.pmid.json
-    => 4.3M 0:25:45 [2.78k/s]
-
-    export LC_ALL=C
-    cat ingest_requests_mag-2020-01-23.json | jq -r "[.base_url, .ext_ids.doi] | @tsv" | sort -u -S 4G > ingest_requests_mag-2020-01-23.full.seed_id
-
-    zcat PaperUrls_PaperExtendedAttributes_pdf.txt.gz | wc -l
-    => 6,504,907
-
-    zcat PaperUrls_mag_url_pmid.txt.gz | wc -l
-    => 4,369,832
-
-    cat ingest_requests_mag-2020-01-23.json | jq .ext_ids.doi -r | rg -a -v '^null$' | wc -l
-    => previously 15,707,405
-    => 15,702,581
-
-    cat ingest_requests_mag-2020-01-23.pmid.json | jq .base_url -r | rg ' ' | wc -l
-    => 0
-    URL encoding seems to be working
-
-## Persist Ingest Requests
-
-First pmid ingest requests, then the all/doi file. The reason to do this order
-is that the all/doi file will have some rows with no DOI (and thus no
-`ext_id`), while the PMID file will not.
-
-    # small sample
-    head /schnell/mag/20200123/ingest_requests_mag-2020-01-23.pmid.json | ./persist_tool.py ingest-request -
-    Worker: Counter({'total': 10, 'skip-result-fields': 10})
-    JSON lines pushed: Counter({'total': 10, 'pushed': 10})
-
-    cat /schnell/mag/20200123/ingest_requests_mag-2020-01-23.pmid.json | ./persist_tool.py ingest-request -
-    => 4.3M 0:16:46 [4.27k/s]
-    Worker: Counter({'total': 4295026, 'insert-requests': 4241862, 'update-requests': 0})
-    JSON lines pushed: Counter({'total': 4295026, 'pushed': 4295026})
-    => hit a bug on first attempt, which is why total/insert results don't match
-
-    cat /schnell/mag/20200123/ingest_requests_mag-2020-01-23.doi.json | ./persist_tool.py ingest-request -
-    => 25.6M 2:21:54 [3.01k/s]
-    Worker: Counter({'total': 25596559, 'insert-requests': 21348393, 'update-requests': 0})
-    JSON lines pushed: Counter({'pushed': 25596559, 'total': 25596559})
-
-
-## Crawl/Dupe Status
-
-    SELECT ingest_file_result.status, COUNT(*)
-    FROM ingest_request
-    LEFT JOIN ingest_file_result
-        ON ingest_file_result.ingest_type = ingest_request.ingest_type
-        AND ingest_file_result.base_url = ingest_request.base_url
-    WHERE 
-        ingest_request.ingest_type = 'pdf'
-        AND ingest_request.link_source = 'mag'
-    GROUP BY status
-    ORDER BY COUNT DESC
-    LIMIT 20;
-
-After just PMID links:
-
-           status        |  count
-    ---------------------+---------
-                         | 3000115
-     success             | 1126881
-     no-capture          |   69459
-     terminal-bad-status |   30259
-     redirect-loop       |   11656
-     no-pdf-link         |    2836
-     wrong-mimetype      |    1456
-     link-loop           |    1259
-     wayback-error       |    1232
-     cdx-error           |     932
-     null-body           |      85
-     petabox-error       |      50
-     bad-redirect        |       1
-    (13 rows)
-
-After all links:
-
-    SELECT COUNT(*)
-    FROM ingest_request
-    WHERE
-        ingest_request.ingest_type = 'pdf'
-        AND ingest_request.link_source = 'mag';
-    => 25596563
-
-
-           status        |  count   
-    ---------------------+----------
-                         | 21130841
-     success             |  3915682
-     no-capture          |   391813
-     terminal-bad-status |    76488
-     redirect-loop       |    44202
-     wrong-mimetype      |    16418
-     no-pdf-link         |    10995
-     wayback-error       |     3679
-     cdx-error           |     3414
-     link-loop           |     2098
-     null-body           |      709
-     petabox-error       |      221
-     bad-gzip-encoding   |        2
-     bad-redirect        |        1
-    (14 rows)
-
-Somewhat more un-ingested than expected.
-
-Dump requests:
-
-    COPY (
-        SELECT row_to_json(ingest_request.*) FROM ingest_request
-        LEFT JOIN ingest_file_result
-            ON ingest_file_result.ingest_type = ingest_request.ingest_type
-            AND ingest_file_result.base_url = ingest_request.base_url
-        WHERE
-            ingest_request.ingest_type = 'pdf'
-            AND ingest_request.link_source = 'mag'
-            AND ingest_file_result.status IS NULL
-    ) TO '/grande/snapshots/mag_noingest_20200305.rows.json';
-    => COPY 21,130,841
-
-Transform and shuf:
-
-    ./scripts/ingestrequest_row2json.py /grande/snapshots/mag_noingest_20200305.rows.json | pv -l | shuf | gzip > /grande/snapshots/mag_noingest_20200305.shuf.json.gz
-    => 21.1M 0:18:57 [18.6k/s]
-
-## Bulk Ingest Partner Output
-
-These are subsets of the full list from potential AIT-S partners; want to run
-these through the pipeline before the full batch. Duplication against the full
-batch should be minimal.
-
-Size:
-
-    bnewbold@ia601101$ cat ingest_requests_mag-2020-01-23.cornell.json | jq .ext_ids.doi | rg -v '^null$' | wc -l
-    29007
-    bnewbold@ia601101$ wc -l ingest_requests_mag-2020-01-23.cornell.json
-    34265 ingest_requests_mag-2020-01-23.cornell.json
-
-Test ingest:
-
-    head -n200 ingest_requests_mag-2020-01-23.cornell.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-Full ingests:
-
-    cat ingest_requests_mag-2020-01-23.cornell.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-    cat ingest_requests_mag-2020-01-23.alberta.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-    cat ingest_requests_mag-2020-01-23.columbia.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-    cat ingest_requests_mag-2020-01-23.emory.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-    cat ingest_requests_mag-2020-01-23.stanford.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-## Bulk Ingest
-
-Shard it into batches of roughly 1 million:
-
-    cd /grande/snapshots/
-    zcat /grande/snapshots/mag_noingest_20200305.shuf.json.gz | split -n r/20 -d - mag_noingest_20200305.ingest_request.split_ --additional-suffix=.json
-
-Add a single batch like:
-
-    cat mag_noingest_20200305.ingest_request.split_00.json | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-    partner ingests (see above)
-    => 2020-03-05 12:49:   118,396
-    1056543 mag_noingest_20200305.ingest_request.split_00.json
-    => 2020-03-05 14:34: 1,055,224
-    => check on stats/ratios; filter by ingest update time?
-    1056542 mag_noingest_20200305.ingest_request.split_01.json
-    1056542 mag_noingest_20200305.ingest_request.split_02.json
-    1056542 mag_noingest_20200305.ingest_request.split_03.json
-    1056542 mag_noingest_20200305.ingest_request.split_04.json
-    1056542 mag_noingest_20200305.ingest_request.split_05.json
-    1056542 mag_noingest_20200305.ingest_request.split_06.json
-    1056542 mag_noingest_20200305.ingest_request.split_07.json
-    1056542 mag_noingest_20200305.ingest_request.split_08.json
-    1056542 mag_noingest_20200305.ingest_request.split_09.json
-    => 2020-03-05 18:04: 10,009,297
-    => 2020-03-06 16:53:  6,553,946
-    1056542 mag_noingest_20200305.ingest_request.split_10.json
-    1056542 mag_noingest_20200305.ingest_request.split_11.json
-    1056542 mag_noingest_20200305.ingest_request.split_12.json
-    1056542 mag_noingest_20200305.ingest_request.split_13.json
-    1056542 mag_noingest_20200305.ingest_request.split_14.json
-    1056542 mag_noingest_20200305.ingest_request.split_15.json
-    1056542 mag_noingest_20200305.ingest_request.split_16.json
-    1056542 mag_noingest_20200305.ingest_request.split_17.json
-    1056542 mag_noingest_20200305.ingest_request.split_18.json
-    1056542 mag_noingest_20200305.ingest_request.split_19.json
-    => 2020-03-06 16:59: 17,001,032
-
-Stats from bulk ingest:
-
-    SELECT ingest_file_result.status, COUNT(*)
-    FROM ingest_request
-    LEFT JOIN ingest_file_result
-        ON ingest_file_result.ingest_type = ingest_request.ingest_type
-        AND ingest_file_result.base_url = ingest_request.base_url
-    WHERE 
-        ingest_request.ingest_type = 'pdf'
-        AND ingest_request.link_source = 'mag'
-    GROUP BY status
-    ORDER BY COUNT DESC
-    LIMIT 20;
-
-           status        |  count
-    ---------------------+----------
-     no-capture          | 12237193
-     success             | 11991293
-     no-pdf-link         |   521691
-     redirect-loop       |   437192
-     terminal-bad-status |   231181
-     link-loop           |    92633
-     cdx-error           |    33631
-     wrong-mimetype      |    28638
-     wayback-error       |    19651
-     null-body           |     2682
-     petabox-error       |      727
-                         |       47
-     bad-redirect        |       44
-     bad-gzip-encoding   |        7
-    (14 rows)
-
-Failures by domain:
-
-    SELECT domain, status, COUNT((domain, status))
-    FROM (
-        SELECT
-            ingest_file_result.ingest_type,
-            ingest_file_result.status,
-            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
-        FROM ingest_file_result
-        LEFT JOIN ingest_request
-            ON ingest_file_result.ingest_type = ingest_request.ingest_type
-            AND ingest_file_result.base_url = ingest_request.base_url
-        WHERE 
-            ingest_file_result.ingest_type = 'pdf'
-            AND ingest_request.link_source = 'mag'
-    ) t1
-    WHERE t1.domain != ''
-        AND t1.status != 'success'
-        AND t1.status != 'no-capture'
-    GROUP BY domain, status
-    ORDER BY COUNT DESC
-    LIMIT 30;
-
-                    domain                |       status        | count  
-    --------------------------------------+---------------------+--------
-     dialnet.unirioja.es                  | redirect-loop       | 240967
-     onlinelibrary.wiley.com              | no-pdf-link         | 147696
-     agupubs.onlinelibrary.wiley.com      | no-pdf-link         |  72639
-     iopscience.iop.org                   | terminal-bad-status |  69591
-     febs.onlinelibrary.wiley.com         | no-pdf-link         |  49874
-     www.researchgate.net                 | redirect-loop       |  42859
-     journals.sagepub.com                 | no-pdf-link         |  27448
-     papers.ssrn.com                      | redirect-loop       |  27328
-     dialnet.unirioja.es                  | terminal-bad-status |  20320
-     physoc.onlinelibrary.wiley.com       | no-pdf-link         |  20232
-     science.sciencemag.org               | link-loop           |  17811
-     espace.library.uq.edu.au             | redirect-loop       |  17185
-     bpspubs.onlinelibrary.wiley.com      | no-pdf-link         |  15785
-     obgyn.onlinelibrary.wiley.com        | no-pdf-link         |  15301
-     anthrosource.onlinelibrary.wiley.com | no-pdf-link         |  13746
-     www.tandfonline.com                  | no-pdf-link         |  13303
-     aasldpubs.onlinelibrary.wiley.com    | no-pdf-link         |  11070
-     link.springer.com                    | redirect-loop       |  10594
-     www.redalyc.org:9081                 | no-pdf-link         |  10515
-     watermark.silverchair.com            | terminal-bad-status |   9739
-     www.bmj.com                          | link-loop           |   9389
-     www.repository.naturalis.nl          | redirect-loop       |   8213
-     bjp.rcpsych.org                      | link-loop           |   8045
-     aslopubs.onlinelibrary.wiley.com     | no-pdf-link         |   7814
-     nph.onlinelibrary.wiley.com          | no-pdf-link         |   7801
-     iopscience.iop.org                   | redirect-loop       |   7697
-     journals.tubitak.gov.tr              | wrong-mimetype      |   7159
-     www.biorxiv.org                      | wrong-mimetype      |   7067
-     www.erudit.org                       | redirect-loop       |   6819
-     besjournals.onlinelibrary.wiley.com  | no-pdf-link         |   6254
-    (30 rows)
-
-Domains to follow-up (eg, sandcrawler ingest tests/tweaks):
-- dialnet.unirioja.es | redirect-loop | 240967
-- www.researchgate.net | redirect-loop |  42859
-- www.redalyc.org:9081 | no-pdf-link |  10515
-- www.repository.naturalis.nl | redirect-loop | 8213
-- bjp.rcpsych.org | link-loop | 8045
-- journals.tubitak.gov.tr | wrong-mimetype | 7159
-- www.erudit.org | redirect-loop | 6819
-
-The dialnet.unirioja.es ones may be worth re-crawling via heritrix?
-
-Top uncrawled domains:
-
-    SELECT domain, status, COUNT((domain, status))
-    FROM (
-        SELECT
-            ingest_file_result.ingest_type,
-            ingest_file_result.status,
-            substring(ingest_file_result.base_url FROM '[^/]+://([^/]*)') AS domain
-        FROM ingest_file_result
-        LEFT JOIN ingest_request
-            ON ingest_file_result.ingest_type = ingest_request.ingest_type
-            AND ingest_file_result.base_url = ingest_request.base_url
-        WHERE
-            ingest_file_result.ingest_type = 'pdf'
-            AND ingest_request.link_source = 'mag'
-    ) t1
-    WHERE t1.domain != ''
-        AND t1.status = 'no-capture'
-    GROUP BY domain, status
-    ORDER BY COUNT DESC
-    LIMIT 30;
-
-                 domain              |   status   | count  
-    ---------------------------------+------------+--------
-     ieeexplore.ieee.org             | no-capture | 957835
-     link.springer.com               | no-capture | 394121
-     www.researchgate.net            | no-capture | 376974
-     cyberleninka.ru                 | no-capture | 376012
-     iopscience.iop.org              | no-capture | 348791
-     papers.ssrn.com                 | no-capture | 286860
-     dergipark.org.tr                | no-capture | 217556
-     dialnet.unirioja.es             | no-capture | 214398
-     academic.oup.com                | no-capture | 212364
-     www.tandfonline.com             | no-capture | 148940
-     journals.sagepub.com            | no-capture | 144695
-     www.papersearch.net             | no-capture | 138986
-     absimage.aps.org                | no-capture | 111976
-     apps.dtic.mil                   | no-capture | 106984
-     www.cambridge.org               | no-capture |  97533
-     www.bmj.com                     | no-capture |  92437
-     bioone.org                      | no-capture |  87573
-     science.sciencemag.org          | no-capture |  75723
-     shodhganga.inflibnet.ac.in:8080 | no-capture |  75395
-     www.jstor.org                   | no-capture |  73230
-     works.bepress.com               | no-capture |  68747
-     www.scielo.org.co               | no-capture |  59650
-     hrcak.srce.hr                   | no-capture |  59332
-     muse.jhu.edu                    | no-capture |  57828
-     onlinelibrary.wiley.com         | no-capture |  55621
-     www.jbc.org                     | no-capture |  54608
-     www.jstage.jst.go.jp            | no-capture |  53631
-     www.redalyc.org                 | no-capture |  50406
-     lup.lub.lu.se                   | no-capture |  47469
-     www.dtic.mil                    | no-capture |  41820
-    (30 rows)
-
-## Heritrix Seedlist Generation
-
-Dump ingest requests (filtered for some domains that don't expect to crawl via
-heritrix):
-
-    COPY (  
-        SELECT row_to_json(ingest_request.*) FROM ingest_request
-        LEFT JOIN ingest_file_result
-            ON ingest_file_result.ingest_type = ingest_request.ingest_type
-            AND ingest_file_result.base_url = ingest_request.base_url
-        WHERE
-            ingest_request.ingest_type = 'pdf'
-            AND ingest_request.link_source = 'mag'
-            AND ingest_file_result.status = 'no-capture'
-            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
-            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
-            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
-            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
-            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
-            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
-            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
-    ) TO '/grande/snapshots/mag_nocapture_20200313.rows.json';
-    => COPY 11714199
-
-    # in sandcrawler pipenv
-    ./scripts/ingestrequest_row2json.py /grande/snapshots/mag_nocapture_20200313.rows.json > /grande/snapshots/mag_nocapture_20200313.json
-
-## Bulk Ingest of Heritrix Content
-
-Small sample:
-
-    head -n 1000 mag_nocapture_20200313.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-Full run:
-
-    cat mag_nocapture_20200313.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-
-    2020-04-07 12:19 (pacific): 11,703,871
-
-## Post-bulk-ingest
-
-Around 2020-04-28, seems like main wave of bulk ingest is complete. Will need
-to re-try things like cdx-error.
-
-Current status:
-
-                status             |  count
-    -------------------------------+----------
-     success                       | 18491799
-     redirect-loop                 |  1968530
-     no-capture                    |  1373657
-     no-pdf-link                   |  1311842
-     link-loop                     |  1296439
-     terminal-bad-status           |   627577
-     cdx-error                     |   418278
-     wrong-mimetype                |    50141
-     wayback-error                 |    37159
-     petabox-error                 |    11249
-     null-body                     |     6295
-     gateway-timeout               |     3051
-     spn2-cdx-lookup-failure       |      328
-     spn2-error:invalid-url-syntax |       93
-     bad-redirect                  |       75
-                                   |       47
-     invalid-host-resolution       |       28
-     spn2-error                    |       10
-     bad-gzip-encoding             |        7
-     redirects-exceeded            |        2
-    (20 rows)
-
-Lots of cdx-error to retry.
-
-The no-capture links are probably a mix of domain-blocklist and things that
-failed in bulk mode. Will dump and re-attempt them:
-
-
-    COPY (  
-        SELECT row_to_json(ingest_request.*) FROM ingest_request
-        LEFT JOIN ingest_file_result
-            ON ingest_file_result.ingest_type = ingest_request.ingest_type
-            AND ingest_file_result.base_url = ingest_request.base_url
-        WHERE
-            ingest_request.ingest_type = 'pdf'
-            AND ingest_request.link_source = 'mag'
-            AND ingest_file_result.status = 'no-capture'
-            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
-            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
-            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
-            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
-            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
-            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
-            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
-    ) TO '/grande/snapshots/mag_nocapture_20200420.rows.json';
-    => 859849
-
-What domains are these?
-
-    cat mag_nocapture_20200420.rows.json | jq .base_url -r | cut -f3 -d/ | sort | uniq -c | sort -nr | head -n30
-
-Let's filter down more:
-
-    cat mag_nocapture_20200420.rows.json | rg -v 'www.researchgate.net' | rg -v 'muse.jhu.edu' | rg -v 'www.omicsonline.org' | rg -v 'link.springer.com' | rg -v 'iopscience.iop.org' | rg -v 'ieeexplore.ieee.org' | shuf > mag_nocapture_20200420.rows.filtered.json
-
-    wc -l mag_nocapture_20200420.rows.filtered.json
-    423085 mag_nocapture_20200420.rows.filtered.json
-
-Ok, enqueue!
-
-    cat mag_nocapture_20200420.rows.filtered.json | shuf | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p -1
-
-## Final Stats
-
-... for this round of ingest:
-
-    SELECT ingest_file_result.status, COUNT(*)
-    FROM ingest_request
-    LEFT JOIN ingest_file_result
-        ON ingest_file_result.ingest_type = ingest_request.ingest_type
-        AND ingest_file_result.base_url = ingest_request.base_url
-    WHERE 
-        ingest_request.ingest_type = 'pdf'
-        AND ingest_request.link_source = 'mag'
-    GROUP BY status
-    ORDER BY COUNT DESC
-    LIMIT 20;
-
-
-                   status                |  count
-    -------------------------------------+----------
-     success                             | 18712849
-     redirect-loop                       |  2008110
-     no-pdf-link                         |  1337012
-     link-loop                           |  1326761
-     no-capture                          |  1030693
-     terminal-bad-status                 |   637143
-     gateway-timeout                     |   193194
-     cdx-error                           |   125907
-     spn2-cdx-lookup-failure             |    77842
-     wrong-mimetype                      |    50882
-     wayback-error                       |    40278
-     invalid-host-resolution             |    35201
-     petabox-error                       |    11254
-     null-body                           |     6485
-     spn2-error                          |     1643
-     spn2-error:job-failed               |      747
-     spn2-error:invalid-url-syntax       |      325
-     spn2-error:soft-time-limit-exceeded |      190
-     bad-redirect                        |       77
-                                         |       47
-    (20 rows)
-
-Failures by domain:
-
-    SELECT domain, status, COUNT((domain, status))
-    FROM (
-        SELECT
-            ingest_file_result.ingest_type,
-            ingest_file_result.status,
-            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
-        FROM ingest_file_result
-        LEFT JOIN ingest_request
-            ON ingest_file_result.ingest_type = ingest_request.ingest_type
-            AND ingest_file_result.base_url = ingest_request.base_url
-        WHERE 
-            ingest_file_result.ingest_type = 'pdf'
-            AND ingest_request.link_source = 'mag'
-    ) t1
-    WHERE t1.domain != ''
-        AND t1.status != 'success'
-    GROUP BY domain, status
-    ORDER BY COUNT DESC
-    LIMIT 30;
-
-
-                 domain              |       status        | count
-    ---------------------------------+---------------------+--------
-     ieeexplore.ieee.org             | redirect-loop       | 677712
-     cyberleninka.ru                 | link-loop           | 308390
-     papers.ssrn.com                 | link-loop           | 281804
-     ieeexplore.ieee.org             | link-loop           | 273559
-     dialnet.unirioja.es             | redirect-loop       | 240504
-     dialnet.unirioja.es             | terminal-bad-status | 232481
-     onlinelibrary.wiley.com         | no-pdf-link         | 220932
-     iopscience.iop.org              | terminal-bad-status | 172480
-     validate.perfdrive.com          | no-pdf-link         | 172312
-     link.springer.com               | redirect-loop       | 130398
-     agupubs.onlinelibrary.wiley.com | no-pdf-link         | 113382
-     iopscience.iop.org              | redirect-loop       | 105234
-     www.bmj.com                     | link-loop           | 100354
-     www.researchgate.net            | redirect-loop       |  84366
-     www.cambridge.org               | link-loop           |  83171
-     jamanetwork.com                 | no-pdf-link         |  75053
-     febs.onlinelibrary.wiley.com    | no-pdf-link         |  74872
-     www.jstor.org                   | redirect-loop       |  72059
-     journals.sagepub.com            | no-pdf-link         |  63028
-     science.sciencemag.org          | redirect-loop       |  62927
-     profile.thieme.de               | no-pdf-link         |  62406
-     cyberleninka.ru                 | redirect-loop       |  56733
-     link.springer.com               | link-loop           |  47608
-     physoc.onlinelibrary.wiley.com  | no-pdf-link         |  30180
-     science.sciencemag.org          | link-loop           |  29908
-     papers.ssrn.com                 | redirect-loop       |  27255
-     obgyn.onlinelibrary.wiley.com   | no-pdf-link         |  26789
-     www.computer.org                | no-pdf-link         |  26444
-     watermark.silverchair.com       | terminal-bad-status |  25934
-     www.nature.com                  | redirect-loop       |  25306
-    (30 rows)
author	Bryan Newbold <bnewbold@archive.org>	2020-05-26 14:47:17 -0700
committer	Bryan Newbold <bnewbold@archive.org>	2020-05-26 14:47:17 -0700
commit	5dd8785d710cf7d067afdc691069bfa74406e06a (patch)
tree	8ff16b25cee10f38127caf7fdb266d41fea12d83 /notes/ingest/2020-03-04_mag.md
parent	4598ea9242d1001e473e6340342afea854868577 (diff)
download	sandcrawler-5dd8785d710cf7d067afdc691069bfa74406e06a.tar.gz sandcrawler-5dd8785d710cf7d067afdc691069bfa74406e06a.zip