import arabesque script

author: Bryan Newbold <bnewbold@archive.org> 2019-02-28 12:01:12 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2019-02-28 12:01:12 -0800
commit: dd50d0bb50549598da2e9ed595a12f154636ae2e (patch)
tree: 7220a89aac25fbd77acdd516f8f4845ee3e44852
parent: b5f283913367d28578ad403b850e3a5c2e380887 (diff)
download: arabesque-dd50d0bb50549598da2e9ed595a12f154636ae2e.tar.gz
arabesque-dd50d0bb50549598da2e9ed595a12f154636ae2e.zip
5 files changed, 1110 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..2ae80e8
--- /dev/null
+++ b/README.md
@@ -0,0 +1,327 @@
+
+Going to look something like:
+
+    zcat DOI-LANDING-CRAWL-2018-06-full_crawl_logs/DOI-LANDING-CRAWL-2018-06.$SHARD.us.archive.org.crawl.log.gz | tr -cd '[[:print:]]\n\r\t' | rg '//doi.org/' | /fast/scratch/unpaywall/make_doi_list.py > doi_list.$SHARD.txt
+
+    zcat /fast/unpaywall-munging/DOI-LANDING-CRAWL-2018-06/DOI-LANDING-CRAWL-2018-06-full_crawl_logs/DOI-LANDING-CRAWL-2018-06.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD.db
+
+    cat /fast/unpaywall-munging/DOI-LANDING-CRAWL-2018-06/doi_list.$SHARD.txt | pv | /fast/scratch/unpaywall/make_output.py redirectmap.$SHARD.db > doi_index.$SHARD.tsv
+
+Let's start with:
+
+    mkdir UNPAYWALL-PDF-CRAWL-2018-07
+    ia download UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs
+
+export SHARD=wbgrp-svc279 # running
+export SHARD=wbgrp-svc280 # running
+export SHARD=wbgrp-svc281 # running
+export SHARD=wbgrp-svc282 # running
+zcat UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs/UNPAYWALL-PDF-CRAWL-2018-07.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD.db
+zcat UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs/UNPAYWALL-PDF-CRAWL-2018-07-PATCH.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD-PATCH.db
+
+### Design
+
+If possible, we'd like something that will work with as many crawls as
+possible. Want to work with shards, then merge outputs.
+
+Output: JSON and/or sqlite rows with:
+
+- identifier (optional?)
+- initial-uri (indexed)
+- breadcrumbs
+- final-uri
+- final-http-status
+- final-sha1
+- final-mimetype-normalized
+- final-was-dedupe (boolean)
+- final-cdx (string, if would be extracted)
+
+This will allow filtering on various fields, checking success stats, etc.
+
+Components:
+
+- {identifier, initial-uri} input (basically, seedlist)
+- full crawl logs
+- raw CDX, indexed by final-uri
+- referer map
+
+Process:
+
+- use full crawl logs to generate a referer map; this is a dict with keys as
+  URI, and value as {referer URI, status, breadcrumb, was-dedupe, mimetype};
+  the referer may be null. database can be whatever.
+- iterate through CDX, filtering by HTTP status and mimetype (including
+  revists). for each potential, lookup in referer map. if mimetype is
+  confirmed, then iterate through full referer chain, and print a final line
+  which is all-but-identifier
+- iterate through identifier/URI list, inserting identifier columns
+
+Complications:
+
+- non-PDF terminals: error codes, or HTML only (failed to find PDF)
+- multiple terminals per seed; eg, multiple PDFs, or PDF+postscript+HTML or
+  whatever
+
+Process #2:
+
+- use full crawl logs to generate a bi-directional referer map: sqlite3 table
+  with uri, referer-uri both indexed. also {status, breadcrumb, was-dedupe,
+  mimetype} rows
+- iterate through CDX, selecting successful "terminal" lines (mime, status).
+  use referer map to iterate back to an initial URI, and generate a row. lookup
+  output table by initial-uri; if an entry already exists, behavior is
+  flag-dependent: overwrite if "better", or add a second line
+- in a second pass, update rows with identifier based on URI. if rows not
+  found/updated, then do a "forwards" lookup to a terminal condition, and write
+  that status. note that these rows won't have CDX.
+
+More Complications:
+
+- handling revisits correctly... raw CDX probably not actually helpful for PDF
+  case, only landing/HTML case
+- given above, should probably just (or as a mode) iterate over only crawl logs
+  in "backwards" stage
+- fan-out of "forward" redirect map, in the case of embeds and PDF link
+  extraction
+- could pull out first and final URI domains for easier SQL stats/reporting
+- should include final datetime (for wayback lookups)
+
+NOTE/TODO: journal-level dumps of fatcat metadata would be cool... could
+roll-up release dumps as an alternative to hitting elasticsearch? or just hit
+elasticsearch and both dump to sqlite and enrich elastic doc? should probably
+have an indexed "last updated" timestamp in all elastic docs
+
+### Crawl Log Notes
+
+Fields:
+
+    0   timestamp (ISO8601) of log line
+    1   status code (HTTP or negative)
+    2   size in bytes (content only)
+    3   URI of this download
+    4   discovery breadcrumbs
+    5   "referer" URI
+    6   mimetype (as reported?)
+    7   worker thread
+    8   full timestamp (start of network fetch; this is dt?)
+    9   SHA1
+    10  source tag
+    11  annotations
+    12  partial CDX JSON
+
+### External Prep for, Eg, Unpaywall Crawl
+
+    export LC_ALL=C
+    sort -S 8G -u seedlist.shard > seedlist.shard.sorted
+
+    zcat unpaywall_20180621.pdf_meta.tsv.gz | awk '{print $2 "\t" $1}' | sort -S 8G -u > unpaywall_20180621.seed_id.tsv
+
+    join -t $'\t' unpaywall_20180621.seed_id.tsv unpaywall_crawl_patch_seedlist.split_3.schedule.sorted > seed_id.shard.tsv
+
+TODO: why don't these sum/match correctly?
+
+    bnewbold@orithena$ wc -l seed_id.shard.tsv unpaywall_crawl_patch_seedlist.split_3.schedule.sorted
+    880737 seed_id.shard.tsv
+    929459 unpaywall_crawl_patch_seedlist.split_3.schedule.sorted
+
+    why is:
+    http://00ec89c.netsolhost.com/brochures/200605_JAWMA_Hg_Paper_Lee_Hastings.pdf
+    in unpaywall_crawl_patch_seedlist, but not unpaywall_20180621.pdf_meta?
+
+    # Can't even filter on HTTP 200, because revisits are '-'
+    #zcat UNPAYWALL-PDF-CRAWL-2018-07.cdx.gz | rg 'wbgrp-svc282' | rg ' 200 ' | rg '(pdf)|(revisit)' > UNPAYWALL-PDF-CRAWL-2018-07.svc282.filtered.cdx
+
+    zcat UNPAYWALL-PDF-CRAWL-2018-07.cdx.gz | rg 'UNPAYWALL-PDF-CRAWL-2018-07-PATCH' | rg 'wbgrp-svc282' | rg '(pdf)|( warc/revisit )|(postscript)|( unk )' > UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx
+
+TODO: spaces in URLs, like 'https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf'
+
+### Revisit Notes
+
+Neither CDX nor crawl logs seem to have revisits actually point to final
+content, they just point to the revisit record in the (crawl-local) WARC.
+
+### sqlite3 stats
+
+    select count(*) from crawl_result;
+
+    select count(*) from crawl_result where identifier is null;
+
+    select breadcrumbs, count(*) from crawl_result group by breadcrumbs;
+
+    select final_was_dedupe, count(*) from crawl_result group by final_was_dedupe;
+
+    select final_http_status, count(*) from crawl_result group by final_http_status;
+
+    select final_mimetype, count(*) from crawl_result group by final_mimetype;
+
+    select * from crawl_result where final_mimetype = 'text/html' and final_http_status = '200' order by random() limit 5;
+
+    select count(*) from crawl_result where final_uri like 'https://academic.oup.com/Govern%';
+
+    select count(distinct identifier) from crawl_result where final_sha1 is not null;
+
+### testing shard notes
+
+880737  `seed_id` lines
+21776   breadcrumbs are null (no crawl logs line); mostly normalized URLs?
+24985   "first" URIs with no identifier; mostly normalized URLs?
+
+backward: Counter({'skip-cdx-scope': 807248, 'inserted': 370309, 'skip-map-scope': 2913})
+forward (dirty): Counter({'inserted': 509242, 'existing-id-updated': 347218, 'map-uri-missing': 15556, 'existing-complete': 8721, '_normalized-seed-uri': 5520})
+
+874131 identifier is not null
+881551 breadcrumbs is not null
+376057 final_mimetype is application/pdf
+370309 final_sha1 is not null
+332931 application/pdf in UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx
+
+summary:
+    370309/874131 42% got a PDF
+    264331/874131 30% some domain dead-end
+        196747/874131 23% onlinelibrary.wiley.com
+        33879/874131   4% www.nature.com
+        11074/874131   1% www.tandfonline.com
+    125883/874131 14% blocked, 404, other crawl failures
+            select count(*) from crawl_result where final_http_status >= '400' or final_http_status < '200';
+    121028/874131 14% HTTP 200, but not pdf
+        105317/874131 12% academic.oup.com; all rate-limited or cookie fail
+    15596/874131  1.7% didn't even try crawling (null final status)
+
+TODO:
+- add "success" flag (instead of "final_sha1 is null")
+- 
+
+    http://oriental-world.org.ua/sites/default/files/Archive/2017/3/4.pdf   10.15407/orientw2017.03.021 -       http://oriental-world.org.ua/sites/default/files/Archive/2017/3/4.pdf   403     ¤       application/pdf 0       ¤
+
+Iterated:
+
+./arabesque.py backward UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx map.sqlite out.sqlite
+Counter({'skip-cdx-scope': 813760, 'inserted': 370435, 'skip-map-scope': 4620, 'skip-tiny-octetstream-': 30})
+
+./arabesque.py forward unpaywall_20180621.seed_id.shard.tsv map.sqlite out.sqlite
+Counter({'inserted': 523594, 'existing-id-updated': 350009, '_normalized-seed-uri': 21371, 'existing-complete': 6638, 'map-uri-missing': 496})
+
+894029 breadcrumbs is not null
+874102 identifier is not null
+20423 identifier is null
+496 breadcrumbs is null
+370435 final_sha1 is not null
+
+### URL/seed non-match issues!
+
+Easily fixable:
+- capitalization of domains
+- empty port number, like `http://genesis.mi.ras.ru:/~razborov/hadamard.ps`
+
+Encodable:
+- URL encoding
+    http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11[1].pdf
+    http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11%5B1%5D.pdf
+- whitespace in URL (should be url-encoded)
+    https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf
+    https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A%EF%BF%BD%EF%BF%BDcase.pdf
+- tricky hidden unicode
+    http://goldhorde.ru/wp-content/uploads/2017/03/ЗО-1-2017-206-212.pdf
+    http://goldhorde.ru/wp-content/uploads/2017/03/%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD-1-2017-206-212.pdf
+
+Harder/Custom?
+- paths including "/../" or "/./" are collapsed
+- port number 80, like `http://fermet.misis.ru:80/jour/article/download/724/700`
+- aos2.uniba.it:8080papers
+
+- fragments stripped by crawler: 'https://www.termedia.pl/Journal/-85/pdf-27083-10?filename=BTA#415-06-str307-316.pdf'
+
+### Debugging "redirect terminal" issue
+
+Some are redirect loops; fine.
+
+Some are from 'cookieSet=1' redirects, like 'http://journals.sagepub.com/doi/pdf/10.1177/105971230601400206?cookieSet=1'. This comes through like:
+
+    sqlite> select * from crawl_result where initial_uri = 'http://adb.sagepub.com/cgi/reprint/14/2/147.pdf';
+    initial_uri     identifier      breadcrumbs     final_uri       final_http_status       final_sha1      final_mimetype  final_was_dedupe        final_cdx
+    http://adb.sagepub.com/cgi/reprint/14/2/147.pdf 10.1177/105971230601400206 R       http://journals.sagepub.com/doi/pdf/10.1177/105971230601400206  302     ¤       text/html       0       ¤
+
+Using 'http' (note: this is not an OA article):
+
+    http://adb.sagepub.com/cgi/reprint/14/2/147.pdf
+    https://journals.sagepub.com/doi/pdf/10.1177/105971230601400206
+    https://journals.sagepub.com/doi/pdf/10.1177/105971230601400206?cookieSet=1
+    http://journals.sagepub.com/action/cookieAbsent
+
+Is heritrix refusing to do that second redirect? In some cases it will do at
+leat the first, like:
+    
+    http://pubs.rsna.org/doi/pdf/10.1148/radiographics.11.1.1996385
+    http://pubs.rsna.org/doi/pdf/10.1148/radiographics.11.1.1996385?cookieSet=1
+    http://pubs.rsna.org/action/cookieAbsent
+
+I think the vast majority of redirect terminals are when we redirect to a page
+that has already been crawled. This is a bummer because we can't find the
+redirect target in the logs.
+
+Eg, academic.oup.com sometimes redirects to cookieSet, then cookieAbsent; other
+times it redirects to Governer. It's important to distinguish between these.
+
+### Scratch
+
+What are actual advantages/use-cases of CDX mode?
+=> easier CDX-to-WARC output mode
+=> sending CDX along with WARCs as an index
+
+Interested in scale-up behavior: full unpaywall PDF crawls, and/or full DOI landing crawls
+=> eatmydata
+dentifier is not null
+
+
+    zcat UNPAYWALL-PDF-CRAWL-2018-07-PATCH* | time /fast/scratch/unpaywall/arabesque.py referrer - UNPAYWALL-PDF-CRAWL-2018-07-PATCH.map.sqlite
+    [snip]
+    ... referrer 5542000
+    Referrer map complete.
+    317.87user 274.57system 21:20.22elapsed 46%CPU (0avgtext+0avgdata 22992maxresident)k
+    24inputs+155168464outputs (0major+802114minor)pagefaults 0swaps
+
+    bnewbold@ia601101$ ls -lathr
+    -rw-r--r-- 1 bnewbold bnewbold 1.7G Dec 12 12:33 UNPAYWALL-PDF-CRAWL-2018-07-PATCH.map.sqlite
+
+Scaling!
+
+    16,736,800 UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.us.archive.org.crawl.log
+    17,215,895 unpaywall_20180621.seed_id.tsv
+
+Oops; need to shard the seed_id file.
+
+Ugh, this one is a little derp because I didn't sort correctly. Let's say close enough though...
+
+    4318674 unpaywall_crawl_seedlist.svc282.tsv
+    3901403 UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.seed_id.tsv
+
+
+/fast/scratch/unpaywall/arabesque.py everything CORE-UPSTREAM-CRAWL-2018-11.combined.log core_2018-03-01_metadata.seed_id.tsv CORE-UPSTREAM-CRAWL-2018-11.out.sqlite
+
+    Counter({'inserted': 3226191, 'skip-log-scope': 2811395, 'skip-log-prereq': 108932, 'skip-tiny-octetstream-': 855, 'skip-map-scope': 2})
+    Counter({'existing-id-updated': 3221984, 'inserted': 809994, 'existing-complete': 228909, '_normalized-seed-uri': 17287, 'map-uri-missing': 2511, '_redirect-recursion-limit': 221, 'skip-bad-seed-uri': 17})
+
+time /fast/scratch/unpaywall/arabesque.py everything UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.us.archive.org.crawl.log UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.seed_id.tsv UNPAYWALL-PDF-CRAWL-2018-07.out.sqlite
+
+    Everything complete!
+    Counter({'skip-log-scope': 13476816, 'inserted': 2536452, 'skip-log-prereq': 682460, 'skip-tiny-octetstream-': 41067})
+    Counter({'existing-id-updated': 1652463, 'map-uri-missing': 1245789, 'inserted': 608802, 'existing-complete': 394349, '_normalized-seed-uri': 22573, '_redirect-recursion-limit': 157})
+
+    real    63m42.124s
+    user    53m31.007s
+    sys     6m50.535s
+
+### Performance
+
+Before tweaks:
+
+    real    2m55.975s
+    user    2m6.772s
+    sys     0m12.684s
+
+After:
+
+    real    1m51.500s
+    user    1m44.600s
+    sys     0m3.496s
+
diff --git a/arabesque.py b/arabesque.py
new file mode 100755
index 0000000..a4b3e07
--- /dev/null
+++ b/arabesque.py
@@ -0,0 +1,665 @@
+#!/usr/bin/env python3
+
+"""
+This is a multi-function script for generating a particular type of crawl
+report output: a table of identifiers, seed URLs, and crawl results, taking in
+to account long redirect chains.
+
+Commands/modes:
+- referrer <input.log> <output-map.sqlite>
+- backward_cdx <input.cdx> <input-map.sqlite> <output.sqlite>
+- backward <input.log> <input-map.sqlite> <output.sqlite>
+- forward <input.seed_identifiers> <output.sqlite>
+- everything <input.log> <input.cdx> <input.seed_identifiers> <output.sqlite>
+
+Design docs in README_pdf_crawl.md
+
+TODO:
+- open map in read-only when appropriate
+- some kind of stats dump command? (querying sqlite)
+- should referrer map be UNIQ?
+- forward outputs get generated multiple times?
+- try: https://pypi.org/project/urlcanon/
+- https://www.talisman.org/~erlkonig/misc/lunatech%5Ewhat-every-webdev-must-know-about-url-encoding/
+
+BUG:
+BAD LOG LINE: 2018-07-27T12:26:24.783Z   200         24 http://www.phywe-es.com/robots.txt 15+LREELLLLLRLELLRLLLRLLLLRLELRRLLLLLLLLLLLLLLLLLLLLP http://www.phywe-es.com/index.php/fuseaction/download/lrn_file/versuchsanleitungen/P2522015/tr/P2522015.pdf text/html #296 20180727122622741+438 sha1:YR6M6GSJYJGMLBBEGCVHLRZO6SISSJAS - unsatisfiableCharsetInHeader:ISO 8859-1 {"contentSize":254,"warcFilename":"UNPAYWALL-PDF-CRAWL-2018-07-20180727122113315-14533-11460~wbgrp-svc282.us.archive.org~8443.warc.gz","warcFileOffset":126308355}
+"""
+
+import sys
+import json
+import time
+import urllib
+import urllib3
+import sqlite3
+import argparse
+import collections
+
+CrawlLine = collections.namedtuple('CrawlLine', [
+    'log_time',
+    'status_code',
+    'size_bytes',
+    'url',
+    'breadcrumbs',
+    'referrer_url', 
+    'mimetype',
+    'worker_thread',
+    'timestamp',
+    'sha1',
+    'source_tag',
+    'annotations',
+    'cdx_json'])
+
+FullCdxLine = collections.namedtuple('FullCdxLine', [
+    'surt',
+    'datetime',
+    'url',
+    'mimetype',
+    'status_code', # will be '-' for warc/revist
+    'sha1', 
+    'unused1',
+    'unused2',
+    'c_size',
+    'offset',
+    'warc'])
+
+ReferrerRow = collections.namedtuple('ReferrerRow', [
+    'url',
+    'referrer_url',
+    'status_code',
+    'breadcrumbs',
+    'mimetype',
+    'is_dedupe'])
+
+NORMAL_MIMETYPE = (
+    'application/pdf',
+    'application/postscript',
+    'text/html',
+    'text/xml',
+    'warc/revisit',
+    'application/octet-stream',
+)
+
+FULLTEXT_MIMETYPES = (
+    "application/pdf",
+    "application/postscript",
+    "application/octet-stream",
+)
+
+def normalize_mimetype(raw):
+    raw = raw.lower()
+    raw = raw.replace('"', '').replace("'", '').replace(',', '')
+    for norm in NORMAL_MIMETYPE:
+        if raw.startswith(norm):
+            return norm
+
+    # Special cases
+    if raw.startswith('application/xml'):
+        return 'text/xml'
+    if raw.startswith('application/x-pdf'):
+        return 'application/pdf'
+    if raw in ('unk', 'unknown', 'other'):
+        return 'application/octet-stream'
+    return raw
+
+def normalize_url(raw):
+    """
+    This is a surprisingly complex function that cleans up URLs.
+
+    Particular issues it fixes:
+    - lower-cases scheme and host/domain
+    - removes blank or redundant port numbers
+    - removes fragments (anchor tags)
+    - URL escapes characters in path, particularly whitespace
+
+    Some of these we maybe shouldn't do, but heritrix does by default
+
+    TODO: heritrix removes fragments, but maybe we shouldn't?
+    """
+    try:
+        u = urllib3.util.parse_url(raw.strip())
+    except:
+        return None
+
+    if u.path is None:
+        return None
+    port = u.port
+    if (port == 80 and u.scheme == 'http') or (port == 443 and u.scheme == 'https'):
+        port = None
+
+    # "Dot Segment Removal" per RFC 3986
+    # via https://stackoverflow.com/a/40536710/4682349
+    segments = u.path.split('/')
+    segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
+    resolved = []
+    for segment in segments:
+        if segment in ('../', '..'):
+            if resolved[1:]:
+                resolved.pop()
+        elif segment not in ('./', '.'):
+            resolved.append(segment)
+    path = ''.join(resolved)
+
+    # add any missing slash if there is a query or fragment
+    if (u.query or u.fragment) and not path:
+        path = '/'
+
+    # URL encode the path segment for HTTP URLs
+    if u.scheme in ('http', 'https'):
+        path = urllib.parse.quote(path, safe='/%~+:();$!,')
+    result = urllib3.util.Url(u.scheme, u.auth, u.host.lower(), port, path, u.query, None)
+    return result.url.replace(' ', '%20')
+
+def test_normalize_url():
+    
+    assert (normalize_url('HTTP://ASDF.com/a/../b') == 'http://asdf.com/b')
+    assert (normalize_url('HTTP://ASDF.com/a/./b') == 'http://asdf.com/a/b')
+    assert (normalize_url('HTTP://ASDF.com:/') == 'http://asdf.com/')
+    assert (normalize_url('HTTP://ASDF.com:80/') == 'http://asdf.com/')
+    assert (normalize_url('HTTP://ASDF.com:80papers/123.pdf') == None)
+    assert (normalize_url('HTTP://ASDF.com/a.pdf#123') == 'http://asdf.com/a.pdf')
+    assert (normalize_url('HTTPs://ASDF.com:443/') == 'https://asdf.com/')
+    assert (normalize_url('HTTP://ASDF.com/a/../b') == 'http://asdf.com/b')
+    assert (normalize_url('HTTP://ASDF.com/first second') == 'http://asdf.com/first%20second')
+    assert (normalize_url('HTTP://ASDF.com/first%20second') == 'http://asdf.com/first%20second')
+    assert (normalize_url('Ftp://ASDF.com/a/../b') == 'ftp://asdf.com/b')
+
+    #assert (normalize_url('http://goldhorde.ru/wp-content/uploads/2017/03/ЗО-1-2017-206-212.pdf') ==
+    #    'http://goldhorde.ru/wp-content/uploads/2017/03/%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD-1-2017-206-212.pdf')
+    assert (normalize_url('http://goldhorde.ru/wp-content/uploads/2017/03/ЗО-1-2017-206-212.pdf') ==
+        'http://goldhorde.ru/wp-content/uploads/2017/03/%D0%97%D0%9E-1-2017-206-212.pdf')
+
+    assert (normalize_url('http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11[1].pdf') ==
+        'http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11%5B1%5D.pdf')
+    #assert (normalize_url('https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf') ==
+    #    'https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A%EF%BF%BD%EF%BF%BDcase.pdf')
+    assert (normalize_url('https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf') ==
+        'https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A%20case.pdf')
+    assert (normalize_url('http://mariel.inesc.pt/~lflb/ma98.pdf') ==
+        'http://mariel.inesc.pt/~lflb/ma98.pdf')
+    assert (normalize_url('http://ijpsr.com?action=download_pdf&postid=9952') ==
+        'http://ijpsr.com/?action=download_pdf&postid=9952')
+    assert (normalize_url('http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-0518(199702)35:3<587::AID-POLA25>3.0.CO;2-J/pdf') ==
+        'http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-0518(199702)35:3%3C587::AID-POLA25%3E3.0.CO;2-J/pdf')
+    assert (normalize_url('http://ntj.tax.org/wwtax/ntjrec.nsf/175d710dffc186a385256a31007cb40f/5e1815e49ceb7d318525796800526cf8/$FILE/A04_Cole.pdf') ==
+        'http://ntj.tax.org/wwtax/ntjrec.nsf/175d710dffc186a385256a31007cb40f/5e1815e49ceb7d318525796800526cf8/$FILE/A04_Cole.pdf')
+    assert (normalize_url('http://www.nature.com:80/') ==
+        'http://www.nature.com/')
+    assert (normalize_url('http://www.nature.com:80/polopoly_fs/1.22367!/menu/main/topColumns/topLeftColumn/pdf/547389a.pdf') ==
+        'http://www.nature.com/polopoly_fs/1.22367!/menu/main/topColumns/topLeftColumn/pdf/547389a.pdf')
+    assert (normalize_url('http://pdfs.journals.lww.com/transplantjournal/2012/11271/Extra_Pulmonary_Nocardiosis_and_Perigraft_Abscess,.1668.pdf?token=method|ExpireAbsolute;source|Journals;ttl|1503135985283;payload|mY8D3u1TCCsNvP5E421JYK6N6XICDamxByyYpaNzk7FKjTaa1Yz22MivkHZqjGP4kdS2v0J76WGAnHACH69s21Csk0OpQi3YbjEMdSoz2UhVybFqQxA7lKwSUlA502zQZr96TQRwhVlocEp/sJ586aVbcBFlltKNKo+tbuMfL73hiPqJliudqs17cHeLcLbV/CqjlP3IO0jGHlHQtJWcICDdAyGJMnpi6RlbEJaRheGeh5z5uvqz3FLHgPKVXJzdGlb2qsojlvlytk14LkMXSI/t5I2LVgySZVyHeaTj/dJdRvauPu3j5lsX4K1l3siV;hash|9tFBJUOSJ1hYPXrgBby2Xg==') ==
+        'http://pdfs.journals.lww.com/transplantjournal/2012/11271/Extra_Pulmonary_Nocardiosis_and_Perigraft_Abscess,.1668.pdf?token=method|ExpireAbsolute;source|Journals;ttl|1503135985283;payload|mY8D3u1TCCsNvP5E421JYK6N6XICDamxByyYpaNzk7FKjTaa1Yz22MivkHZqjGP4kdS2v0J76WGAnHACH69s21Csk0OpQi3YbjEMdSoz2UhVybFqQxA7lKwSUlA502zQZr96TQRwhVlocEp/sJ586aVbcBFlltKNKo+tbuMfL73hiPqJliudqs17cHeLcLbV/CqjlP3IO0jGHlHQtJWcICDdAyGJMnpi6RlbEJaRheGeh5z5uvqz3FLHgPKVXJzdGlb2qsojlvlytk14LkMXSI/t5I2LVgySZVyHeaTj/dJdRvauPu3j5lsX4K1l3siV;hash|9tFBJUOSJ1hYPXrgBby2Xg==')
+
+def parse_crawl_line(line):
+    # yup, it's just whitespace, and yup, there's a JSON blob at the end that
+    # "hopefully" contains no whitespace
+    line = line.strip().split()
+    if len(line) != 13:
+        return None
+    # FTP success; need to munge mimetype
+    if line[3].startswith('ftp://') and line[1] == "226" and line[6] == "application/octet-stream":
+        if line[3].lower().endswith('.pdf'):
+            line[6] = "application/pdf"
+        elif line[3].lower().endswith('.ps'):
+            line[6] = "application/postscript"
+
+    # mimetype
+    line[6] = normalize_mimetype(line[6])
+    # SHA1
+    line[9] = line[9].replace('sha1:', '')
+    return CrawlLine(*line)
+
+def parse_full_cdx_line(line):
+    line = line.strip().split(' ')
+    assert len(line) == 11
+    # mimetype
+    line[3] = normalize_mimetype(line[3])
+    return FullCdxLine(*line)
+
+def lookup_referrer_row(cursor, url):
+    #print("Lookup: {}".format(cdx.url))
+    raw = list(cursor.execute('SELECT * from referrer WHERE url=? LIMIT 1', [url]))
+    if not raw:
+        return None
+    raw = list(raw[0])
+    if not raw[1] or raw[1] == '-':
+        raw[1] = None
+    return ReferrerRow(*raw)
+
+def lookup_all_referred_rows(cursor, url):
+    #print("Lookup: {}".format(cdx.url))
+    # TODO: should this SORT BY?
+    result = list(cursor.execute('SELECT * from referrer WHERE referrer=?', [url]))
+    if not result:
+        return None
+    for i in range(len(result)):
+        raw = list(result[i])
+        if not raw[1] or raw[1] == '-':
+            raw[1] = None
+        result[i] = ReferrerRow(*raw)
+    return result
+
+def create_out_table(db):
+    # "eat my data" style database, for speed
+    # NOTE: don't drop indexes here, because we often reuse DB
+    db.executescript("""
+        PRAGMA main.page_size = 4096;
+        PRAGMA main.cache_size = 20000;
+        PRAGMA main.locking_mode = EXCLUSIVE;
+        PRAGMA main.synchronous = OFF;
+        PRAGMA main.journal_mode = MEMORY;
+
+        CREATE TABLE IF NOT EXISTS crawl_result
+            (initial_url text NOT NULL,
+             identifier text,
+             initial_domain text,
+             breadcrumbs text,
+             final_url text,
+             final_domain text text,
+             final_timestamp text,
+             final_status_code text,
+             final_sha1 text,
+             final_mimetype text,
+             final_was_dedupe bool,
+             hit bool);
+    """)
+
+def referrer(log_file, map_db):
+    """
+    TODO: this would probably be simpler, and much faster, as a simple sqlite3 import from TSV
+    """
+    print("Mapping referrers from crawl logs")
+    # "eat my data" style database, for speed
+    map_db.executescript("""
+        PRAGMA main.page_size = 4096;
+        PRAGMA main.cache_size = 20000;
+        PRAGMA main.locking_mode = EXCLUSIVE;
+        PRAGMA main.synchronous = OFF;
+        PRAGMA main.journal_mode = MEMORY;
+
+        CREATE TABLE IF NOT EXISTS referrer
+                 (url text,
+                  referrer text,
+                  status_code text,
+                  breadcrumbs text,
+                  mimetype text,
+                  is_dedupe bool);
+        DROP INDEX IF EXISTS referrer_url;
+        DROP INDEX IF EXISTS referrer_referrer;
+    """)
+    c = map_db.cursor()
+    i = 0
+    for raw in log_file:
+        line = parse_crawl_line(raw)
+        if not line:
+            print("BAD LOG LINE: {}".format(raw.strip()))
+            continue
+        if line.url.startswith('dns:') or line.url.startswith('whois:'):
+            #print("skipping: {}".format(line.url))
+            continue
+        is_dedupe = 'duplicate:digest' in line.annotations
+        # insert {url, referrer, status_code, breadcrumbs, mimetype, is_dedupe}
+        c.execute("INSERT INTO referrer VALUES (?,?,?,?,?,?)",
+            (line.url, line.referrer_url, line.status_code, line.breadcrumbs, line.mimetype, is_dedupe))
+        i = i+1
+        if i % 5000 == 0:
+            print("... referrer {}".format(i))
+            map_db.commit()
+
+    map_db.commit()
+    print("Building indices (this can be slow)...")
+    c.executescript("""
+        CREATE INDEX IF NOT EXISTS referrer_url on referrer (url);
+        CREATE INDEX IF NOT EXISTS referrer_referrer on referrer (referrer);
+    """)
+    c.close()
+    print("Referrer map complete.")
+
+def backward_cdx(cdx_file, map_db, output_db, hit_mimetypes=FULLTEXT_MIMETYPES):
+    """
+    TODO: Hrm, just realized we don't usually have CDX files on a per-machine
+    basis.  Need to call this with a shard string that gets parsed from the
+    WARC field of the full CDX for the whole crawl? Oh well. Large file, but
+    can be shared, and filter should be fast (can optimize with grep as well).
+    """
+    print("Mapping backward from CDX 200/226 to initial urls")
+    counts = collections.Counter({'inserted': 0})
+    m = map_db.cursor()
+    create_out_table(output_db)
+    c = output_db.cursor()
+    i = 0
+    for raw_cdx in cdx_file:
+        if raw_cdx.startswith('CDX') or raw_cdx.startswith(' '):
+            counts['skip-cdx-raw'] += 1
+            continue
+        cdx = parse_full_cdx_line(raw_cdx)
+
+        if not ((cdx.status_code in ("200", "226") and cdx.mimetype in hit_mimetypes)
+                or (cdx.mimetype == "warc/revisit")):
+            counts['skip-cdx-scope'] += 1
+            continue
+
+        if cdx.mimetype == "application/octet-stream" and line.size_bytes and line.size_bytes != '-' and int(line.size_bytes) < 1000:
+            counts['skip-tiny-octetstream-'] += 1
+            continue
+
+        #print(time.time())
+        final_row = lookup_referrer_row(m, cdx.url)
+        #print(time.time())
+        if not final_row:
+            print("MISSING url: {}".format(raw_cdx.strip()))
+            counts['map-url-missing'] += 1
+            continue
+        if not (final_row.status_code in ("200", "226") and final_row.mimetype in hit_mimetypes):
+            counts['skip-map-scope'] += 1
+            continue
+        row = final_row
+        while row and row.referrer_url != None:
+            next_row = lookup_referrer_row(m, row.referrer_url)
+            if next_row:
+                row = next_row
+            else:
+                break
+   
+        initial_domain = urllib3.util.parse_url(row.url).host
+        final_domain = urllib3.util.parse_url(final_row.url).host
+        c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
+            (row.url, None, initial_domain, final_row.breadcrumbs, final_row.url, final_domain, cdx.timestamp, final_row.status_code, cdx.sha1, final_row.mimetype, final_row.is_dedupe, True))
+        #print(final_row.breadcrumbs)
+        i = i+1
+        counts['inserted'] += 1
+        if i % 2000 == 0:
+            print("... backward {}".format(i))
+            output_db.commit()
+
+    output_db.commit()
+    print("Building indices (this can be slow)...")
+    c.executescript("""
+        CREATE INDEX IF NOT EXISTS result_initial_url on crawl_result (initial_url);
+        CREATE INDEX IF NOT EXISTS result_identifier on crawl_result (identifier);
+    """)
+    c.close()
+    m.close()
+    print("Backward map complete.")
+    print(counts)
+    return counts
+
+def backward(log_file, map_db, output_db, hit_mimetypes=FULLTEXT_MIMETYPES):
+    """
+    This is a variant of backward_cdx that uses the log files, not CDX file
+    """
+    print("Mapping backward from log file 200s to initial urls")
+    counts = collections.Counter({'inserted': 0})
+    m = map_db.cursor()
+    create_out_table(output_db)
+    c = output_db.cursor()
+    i = 0
+    for raw in log_file:
+        line = parse_crawl_line(raw)
+        if not line:
+            print("BAD LOG LINE: {}".format(raw.strip()))
+            continue
+        if line.url.startswith('dns:') or line.url.startswith('whois:'):
+            counts['skip-log-prereq'] += 1
+            continue
+        is_dedupe = 'duplicate:digest' in line.annotations
+
+        if not (line.status_code in ("200", "226") and line.mimetype in hit_mimetypes):
+            counts['skip-log-scope'] += 1
+            continue
+
+        if line.mimetype == "application/octet-stream" and int(line.size_bytes) < 1000:
+            counts['skip-tiny-octetstream-'] += 1
+            continue
+
+        #print(time.time())
+        final_row = lookup_referrer_row(m, line.url)
+        #print(time.time())
+        if not final_row:
+            print("MISSING url: {}".format(raw.strip()))
+            counts['map-url-missing'] += 1
+            continue
+        if not (final_row.status_code in ("200", "226") and final_row.mimetype in hit_mimetypes):
+            counts['skip-map-scope'] += 1
+            continue
+        row = final_row
+        while row and row.referrer_url != None:
+            next_row = lookup_referrer_row(m, row.referrer_url)
+            if next_row:
+                row = next_row
+            else:
+                break
+   
+        initial_domain = urllib3.util.parse_url(row.url).host
+        final_domain = urllib3.util.parse_url(final_row.url).host
+        # convert to IA CDX timestamp format
+        #final_timestamp = dateutil.parser.parse(line.timestamp).strftime("%Y%m%d%H%M%S")
+        final_timestamp = None
+        if len(line.timestamp) >= 12 and line.timestamp[4] != '-':
+            final_timestamp = line.timestamp[:12]
+        c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
+            (row.url, None, initial_domain, final_row.breadcrumbs, final_row.url, final_domain, final_timestamp, final_row.status_code, line.sha1, final_row.mimetype, final_row.is_dedupe, True))
+        #print(final_row.breadcrumbs)
+        i = i+1
+        counts['inserted'] += 1
+        if i % 2000 == 0:
+            print("... backward {}".format(i))
+            output_db.commit()
+
+    output_db.commit()
+    m.close()
+    print("Building indices (this can be slow)...")
+    c.executescript("""
+        CREATE INDEX IF NOT EXISTS result_initial_url on crawl_result (initial_url);
+        CREATE INDEX IF NOT EXISTS result_identifier on crawl_result (identifier);
+    """)
+    c.close()
+    print("Backward map complete.")
+    print(counts)
+    return counts
+
+def forward(seed_id_file, map_db, output_db):
+    print("Mapping forwards from seedlist to terminal urls")
+    counts = collections.Counter({'inserted': 0})
+    m = map_db.cursor()
+    create_out_table(output_db)
+    c = output_db.cursor()
+
+    i = 0
+    for raw_line in seed_id_file:
+        line = raw_line.split('\t')
+        if not line:
+            counts['skip-raw-line'] += 1
+            continue
+        if len(line) == 1:
+            seed_url, identifier = line[0], None
+        elif len(line) == 2:
+            seed_url, identifier = line[0:2]
+        else:
+            print("WEIRD: {}".format(raw_line))
+            assert len(line) <= 2
+        raw_url = seed_url
+        seed_url = normalize_url(seed_url)
+        if not seed_url:
+            counts['skip-bad-seed-url'] += 1
+            continue
+        if raw_url != seed_url:
+            counts['_normalized-seed-url'] += 1
+
+        # first check if entry already in output table; if so, only upsert with identifier
+        existing_row = list(c.execute('SELECT identifier, breadcrumbs from crawl_result WHERE initial_url=? LIMIT 1', [seed_url]))
+        if existing_row:
+            if not existing_row[0][0]:
+                # identifier hasn't been updated
+                c.execute('UPDATE crawl_result SET identifier=? WHERE initial_url=?', [identifier, seed_url])
+                counts['existing-id-updated'] += 1
+                continue
+            else:
+                counts['existing-complete'] += 1
+                continue
+
+        # if not, then do a "forward" lookup for the "best"/"final" terminal crawl line
+        # simple for redirect case (no branching); arbitrary for the fan-out case
+        first_row = lookup_referrer_row(m, seed_url)
+        if not first_row:
+            print("MISSING url: {}".format(raw_line.strip()))
+            # need to insert *something* in this case...
+            initial_domain = urllib3.util.parse_url(seed_url).host
+            c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
+                (seed_url, identifier, initial_domain, None, None, None, None, None, None, None, None, False))
+            counts['map-url-missing'] += 1
+            continue
+        row = first_row
+        # recursively iterate down referal path
+        limit = 40
+        while True:
+            limit = limit - 1
+            if limit <= 0:
+                counts['_redirect-recursion-limit'] += 1
+                break
+            next_rows = lookup_all_referred_rows(m, row.url)
+            if not next_rows:
+                # halt if we hit a dead end
+                break
+
+            # there are going to be multiple referrer hits, need to chose among... based on status?
+            updated = False
+            for potential in next_rows:
+                if ('E' in potential.breadcrumbs or 'X' in potential.breadcrumbs or 'I' in potential.breadcrumbs) and not 'pdf' in potential.mimetype:
+                    # TODO: still PDF-specific
+                    # don't consider simple embeds unless PDF
+                    continue
+                row = potential
+                updated = True
+            if not updated:
+                break
+
+        final_row = row
+        initial_domain = urllib3.util.parse_url(seed_url).host
+        final_domain = urllib3.util.parse_url(final_row.url).host
+        # TODO: would pass SHA1 here if we had it? but not stored in referrer table
+        # XXX: None => timestamp
+        c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
+            (seed_url, identifier, initial_domain, final_row.breadcrumbs, final_row.url, final_domain, None, final_row.status_code, None, final_row.mimetype, final_row.is_dedupe, False))
+        #print(final_row.breadcrumbs)
+        i = i+1
+        counts['inserted'] += 1
+        if i % 2000 == 0:
+            print("... forward {}".format(i))
+            output_db.commit()
+
+    output_db.commit()
+    m.close()
+    print("Building indices (this can be slow)...")
+    c.executescript("""
+        CREATE INDEX IF NOT EXISTS result_initial_url on crawl_result (initial_url);
+        CREATE INDEX IF NOT EXISTS result_identifier on crawl_result (identifier);
+    """)
+    c.close()
+    print("Forward map complete.")
+    print(counts)
+    return counts
+
+def everything(log_file, seed_id_file, map_db, output_db, hit_mimetypes=FULLTEXT_MIMETYPES):
+    referrer(open(log_file, 'r'), map_db)
+    bcounts = backward(open(log_file, 'r'), map_db, output_db, hit_mimetypes=hit_mimetypes)
+    fcounts = forward(seed_id_file, map_db, output_db)
+    print()
+    print("Everything complete!")
+    print(bcounts)
+    print(fcounts)
+
+def main():
+    parser = argparse.ArgumentParser()
+    subparsers = parser.add_subparsers()
+
+    sub_referrer = subparsers.add_parser('referrer')
+    sub_referrer.set_defaults(func=referrer)
+    sub_referrer.add_argument("log_file",
+        default=sys.stdin, type=argparse.FileType('rt'))
+    sub_referrer.add_argument("map_db_file",
+        type=str)
+
+    sub_backward_cdx = subparsers.add_parser('backward_cdx')
+    sub_backward_cdx.set_defaults(func=backward_cdx)
+    sub_backward_cdx.add_argument("cdx_file",
+        default=sys.stdin, type=argparse.FileType('rt'))
+    sub_backward_cdx.add_argument("map_db_file",
+        type=str)
+    sub_backward_cdx.add_argument("output_db_file",
+        type=str)
+
+    sub_backward = subparsers.add_parser('backward')
+    sub_backward.set_defaults(func=backward)
+    sub_backward.add_argument("log_file",
+        default=sys.stdin, type=argparse.FileType('rt'))
+    sub_backward.add_argument("map_db_file",
+        type=str)
+    sub_backward.add_argument("output_db_file",
+        type=str)
+
+    sub_forward = subparsers.add_parser('forward')
+    sub_forward.set_defaults(func=forward)
+    sub_forward.add_argument("seed_id_file",
+        default=sys.stdin, type=argparse.FileType('rt'))
+    sub_forward.add_argument("map_db_file",
+        type=str)
+    sub_forward.add_argument("output_db_file",
+        type=str)
+
+    sub_everything = subparsers.add_parser('everything')
+    sub_everything.set_defaults(func=everything)
+    sub_everything.add_argument("log_file",
+        type=str)
+    sub_everything.add_argument("seed_id_file",
+        default=sys.stdin, type=argparse.FileType('rt'))
+    sub_everything.add_argument("output_db_file",
+        type=str)
+    sub_everything.add_argument("--map_db_file",
+        default=":memory:", type=str)
+
+    parser.add_argument("--html-hit",
+        action="store_true",
+        help="run in mode that considers only terminal HTML success")
+
+    args = parser.parse_args()
+    if not args.__dict__.get("func"):
+        print("tell me what to do! (try --help)")
+        sys.exit(-1)
+
+    if args.html_hit:
+        hit_mimetypes = (
+            "text/html",
+        )
+    else:
+        hit_mimetypes = FULLTEXT_MIMETYPES
+
+    if args.func is referrer:
+        referrer(args.log_file,
+                 sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE'))
+    elif args.func is backward_cdx:
+        backward_cdx(args.cdx_file,
+                 sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE'),
+                 sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE'),
+                 hit_mimetypes=hit_mimetypes)
+    elif args.func is backward:
+        backward(args.log_file,
+                 sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE'),
+                 sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE'),
+                 hit_mimetypes=hit_mimetypes)
+    elif args.func is forward:
+        forward(args.seed_id_file,
+                sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE'),
+                sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE'))
+    elif args.func is everything:
+        everything(args.log_file,
+                 args.seed_id_file,
+                 sqlite3.connect(args.map_db_file),
+                 sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE'),
+                 hit_mimetypes=hit_mimetypes)
+    else:
+        raise NotImplementedError
+
+if __name__ == '__main__':
+    main()
+
diff --git a/report_template.md b/report_template.md
new file mode 100644
index 0000000..139598b
--- /dev/null
+++ b/report_template.md
@@ -0,0 +1,108 @@
+
+# Crawl QA Report
+
+This crawl report is auto-generated from a sqlite database file, which should be available/included.
+
+### Seedlist Stats
+
+```sql
+SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT initial_domain) AS domains FROM crawl_result;
+```
+
+FTP seed URLs
+
+```sql
+SELECT COUNT(*) as ftp_urls FROM crawl_result WHERE initial_url LIKE 'ftp://%';
+```
+
+### Successful Hits
+
+```sql
+SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT final_sha1) as unique_sha1 FROM crawl_result WHERE hit=1;
+```
+
+De-duplication percentage (aka, fraction of hits where content had been crawled and identified previously):
+
+```sql
+# AVG() hack!
+SELECT 100. * AVG(final_was_dedupe) as percent FROM crawl_result WHERE hit=1;
+```
+
+Top mimetypes for successful hits (these are usually filtered to a fixed list in post-processing):
+
+```sql
+SELECT final_mimetype, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_mimetype ORDER BY COUNT(*) DESC LIMIT 10;
+```
+
+Most popular breadcrumbs (a measure of how hard the crawler had to work):
+
+```sql
+SELECT breadcrumbs, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY breadcrumbs ORDER BY COUNT(*) DESC LIMIT 10;
+```
+
+FTP vs. HTTP hits (200 is HTTP, 226 is FTP):
+
+```sql
+SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_status_code LIMIT 10;
+```
+
+### Domain Summary
+
+Top *initial* domains:
+
+```sql
+SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result) as percent FROM crawl_result GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
+```
+
+Top *successful, final* domains, where hits were found:
+
+```sql
+
+SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result WHERE hit=1) AS percent  FROM crawl_result WHERE hit=1 GROUP BY initial_domain ORDER BY COUNT(*) DESC LIMIT 20;
+```
+
+Top *non-successful, final* domains where crawl paths terminated before a successful hit (but crawl did run):
+
+```sql
+SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NOT NULL GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
+```
+
+Top *uncrawled, initial* domains, where the crawl didn't even attempt to run:
+
+```sql
+SELECT initial_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NULL GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
+```
+
+Top *blocked, final* domains:
+
+```sql
+SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND (final_status_code='-61' OR final_status_code='-2') GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
+```
+
+Top *rate-limited, final* domains:
+
+```sql
+SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code='429' GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
+```
+
+### Status Summary
+
+Top failure status codes:
+
+```sql
+    SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=0 GROUP BY final_status_code ORDER BY count(*) DESC LIMIT 10;
+```
+
+### Example Results
+
+A handful of random success lines:
+
+```sql
+    SELECT identifier, initial_url, breadcrumbs, final_url, final_sha1, final_mimetype FROM crawl_result WHERE hit=1 ORDER BY random() LIMIT 10;
+```
+
+Handful of random non-success lines:
+
+```sql
+    SELECT identifier, initial_url, breadcrumbs, final_url, final_status_code, final_mimetype FROM crawl_result WHERE hit=0 ORDER BY random() LIMIT 25;
+```
diff --git a/test.sqlite b/test.sqlite
new file mode 100644
index 0000000..a0435e6
--- /dev/null
+++ b/test.sqlite
diff --git a/test.tsv b/test.tsv
new file mode 100644
index 0000000..dbd9620
--- /dev/null
+++ b/test.tsv
@@ -0,0 +1,10 @@
+d6571a951347fb19c55a8d9d30b578e14b55be10
+f76971159f5c35b9c900eba23a757d47afd03fc9
+2fd5cc631fbac0cb6d737e357377ed482235487d
+17a2bdb7ca5aff57b20bdb7b72e893fce00304a0
+d487d844f6b0403113f814cfd6669b5007a371a7
+516144dac67a47bf23c0c9fa8530e95e8093105d
+35bb4705895240d3191e93601a16eb421bec850b
+f16c0eb11deb87f6194df7930652432b483192bc
+cfdeccc0a94df2e50316fbbd31b508eac14c9b15
+ba5d22f94bcf267a88d3f097d7b95f499c025c15
author	Bryan Newbold <bnewbold@archive.org>	2019-02-28 12:01:12 -0800
committer	Bryan Newbold <bnewbold@archive.org>	2019-02-28 12:01:12 -0800
commit	dd50d0bb50549598da2e9ed595a12f154636ae2e (patch)
tree	7220a89aac25fbd77acdd516f8f4845ee3e44852
parent	b5f283913367d28578ad403b850e3a5c2e380887 (diff)
download	arabesque-dd50d0bb50549598da2e9ed595a12f154636ae2e.tar.gz arabesque-dd50d0bb50549598da2e9ed595a12f154636ae2e.zip