diff options
author | Bryan Newbold <bnewbold@archive.org> | 2019-02-28 12:32:28 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-02-28 12:32:28 -0800 |
commit | fc4ca558e329d878a430dd5241bf0195e0998f10 (patch) | |
tree | 090581ebb0db9eba9de009ba9bcfcdb67a017f4f /report_template.md | |
parent | dd50d0bb50549598da2e9ed595a12f154636ae2e (diff) | |
download | arabesque-fc4ca558e329d878a430dd5241bf0195e0998f10.tar.gz arabesque-fc4ca558e329d878a430dd5241bf0195e0998f10.zip |
update examples; create README
Diffstat (limited to 'report_template.md')
-rw-r--r-- | report_template.md | 108 |
1 files changed, 0 insertions, 108 deletions
diff --git a/report_template.md b/report_template.md deleted file mode 100644 index 139598b..0000000 --- a/report_template.md +++ /dev/null @@ -1,108 +0,0 @@ - -# Crawl QA Report - -This crawl report is auto-generated from a sqlite database file, which should be available/included. - -### Seedlist Stats - -```sql -SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT initial_domain) AS domains FROM crawl_result; -``` - -FTP seed URLs - -```sql -SELECT COUNT(*) as ftp_urls FROM crawl_result WHERE initial_url LIKE 'ftp://%'; -``` - -### Successful Hits - -```sql -SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT final_sha1) as unique_sha1 FROM crawl_result WHERE hit=1; -``` - -De-duplication percentage (aka, fraction of hits where content had been crawled and identified previously): - -```sql -# AVG() hack! -SELECT 100. * AVG(final_was_dedupe) as percent FROM crawl_result WHERE hit=1; -``` - -Top mimetypes for successful hits (these are usually filtered to a fixed list in post-processing): - -```sql -SELECT final_mimetype, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_mimetype ORDER BY COUNT(*) DESC LIMIT 10; -``` - -Most popular breadcrumbs (a measure of how hard the crawler had to work): - -```sql -SELECT breadcrumbs, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY breadcrumbs ORDER BY COUNT(*) DESC LIMIT 10; -``` - -FTP vs. HTTP hits (200 is HTTP, 226 is FTP): - -```sql -SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_status_code LIMIT 10; -``` - -### Domain Summary - -Top *initial* domains: - -```sql -SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result) as percent FROM crawl_result GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20; -``` - -Top *successful, final* domains, where hits were found: - -```sql - -SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result WHERE hit=1) AS percent FROM crawl_result WHERE hit=1 GROUP BY initial_domain ORDER BY COUNT(*) DESC LIMIT 20; -``` - -Top *non-successful, final* domains where crawl paths terminated before a successful hit (but crawl did run): - -```sql -SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NOT NULL GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20; -``` - -Top *uncrawled, initial* domains, where the crawl didn't even attempt to run: - -```sql -SELECT initial_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NULL GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20; -``` - -Top *blocked, final* domains: - -```sql -SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND (final_status_code='-61' OR final_status_code='-2') GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20; -``` - -Top *rate-limited, final* domains: - -```sql -SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code='429' GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20; -``` - -### Status Summary - -Top failure status codes: - -```sql - SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=0 GROUP BY final_status_code ORDER BY count(*) DESC LIMIT 10; -``` - -### Example Results - -A handful of random success lines: - -```sql - SELECT identifier, initial_url, breadcrumbs, final_url, final_sha1, final_mimetype FROM crawl_result WHERE hit=1 ORDER BY random() LIMIT 10; -``` - -Handful of random non-success lines: - -```sql - SELECT identifier, initial_url, breadcrumbs, final_url, final_status_code, final_mimetype FROM crawl_result WHERE hit=0 ORDER BY random() LIMIT 25; -``` |