aboutsummaryrefslogtreecommitdiffstats
path: root/report_template.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@archive.org>2019-02-28 12:32:28 -0800
committerBryan Newbold <bnewbold@archive.org>2019-02-28 12:32:28 -0800
commitfc4ca558e329d878a430dd5241bf0195e0998f10 (patch)
tree090581ebb0db9eba9de009ba9bcfcdb67a017f4f /report_template.md
parentdd50d0bb50549598da2e9ed595a12f154636ae2e (diff)
downloadarabesque-fc4ca558e329d878a430dd5241bf0195e0998f10.tar.gz
arabesque-fc4ca558e329d878a430dd5241bf0195e0998f10.zip
update examples; create README
Diffstat (limited to 'report_template.md')
-rw-r--r--report_template.md108
1 files changed, 0 insertions, 108 deletions
diff --git a/report_template.md b/report_template.md
deleted file mode 100644
index 139598b..0000000
--- a/report_template.md
+++ /dev/null
@@ -1,108 +0,0 @@
-
-# Crawl QA Report
-
-This crawl report is auto-generated from a sqlite database file, which should be available/included.
-
-### Seedlist Stats
-
-```sql
-SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT initial_domain) AS domains FROM crawl_result;
-```
-
-FTP seed URLs
-
-```sql
-SELECT COUNT(*) as ftp_urls FROM crawl_result WHERE initial_url LIKE 'ftp://%';
-```
-
-### Successful Hits
-
-```sql
-SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT final_sha1) as unique_sha1 FROM crawl_result WHERE hit=1;
-```
-
-De-duplication percentage (aka, fraction of hits where content had been crawled and identified previously):
-
-```sql
-# AVG() hack!
-SELECT 100. * AVG(final_was_dedupe) as percent FROM crawl_result WHERE hit=1;
-```
-
-Top mimetypes for successful hits (these are usually filtered to a fixed list in post-processing):
-
-```sql
-SELECT final_mimetype, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_mimetype ORDER BY COUNT(*) DESC LIMIT 10;
-```
-
-Most popular breadcrumbs (a measure of how hard the crawler had to work):
-
-```sql
-SELECT breadcrumbs, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY breadcrumbs ORDER BY COUNT(*) DESC LIMIT 10;
-```
-
-FTP vs. HTTP hits (200 is HTTP, 226 is FTP):
-
-```sql
-SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_status_code LIMIT 10;
-```
-
-### Domain Summary
-
-Top *initial* domains:
-
-```sql
-SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result) as percent FROM crawl_result GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
-```
-
-Top *successful, final* domains, where hits were found:
-
-```sql
-
-SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result WHERE hit=1) AS percent FROM crawl_result WHERE hit=1 GROUP BY initial_domain ORDER BY COUNT(*) DESC LIMIT 20;
-```
-
-Top *non-successful, final* domains where crawl paths terminated before a successful hit (but crawl did run):
-
-```sql
-SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NOT NULL GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
-```
-
-Top *uncrawled, initial* domains, where the crawl didn't even attempt to run:
-
-```sql
-SELECT initial_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NULL GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
-```
-
-Top *blocked, final* domains:
-
-```sql
-SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND (final_status_code='-61' OR final_status_code='-2') GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
-```
-
-Top *rate-limited, final* domains:
-
-```sql
-SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code='429' GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
-```
-
-### Status Summary
-
-Top failure status codes:
-
-```sql
- SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=0 GROUP BY final_status_code ORDER BY count(*) DESC LIMIT 10;
-```
-
-### Example Results
-
-A handful of random success lines:
-
-```sql
- SELECT identifier, initial_url, breadcrumbs, final_url, final_sha1, final_mimetype FROM crawl_result WHERE hit=1 ORDER BY random() LIMIT 10;
-```
-
-Handful of random non-success lines:
-
-```sql
- SELECT identifier, initial_url, breadcrumbs, final_url, final_status_code, final_mimetype FROM crawl_result WHERE hit=0 ORDER BY random() LIMIT 25;
-```