 skate/cmd/skate-cdx-lookup/main.go | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/skate/cmd/skate-cdx-lookup/main.go b/skate/cmd/skate-cdx-lookup/main.go
index 2e43b8a..9822c90 100644
--- a/skate/cmd/skate-cdx-lookup/main.go
+++ b/skate/cmd/skate-cdx-lookup/main.go
@@ -1,12 +1,26 @@
-// skate-cdx-lookup is a lookup tool for small and large lists of URLs. We try
-// to read from HDFS in parallel and cache some mapping information locally
-// for fast access.
+// skate-cdx-lookup is a lookup tool for small and large lists of URLs. We try
+// to read from HDFS in parallel and cache some mapping information locally for
+// fast access.
//
-// What we want: Lookup 10-100M URLs and report, whether we have it or not.
-// Also make this a bit more generic, so we can lookup all kinds of things in
-// the CDX index.
+// What we want: Look up 10-100M URLs quickly and report whether the URL is
+// in GWB or not. Also make this a bit more generic, so we can look up other
+// things in the CDX index.
//
-// Alternatives: Spark, Sparkling, Pig, Hive, ...
+// As of 04/2021 the CDX is split into 300 files, each around 230G, for a total
+// of 70T (compressed, maybe 350T plain). Each file comes with a 90M index
+// containing about 1M lines (with surt, offset, ...).
+//
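A minimal sketch (not the actual tool) of how such a per-split index could be used: scan the sorted index lines for the last SURT key that is still <= the target and return its recorded offset, i.e. the block to seek to in the compressed split. The "surt<TAB>offset" line layout and the file name used in main are assumptions for illustration.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// findBlockOffset scans a sorted index file and returns the offset recorded
// on the last line whose SURT key is <= target, i.e. the block where the
// target would live if it is present at all.
func findBlockOffset(indexPath, target string) (int64, error) {
	f, err := os.Open(indexPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	var offset int64 = -1
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), "\t", 3)
		if len(fields) < 2 {
			continue
		}
		if fields[0] > target {
			break // index is sorted, we already passed the candidate block
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			return 0, err
		}
		offset = v
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	if offset < 0 {
		return 0, fmt.Errorf("surt %q sorts before first index entry", target)
	}
	return offset, nil
}

func main() {
	// "cdx-00000.idx" is a hypothetical index file name.
	offset, err := findBlockOffset("cdx-00000.idx", "org,example)/page")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("seek to compressed offset:", offset)
}
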
+// Test run and tiny design:
+//
+// * [ ] accept sorted input only
+// * [ ] get first URL, find the corresponding index file
+//
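A small sketch of the two checklist items above, under assumptions: keys (URLs or SURTs) arrive one per line on stdin and must already be sorted; indexFileFor is a hypothetical stand-in for however a key is mapped to one of the ~300 per-split index files.

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// indexFileFor is a placeholder: the real mapping from a key to its split
// would come from the first/last keys recorded for each index file.
func indexFileFor(key string) string {
	return "cdx-00000.idx" // hypothetical file name
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	var prev string
	var n int
	for scanner.Scan() {
		line := scanner.Text()
		if n > 0 && line < prev {
			log.Fatalf("input not sorted at line %d: %q < %q", n+1, line, prev)
		}
		if n == 0 {
			fmt.Println("first key:", line, "-> index file:", indexFileFor(line))
		}
		prev = line
		n++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
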
+// Raw index; only HTTP 200, or redirect; include everything; random URL from a
+// source; popular URL; hundreds of captures; filter the dump! SURT; huge
+// efficiency; PIG;
+// https://git.archive.org/webgroup/sandcrawler/-/tree/master/pig
+//
+// Alternatives: Spark, Sparkling, Pig, Hive, Java MR, ...
//
// We take advantage of index files and sorted data. The complete dataset is
// 66TB, gzip compressed. We do not need compute to be distributed, as a single