-rw-r--r-- | skate/cmd/skate-cdx-lookup/main.go | 28 |
1 files changed, 21 insertions, 7 deletions
diff --git a/skate/cmd/skate-cdx-lookup/main.go b/skate/cmd/skate-cdx-lookup/main.go
index 2e43b8a..9822c90 100644
--- a/skate/cmd/skate-cdx-lookup/main.go
+++ b/skate/cmd/skate-cdx-lookup/main.go
@@ -1,12 +1,26 @@
-// skate-cdx-lookup is a lookup tool for small and large lists of URLs. We try
-// to read from HDFS in parallel and cache some mapping information locally
-// for fast access.
+// skate-cdx-lookup is a lookup tool for small and large lists of URLs. We try
+// to read from HDFS in parallel and cache some mapping information locally for
+// fast access.
 //
-// What we want: Lookup 10-100M URLs and report, whether we have it or not.
-// Also make this a bit more generic, so we can lookup all kinds of things in
-// the CDX index.
+// What we want: Lookup 10-100M URLs quickly and report, whether the URL is in
+// GWB or not. Also make this a bit more generic, so we can lookup other
+// things in the CDX index.
 //
-// Alternatives: Spark, Sparkling, Pig, Hive, ...
+// As of 04/2021 the CDX is split into 300 files, each around 230G, for a total
+// of 70T (compressed, maybe 350T plain). Each file comes with a 90M index
+// containing about 1M lines (with surt, offset, ...).
+//
+// Test run and tiny design:
+//
+// * [ ] accept sorted input only
+// * [ ] get first URL, find the corresponding index file
+//
+// Raw index; only HTTP 200, or redirect; include everything; random URL from a
+// source; popular URL; hundreds of captures; filter the dump! SURT; huge
+// efficiency; PIG;
+// https://git.archive.org/webgroup/sandcrawler/-/tree/master/pig
+//
+// Alternatives: Spark, Sparkling, Pig, Hive, Java MR, ...
 //
 // We take advantage of index files and sorted data. The complete dataset is
 // 66TB, gzip compressed. We do not need compute to be distrubuted, as a single
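The new comment describes per-file indexes of sorted lines (surt, offset, ...) and a design that accepts sorted input. One way the index lookup could work is a binary search for the last index line whose SURT is less than or equal to the query. The sketch below is only an illustration of that idea and is not taken from the commit: the indexEntry type, the findBlock function, and the sample keys and offsets are hypothetical, and the real index lines carry more fields than shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry is a hypothetical, simplified index line: the first SURT key of
// a block and the byte offset of that block in the CDX file.
type indexEntry struct {
	SURT   string
	Offset int64
}

// findBlock returns the position of the last entry whose SURT is <= key,
// that is, the block that may contain key. It returns -1 if key sorts before
// all entries. The entries must be sorted by SURT.
func findBlock(entries []indexEntry, key string) int {
	// sort.Search finds the first entry with SURT > key; the candidate block
	// is the one right before it.
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].SURT > key
	})
	return i - 1
}

func main() {
	// Made-up sample data, sorted by SURT.
	entries := []indexEntry{
		{"com,example)/", 0},
		{"org,archive)/", 4096},
		{"org,wikipedia)/", 8192},
	}
	for _, key := range []string{"org,archive)/about", "at,example)/"} {
		if i := findBlock(entries, key); i >= 0 {
			fmt.Printf("%s: block at offset %d\n", key, entries[i].Offset)
		} else {
			fmt.Printf("%s: before first block\n", key)
		}
	}
}
```

With about 1M lines per index, each lookup needs roughly 20 comparisons. If the input URLs are sorted as well, consecutive lookups tend to hit the same or a neighboring block, which is presumably why the design asks for sorted input. A key that equals an entry's SURT may also have captures at the end of the preceding block, so a real implementation would likely need to check that block too.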