author | Martin Czygan <martin.czygan@gmail.com> | 2021-04-08 19:20:23 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-04-19 20:29:17 +0200 |
commit | bc85195a5e2b06fdf02c9f946d4e3f109f4f40b4 | |
tree | 2bf4f712809990474416a3e77c0ffa76ceba470a /skate | |
parent | c8f2e93e1ea6542291cf977f0957ed7786f00766 | |
update notes
Diffstat (limited to 'skate')
-rw-r--r-- | skate/cmd/skate-cdx-lookup/main.go | 28 |
1 files changed, 21 insertions, 7 deletions
diff --git a/skate/cmd/skate-cdx-lookup/main.go b/skate/cmd/skate-cdx-lookup/main.go
index 2e43b8a..9822c90 100644
--- a/skate/cmd/skate-cdx-lookup/main.go
+++ b/skate/cmd/skate-cdx-lookup/main.go
@@ -1,12 +1,26 @@
-// skate-cdx-lookup is a lookup tool for small and large lists of URLs. We try
-// to read from HDFS in parallel and cache some mapping information locally
-// for fast access.
+// skate-cdx-lookup is a lookup tool for small and large lists of URLs. We try
+// to read from HDFS in parallel and cache some mapping information locally for
+// fast access.
 //
-// What we want: Lookup 10-100M URLs and report, whether we have it or not.
-// Also make this a bit more generic, so we can lookup all kinds of things in
-// the CDX index.
+// What we want: Lookup 10-100M URLs quickly and report, whether the URL is in
+// GWB or not. Also make this a bit more generic, so we can lookup other
+// things in the CDX index.
 //
-// Alternatives: Spark, Sparkling, Pig, Hive, ...
+// As of 04/2021 the CDX is split into 300 files, each around 230G, for a total
+// of 70T (compressed, maybe 350T plain). Each file comes with a 90M index
+// containing about 1M lines (with surt, offset, ...).
+//
+// Test run and tiny design:
+//
+// * [ ] accept sorted input only
+// * [ ] get first URL, find the corresponding index file
+//
+// Raw index; only HTTP 200, or redirect; include everything; random URL from a
+// source; popular URL; hundreds of captures; filter the dump! SURT; huge
+// efficiency; PIG;
+// https://git.archive.org/webgroup/sandcrawler/-/tree/master/pig
+//
+// Alternatives: Spark, Sparkling, Pig, Hive, Java MR, ...
 //
 // We take advantage of index files and sorted data. The complete dataset is
 // 66TB, gzip compressed. We do not need compute to be distrubuted, as a single