update notes

author: Martin Czygan <martin.czygan@gmail.com> 2021-04-08 19:20:23 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-04-19 20:29:17 +0200
commit: bc85195a5e2b06fdf02c9f946d4e3f109f4f40b4 (patch)
tree: 2bf4f712809990474416a3e77c0ffa76ceba470a /skate/cmd
parent: c8f2e93e1ea6542291cf977f0957ed7786f00766 (diff)
download: refcat-bc85195a5e2b06fdf02c9f946d4e3f109f4f40b4.tar.gz
refcat-bc85195a5e2b06fdf02c9f946d4e3f109f4f40b4.zip
1 files changed, 21 insertions, 7 deletions
diff --git a/skate/cmd/skate-cdx-lookup/main.go b/skate/cmd/skate-cdx-lookup/main.go
index 2e43b8a..9822c90 100644
--- a/skate/cmd/skate-cdx-lookup/main.go
+++ b/skate/cmd/skate-cdx-lookup/main.go
@@ -1,12 +1,26 @@
-// skate-cdx-lookup is a lookup tool for small and large lists of URLs.  We try
-// to read from HDFS in parallel and cache some mapping information locally
-// for fast access.
+// skate-cdx-lookup is a lookup tool for small and large lists of URLs. We try
+// to read from HDFS in parallel and cache some mapping information locally for
+// fast access.
 //
-// What we want: Lookup 10-100M URLs and report, whether we have it or not.
-// Also make this a bit more generic, so we can lookup all kinds of things in
-// the CDX index.
+// What we want: Lookup 10-100M URLs quickly and report, whether the URL is in
+// GWB or not.  Also make this a bit more generic, so we can lookup other
+// things in the CDX index.
 //
-// Alternatives: Spark, Sparkling, Pig, Hive, ...
+// As of 04/2021 the CDX is split into 300 files, each around 230G, for a total
+// of 70T (compressed, maybe 350T plain). Each file comes with a 90M index
+// containing about 1M lines (with surt, offset, ...).
+//
+// Test run and tiny design:
+//
+// * [ ] accept sorted input only
+// * [ ] get first URL, find the corresponding index file
+//
+// Raw index; only HTTP 200, or redirect; include everything; random URL from a
+// source; popular URL; hundreds of captures; filter the dump! SURT; huge
+// efficiency; PIG;
+// https://git.archive.org/webgroup/sandcrawler/-/tree/master/pig
+//
+// Alternatives: Spark, Sparkling, Pig, Hive, Java MR, ...
 //
 // We take advantage of index files and sorted data. The complete dataset is
 // 66TB, gzip compressed. We do not need compute to be distrubuted, as a single
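
The comment above relies on per-file index files and sorted data. A minimal sketch of that lookup step, assuming a simplified in-memory index: the indexEntry type, its fields, and findBlock are hypothetical names for illustration and are not taken from skate-cdx-lookup. Given index entries sorted by SURT key, a binary search finds the one block that could contain a given key.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry is a hypothetical, simplified index line: the first SURT key of
// a compressed block and the byte offset of that block in the CDX file.
type indexEntry struct {
	Key    string
	Offset int64
}

// findBlock returns the position of the last entry whose key is <= key, that
// is, the only block that may contain key. It returns -1 if key sorts before
// the first entry. The entries must be sorted by Key.
func findBlock(entries []indexEntry, key string) int {
	// sort.Search finds the first entry with Key > key; the candidate
	// block is the one just before it.
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].Key > key
	})
	return i - 1
}

func main() {
	// Made-up keys and offsets, for illustration only.
	entries := []indexEntry{
		{Key: "com,example)/", Offset: 0},
		{Key: "org,archive)/", Offset: 1 << 20},
		{Key: "org,wikipedia)/", Offset: 2 << 20},
	}
	for _, key := range []string{"org,archive)/about", "a,b)/"} {
		if i := findBlock(entries, key); i >= 0 {
			fmt.Printf("%s: block %d at offset %d\n", key, i, entries[i].Offset)
		} else {
			fmt.Printf("%s: before first block\n", key)
		}
	}
}
```

With sorted input, consecutive queries map to the same or a later block, so a reader could plausibly move forward through the index without seeking back; whether the tool does this is not stated in the commit.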