Seem to have accidentally backfilled many CDX rows with non-PDF content.
These should be cleared out!

Something like:

    mimetype = 'text/html'
    row_created < '2019-10-01'

Or maybe instead:

    mimetype = 'text/html'
    not in file_meta

SQL:

    SELECT *        FROM cdx WHERE mimetype = 'text/html' AND row_created < '2019-10-01' LIMIT 5;
    SELECT COUNT(1) FROM cdx WHERE mimetype = 'text/html' AND row_created < '2019-10-01';
    => 24841846

    SELECT *        FROM cdx LEFT JOIN file_meta ON file_meta.sha1hex = cdx.sha1hex WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL LIMIT 5;
    SELECT COUNT(1) FROM cdx LEFT JOIN file_meta ON cdx.sha1hex = file_meta.sha1hex WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL;
    => 24547552

    DELETE FROM cdx
      WHERE sha1hex IN
      (SELECT cdx.sha1hex
       FROM cdx
       LEFT JOIN file_meta ON file_meta.sha1hex = cdx.sha1hex
       WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL);
    => DELETE 24553428

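Slightly more rows deleted than counted (24553428 vs. 24547552, so 5876 extra).
The outer DELETE matches on sha1hex alone, so it presumably also removed rows
that share a sha1hex with a matching 'text/html' row but have a different
mimetype. Probably should have had an "AND cdx.mimetype = 'text/html'" in the
outer DELETE WHERE clause.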
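
A minimal sketch of how the DELETE count can exceed the SELECT count, using
Python's sqlite3 and a made-up toy schema. The real tables are presumably larger
and live in a different database; the `url` column and the sample rows here are
assumptions for illustration only. It runs the DELETE as written above and a
variant with the extra mimetype filter on the outer WHERE clause.

```python
import sqlite3

# Anti-join from the notes: text/html CDX rows with no file_meta entry.
SUBQUERY = """
    SELECT cdx.sha1hex
    FROM cdx
    LEFT JOIN file_meta ON file_meta.sha1hex = cdx.sha1hex
    WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL
"""

COUNT_SQL = """
    SELECT COUNT(1)
    FROM cdx
    LEFT JOIN file_meta ON cdx.sha1hex = file_meta.sha1hex
    WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL
"""

# DELETE as run in the notes: outer filter is on sha1hex only.
DELETE_AS_RUN = f"DELETE FROM cdx WHERE sha1hex IN ({SUBQUERY})"

# Variant with the mimetype filter repeated on the outer WHERE clause.
DELETE_FIXED = (
    f"DELETE FROM cdx WHERE mimetype = 'text/html' AND sha1hex IN ({SUBQUERY})"
)


def build_db():
    """Create an in-memory toy database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cdx (url TEXT, sha1hex TEXT, mimetype TEXT);
        CREATE TABLE file_meta (sha1hex TEXT PRIMARY KEY, sha256hex TEXT);
        INSERT INTO cdx VALUES
            ('http://a/1', 'aaa', 'text/html'),        -- HTML, no file_meta
            ('http://a/2', 'aaa', 'application/pdf'),  -- same sha1hex, other mimetype
            ('http://b/1', 'bbb', 'text/html'),        -- HTML, but has file_meta
            ('http://c/1', 'ccc', 'application/pdf');  -- unrelated PDF
        INSERT INTO file_meta VALUES ('bbb', 'b256'), ('ccc', 'c256');
    """)
    return conn


def count_candidates(conn):
    """Number of rows the COUNT query from the notes reports."""
    return conn.execute(COUNT_SQL).fetchone()[0]


def run_delete(conn, sql):
    """Run a DELETE statement and return how many rows it removed."""
    return conn.execute(sql).rowcount


if __name__ == "__main__":
    print("counted:       ", count_candidates(build_db()))
    print("deleted as run:", run_delete(build_db(), DELETE_AS_RUN))
    print("deleted fixed: ", run_delete(build_db(), DELETE_FIXED))
```

In this toy data the as-run DELETE also removes the 'application/pdf' row that
shares sha1hex 'aaa' with an HTML row, while the variant with the outer mimetype
filter removes only the row that the COUNT query reported. Whether this accounts
for all of the extra rows in the real table is not verified here; rows inserted
between the COUNT and the DELETE would have the same effect.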