Seem to have accidentally backfilled many CDX lines with non-PDF content.
These should be cleared out!
Something like:
    mimetype = 'text/html'
    row_created < '2019-10-01'
Or maybe instead:
    mimetype = 'text/html'
    not in file_meta
SQL:
SELECT * FROM cdx WHERE mimetype = 'text/html' AND row_created < '2019-10-01' LIMIT 5;
SELECT COUNT(1) FROM cdx WHERE mimetype = 'text/html' AND row_created < '2019-10-01';
=> 24841846
SELECT * FROM cdx LEFT JOIN file_meta ON file_meta.sha1hex = cdx.sha1hex WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL LIMIT 5;
SELECT COUNT(1) FROM cdx LEFT JOIN file_meta ON cdx.sha1hex = file_meta.sha1hex WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL;
=> 24547552
DELETE FROM cdx
WHERE sha1hex IN
    (SELECT cdx.sha1hex
     FROM cdx
     LEFT JOIN file_meta ON file_meta.sha1hex = cdx.sha1hex
     WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL);
=> DELETE 24553428
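Slightly more rows deleted than counted (24553428 vs. 24547552). The outer
DELETE matched on sha1hex alone, so it likely also removed cdx rows that share
a sha1hex with a matched row but have a different mimetype. Probably should
have had an "AND cdx.mimetype = 'text/html'" in the DELETE WHERE clause.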
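A small reproduction of the over-delete, as a sketch only. It uses an in-memory SQLite database rather than the real database, and the table layout (just url, sha1hex, mimetype, sha256hex), the sample rows, and the helper name rows_deleted are all made up for illustration. It runs the DELETE as written above and a variant with the extra mimetype condition on the outer WHERE, and compares the row counts.

```python
import sqlite3

# Toy stand-ins for the real tables: only the columns these notes mention.
SCHEMA = """
CREATE TABLE cdx (url TEXT, sha1hex TEXT, mimetype TEXT);
CREATE TABLE file_meta (sha1hex TEXT PRIMARY KEY, sha256hex TEXT);
"""

# Hypothetical sample rows.
ROWS = [
    ("http://a.example/1", "aaa", "text/html"),        # HTML, no file_meta: should go
    ("http://a.example/2", "aaa", "application/pdf"),  # same sha1hex, other mimetype
    ("http://b.example/1", "bbb", "text/html"),        # HTML but has file_meta: keep
    ("http://c.example/1", "ccc", "application/pdf"),  # unrelated PDF: keep
]

SUBQUERY = """
    (SELECT cdx.sha1hex
     FROM cdx
     LEFT JOIN file_meta ON file_meta.sha1hex = cdx.sha1hex
     WHERE cdx.mimetype = 'text/html' AND file_meta.sha256hex IS NULL)
"""

# The statement as it was run: the outer WHERE matches on sha1hex only.
DELETE_AS_RUN = "DELETE FROM cdx WHERE sha1hex IN" + SUBQUERY

# Variant with the mimetype condition repeated on the outer WHERE.
DELETE_WITH_MIMETYPE = (
    "DELETE FROM cdx WHERE cdx.mimetype = 'text/html' AND sha1hex IN" + SUBQUERY
)


def rows_deleted(delete_sql):
    """Build a fresh toy database, run one DELETE, return the deleted row count."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO cdx VALUES (?, ?, ?)", ROWS)
    conn.execute("INSERT INTO file_meta VALUES ('bbb', 'deadbeef')")
    deleted = conn.execute(delete_sql).rowcount
    conn.close()
    return deleted


print(rows_deleted(DELETE_AS_RUN))         # 2: both 'aaa' rows, including the PDF one
print(rows_deleted(DELETE_WITH_MIMETYPE))  # 1: only the text/html 'aaa' row
```

On the real tables, running the matching SELECT COUNT(1) with the same outer WHERE clause before the DELETE would show whether the two numbers agree.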