Diffstat (limited to 'extra/cleanups/file_isiarticles.md')
-rw-r--r-- extra/cleanups/file_isiarticles.md | 20
1 files changed, 20 insertions, 0 deletions
diff --git a/extra/cleanups/file_isiarticles.md b/extra/cleanups/file_isiarticles.md
new file mode 100644
index 00000000..3858361c
--- /dev/null
+++ b/extra/cleanups/file_isiarticles.md
@@ -0,0 +1,20 @@
+
+The domain isiarticles.com hosts a large number of spam PDFs that contain only partial articles.
+
+As a first pass, we can remove these files by matching on the domain itself.
+
+A "blocklist" entry for this domain has been added to sandcrawler, so these
+files should not get auto-ingested in the future.
+
+    # 2022-04-20
+    fatcat-cli search file domain:isiarticles.com --count
+    25067
+
+## Prod Cleanup
+
+See bulk edits log.
+
+Verify cleanup:
+
+    fatcat-cli search file domain:isiarticles.com '!content_scope:*' --count
+    0
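
The note above mentions a domain "blocklist" without showing how one might match URLs against it. Below is a minimal sketch of one way such a check could work. It is not taken from sandcrawler: the names `BLOCKED_DOMAINS` and `is_blocked` are hypothetical, and the choice to match subdomains as well as the bare domain is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; only isiarticles.com comes from the note above.
BLOCKED_DOMAINS = {"isiarticles.com"}


def is_blocked(url: str, blocked=BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    # urlsplit().hostname is already lowercased; it is None when no host parses.
    host = (urlsplit(url).hostname or "").rstrip(".")
    # Compare on a label boundary so "notisiarticles.com" does not match.
    return any(host == d or host.endswith("." + d) for d in blocked)
```

Matching on the parsed hostname rather than a substring of the URL avoids false positives for look-alike domains and for URLs that merely mention the domain in their path.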