author | Bryan Newbold <bnewbold@robocracy.org> | 2022-04-20 16:05:29 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2022-04-20 16:05:29 -0700 |
commit | 3a8dada3267c56fd62b84201b4af96889e4103e6 (patch) | |
tree | 00278249dc6879e2ad0d4c617263cdd6265516f9 /extra/cleanups/file_isiarticles.md | |
parent | cf7412634e3a6935d3f8f8a482d35242b7b17018 (diff) | |
cleanups: isiarticles
Diffstat (limited to 'extra/cleanups/file_isiarticles.md')
-rw-r--r-- | extra/cleanups/file_isiarticles.md | 15 |
1 files changed, 15 insertions, 0 deletions
diff --git a/extra/cleanups/file_isiarticles.md b/extra/cleanups/file_isiarticles.md
new file mode 100644
index 00000000..cb3785af
--- /dev/null
+++ b/extra/cleanups/file_isiarticles.md
@@ -0,0 +1,15 @@
+
+The domain isiarticles.com hosts a bunch of partial spam PDFs.
+
+As a first pass, we can remove these via the domain itself.
+
+A "blocklist" for this domain has been added to sandcrawler, so they should not
+get auto-ingested in the future.
+
+    # 2022-04-20
+    fatcat-cli search file domain:isiarticles.com --count
+    25067
+
+## Prod Cleanup
+
+See bulk edits log.