The domain isiarticles.com hosts a bunch of partial spam PDFs.
As a first pass, we can remove these via the domain itself.
A "blocklist" for this domain has been added to sandcrawler, so they should not
get auto-ingested in the future.
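
As a rough illustration of what a domain-level blocklist check can look like, here is a minimal sketch. It is not sandcrawler's actual implementation: the names `DOMAIN_BLOCKLIST` and `is_blocked` are hypothetical, and matching subdomains as well as the exact host is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist for illustration; only isiarticles.com comes from these notes.
DOMAIN_BLOCKLIST = {"isiarticles.com"}


def is_blocked(url, blocklist=DOMAIN_BLOCKLIST):
    """Return True if the URL's host is a blocklisted domain or a subdomain of one."""
    # hostname is None for URLs without a scheme/netloc; treat those as not blocked.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Require an exact match or a dot-separated suffix, so that an unrelated
    # host such as "notisiarticles.com" is not caught by accident.
    return any(host == d or host.endswith("." + d) for d in blocklist)
```

Comparing on the parsed hostname rather than substring-matching the whole URL avoids false positives when the domain string only appears in a path or query.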
    # 2022-04-20
    fatcat-cli search file domain:isiarticles.com --count
    25067
## Prod Cleanup
See bulk edits log.
Verify cleanup:
    fatcat-cli search file domain:isiarticles.com '!content_scope:*' --count
    0