The domain isiarticles.com hosts a large number of partial spam PDFs.

As a first pass, we can remove these via the domain itself.

A blocklist entry for this domain has been added to sandcrawler, so these files
should not get auto-ingested in the future.

    # 2022-04-20
    fatcat-cli search file domain:isiarticles.com --count
    25067

## Prod Cleanup

See bulk edits log.

Verify cleanup (no files from this domain should remain without a `content_scope`):

    fatcat-cli search file domain:isiarticles.com '!content_scope:*' --count
    0
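
The check above could be wrapped in a small script, so a non-zero count is reported as a failure rather than read by eye. This is only a sketch: `cleanup_is_complete` is a hypothetical helper name, not part of fatcat-cli, and it assumes the `--count` output is a bare integer, as in the runs shown in these notes.

```shell
#!/bin/sh
# Hypothetical helper (not part of fatcat-cli): succeeds only when the
# count passed as the first argument is exactly zero.
cleanup_is_complete() {
    count="$1"
    # Reject empty or non-numeric input (assumes --count prints a bare integer).
    case "$count" in
        ''|*[!0-9]*)
            echo "unexpected count: '$count'" >&2
            return 2
            ;;
    esac
    if [ "$count" -eq 0 ]; then
        echo "cleanup complete"
    else
        echo "$count files still lack content_scope"
        return 1
    fi
}

# Only run the query when fatcat-cli is installed; the query itself is the
# same one used in the verification step.
if command -v fatcat-cli >/dev/null 2>&1; then
    cleanup_is_complete "$(fatcat-cli search file domain:isiarticles.com '!content_scope:*' --count)"
fi
```

The exit status (0 when complete, 1 when files remain, 2 on unexpected output) lets the check be chained into other scripts.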