extra/cleanups/file_single_page.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76


In the past we have crawled and imported many PDF files from Springer, which in
fact are only first-page samples of articles, not the entire fulltext work itself.

The original domain these are served from is:

- `page-one.live.cf.public.springer.com`

The cleanup for these is to update the files with:

1. set the `content_scope` field to `sample`, and remove any `release_ids`
2. in downstream contexts, don't treat `content_scope=sample` as a complete entity

An alternative would be to just delete these files. But this would likely
result in them being re-imported in the future.

We might also want to retain the `release_ids` linkage between files and
releases. The current semantics of file/release are that the file represents a
valid preservation copy of the release though, and that is not the case here.
In this specific case the relationship is partially preserved in edit history,
and could be resurected programatically if needed.

## Existing Files with URL

The easy case to detect is when these files are in fatcat with the original
URL. How many files fall in that cateogry?

    zcat file_export.json.gz \
        | rg '//page-one.live.cf.public.springer.com/' \
        | pv -l \
        | pigz \
        > files_pageone.json.gz
    # 24.4k 0:10:03 [40.5 /s]

## Existing Files without URL

After a partial cleanup, export, and re-load of the `release_file` table in
sandcrawler-db, we could do a join between the `fatcat_file` and the `cdx`
table to identify PDFs which have ever been crawled from
`page-one.live.cf.public.springer.com` and have an entity in fatcat.


## QA

    export FATCAT_API_HOST=https://api.qa.fatcat.wiki
    export FATCAT_AUTH_WORKER_CLEANUP=[...]
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP

    fatcat-cli --version
    fatcat-cli 0.1.6


    fatcat-cli status
         API Version: 0.5.0 (local)
            API host: https://api.qa.fatcat.wiki [successfully connected]
      Last changelog: 5326137
      API auth token: [configured]
             Account: cleanup-bot [bot] [admin] [active]
                      editor_vvnmtzskhngxnicockn4iavyxq

    zcat /srv/fatcat/datasets/files_pageone.json.gz \
        | jq '"file_" + .ident' -r \
        | head -n100 \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg 'page-one.live.cf.public.springer.com' \
        | pv -l \
        | fatcat-cli batch update file release_ids= content_scope=sample --description 'Un-link and mark Springer "page-one" preview PDF files as content_scope=sample' --auto-accept

Looks good!


## Prod

See bulk edits log.