Another partial batch of pure `file_meta` updates to file entities. These came
from re-attempting ingest of existing file entities by URL.

Not all attempts ran as expected, partly because of GROBID issues and partly
because we had alternate captures for the same URLs.

Still, about half the attempts worked, so we are going to update a fraction of
the ~520k outstanding file entities that have partial metadata (e.g., missing
SHA-256).
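
The input is one JSON object per line, keyed by file hash. A minimal sketch of
what a record might look like; the field names here are assumptions for
illustration, not confirmed by this note:

    # peek at one input record; the commented shape below is hypothetical
    head -n1 /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json
    # {"sha1hex": "...", "sha256hex": "...", "md5hex": "...", "size_bytes": 12345, "mimetype": "application/pdf"}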

See the `file_meta` cleanups document for prep and QA testing notes.


## Production Commands

    git log | head -n1
    commit 75bde4ad3970e8e63b04009cfd16ed4b9a924ce7

    export FATCAT_AUTH_API_TOKEN=[...]  # sandcrawler-bot
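
Before running anything, a quick sanity check on the input files can catch
obvious problems; a sketch using standard tools:

    # rough sanity checks on the input dataset (line count, record shape)
    wc -l /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json
    head -n3 /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json | jq .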

Start with a small sample:

    cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json \
        | ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
    # Counter({'total': 100, 'skip-existing-complete': 45, 'update': 43, 'skip-no-match': 12, 'skip': 0, 'insert': 0, 'exists': 0})
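
To spot-check one of the `update` edits, the modified entity can be fetched
back from the public API; a sketch, with a placeholder ident:

    # fetch a single updated file entity (FILE_IDENT is a placeholder)
    curl -s https://api.fatcat.wiki/v0/file/FILE_IDENT | jq '{sha1, sha256, size, mimetype}'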

Then run the full batch in parallel:

    cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json \
        | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
    # Counter({'total': 41846, 'update': 19737, 'skip-existing-complete': 18788, 'skip-no-match': 3321, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41522, 'update': 19678, 'skip-existing-complete': 18607, 'skip-no-match': 3237, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41537, 'update': 20517, 'skip-existing-complete': 17895, 'skip-no-match': 3125, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41529, 'update': 19684, 'skip-existing-complete': 18501, 'skip-no-match': 3344, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41530, 'update': 19595, 'skip-existing-complete': 18637, 'skip-no-match': 3298, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41542, 'update': 21359, 'skip-existing-complete': 17033, 'skip-no-match': 3150, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41534, 'update': 19758, 'skip-existing-complete': 18516, 'skip-no-match': 3260, 'skip': 0, 'insert': 0, 'exists': 0})
    # Counter({'total': 41537, 'update': 20507, 'skip-existing-complete': 15543, 'skip-no-match': 5487, 'skip': 0, 'insert': 0, 'exists': 0})
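
Summing the per-worker `update` counts above gives the total, with plain shell
arithmetic:

    # total 'update' count across the eight parallel workers
    echo $((19737 + 19678 + 20517 + 19684 + 19595 + 21359 + 19758 + 20507))
    # 160835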

Import ran pretty fast! Updated about 160k file entities: closer to 1/3 than
1/2 of the ~520k that were missing SHA-256.