Simply de-duplicating container entities on the basis of ISSN-L.

Initial plan is to:

- only merge containers with zero (0) release entities pointing at them (see the sketch after this list)
- not update any containers which have had human edits
- not merge additional metadata from redirected entities into the "primary" entity
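
A rough sketch of the release-count guard behind that first rule (a hypothetical helper, not the actual merger code; the `total` key is an assumption about the `stats.json` response schema):

    import requests

    def release_count(ident: str) -> int:
        # public stats endpoint, also used later in this log for release counts
        resp = requests.get(f"https://fatcat.wiki/container/{ident}/stats.json")
        resp.raise_for_status()
        return resp.json()["total"]

    def eligible_to_merge(ident: str) -> bool:
        # initial rule: only merge containers with zero releases pointing at them
        return release_count(ident) == 0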


## Prep

Using commands from `check_issnl.sh`, extract (ISSN-L, ident) pairs and keep every row whose ISSN-L (the first nine characters, hence `uniq -D -w 9`) appears more than once:

    zcat container_export.json.gz \
        | jq '[.issnl, .ident] | @tsv' -r \
        | sort -S 4G \
        | uniq -D -w 9 \
        > issnl_ident.dupes.tsv

    wc -l issnl_ident.dupes.tsv
    # 3174 issnl_ident.dupes.tsv

    cut -f1 issnl_ident.dupes.tsv | uniq | wc -l
    # 835

So 3,174 container idents collapse down to 835 distinct ISSN-Ls: roughly 3.8 duplicate entities per ISSN-L on average.

Run transform script:

    cat issnl_ident.dupes.tsv | ./container_dupe_to_json.py | pv -l > container_issnl_dupes.json
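
A minimal sketch of what `container_dupe_to_json.py` presumably does (the output schema here is an assumption): group the sorted `(issnl, ident)` rows by ISSN-L and emit one JSON object per duplicated ISSN-L.

    #!/usr/bin/env python3
    """Sketch of container_dupe_to_json.py: TSV of (issnl, ident) -> JSON lines."""
    import json
    import sys
    from itertools import groupby

    def main():
        # rows are "issnl<TAB>ident", already sorted by ISSN-L upstream
        rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for issnl, group in groupby(rows, key=lambda r: r[0]):
            idents = [ident for _, ident in group]
            if len(idents) > 1:
                print(json.dumps({"issnl": issnl, "idents": idents}))

    if __name__ == "__main__":
        main()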

Create a small random sample:

    shuf -n100 container_issnl_dupes.json > container_issnl_dupes.sample.json

## QA Testing

    git log | head -n1
    # commit e72d61e60c43911b6d77c4842951441235561dcf

    export FATCAT_AUTH_API_TOKEN=[...]

    head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -

Got various errors and patched them:

    AttributeError: 'EntityHistoryEntry' object has no attribute 'editor'

    requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: https://fatcat.wiki/container/%7Bident%7D/stats.json

    fatcat_openapi_client.exceptions.ApiValueError: Missing the required parameter `editgroup_id` when calling `accept_editgroup`
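
The 404 is telling: the literal `%7Bident%7D` in the URL is an un-interpolated `{ident}`, so the stats URL was built from a template string that was never formatted. The likely shape of that fix (variable name assumed):

    # before: the template string was sent verbatim, URL-encoding "{ident}"
    url = "https://fatcat.wiki/container/{ident}/stats.json"
    # after: actually interpolate the container identifier
    url = f"https://fatcat.wiki/container/{ident}/stats.json"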

Run again:

    head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -
    # Running in dry-run mode!
    # Counter({'updated-entities': 96, 'skip-container-release-count': 84, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})

Finally! Dry-run mode actually worked. Try the entire sample in dry-run:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -
    # Running in dry-run mode!
    # Counter({'updated-entities': 310, 'skip-container-release-count': 251, 'lines': 100, 'merged': 100, 'skip': 0, 'updated-total': 0})

How about a small `max-container-releases`:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -
    # Running in dry-run mode!
    # Counter({'updated-entities': 310, 'skip-container-release-count': 251, 'lines': 100, 'merged': 100, 'skip': 0, 'updated-total': 0})

Exact same counts as without the limit... maybe the flag isn't being applied? Debugged and fixed it. The next run hit a transient server error from the stats endpoint:

    requests.exceptions.HTTPError: 503 Server Error: SERVICE UNAVAILABLE for url: https://fatcat.wiki/container/xn7i2sdijzbypcetz77kttj76y/stats.json

Re-running after that completed cleanly:

    # Running in dry-run mode!
    # Counter({'updated-entities': 310, 'lines': 100, 'merged': 100, 'skip-container-release-count': 92, 'skip': 0, 'updated-total': 0})

From skimming the output, a limit of 100 releases looks like a reasonable cut-off. There are quite a lot of these dupes!

Try some actual merges:

    head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers -
    # Counter({'updated-entities': 96, 'skip-container-release-count': 84, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})

Run immediately again; every line now skips, since the duplicate entities were already redirected by the first run:

    # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'skip-container-release-count': 2, 'merged': 0, 'updated-total': 0})
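
The `skip-not-active-entity` counter presumably reflects a state check along these lines (a sketch; entity `state` is one of "active", "redirect", or "deleted" in the fatcat API, and the real merger code may differ):

    def is_mergeable(api, ident: str) -> bool:
        # a redirect or deleted entity was already handled by an earlier run
        entity = api.get_container(ident)
        return entity.state == "active"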

Run the whole sample, with a limit of 100 releases:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers - --max-container-releases 100
    # Counter({'updated-entities': 214, 'lines': 100, 'merged': 75, 'skip': 25, 'skip-not-active-entity': 25, 'skip-container-release-count': 15, 'updated-total': 0})

Wow, there are going to be a lot of containers left unmerged because they have so many releases! (The 25 `skip-not-active-entity` lines are just the entities already merged in the earlier `head -n25` run.) A second, more carefully reviewed round of merging will be needed for the high-release-count cases.

Unfortunately, this sample doesn't seem to contain any human-edited container entities, so there was no chance to verify that that filter is working.
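
For reference, that filter presumably walks the entity's edit history and bails on any editgroup from a non-bot editor; a sketch of the shape (attribute paths partly assumed; the `AttributeError` above came from exactly this area):

    def has_only_bot_edits(api, ident: str) -> bool:
        # EntityHistoryEntry has no .editor; the editor hangs off the editgroup
        for entry in api.get_container_history(ident):
            if not entry.editgroup.editor.is_bot:
                # a human touched this container: leave it alone
                return False
        return True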