Simply de-duplicating container entities on the basis of ISSN-L.
Initial plan is to:
- only merge containers with zero (0) release entities pointing at them
- not update any containers which have had human edits
- not merge additional metadata from redirected entities to the "primary" entity
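
The two skip rules in the plan above can be sketched as a small predicate. This is a minimal illustration only: `ContainerInfo`, `eligible_for_merge`, and the `has_human_edits` flag are hypothetical names, not the actual `fatcat_tools` merger code, and how human edits are detected is left out.

```python
from dataclasses import dataclass


@dataclass
class ContainerInfo:
    """Hypothetical summary of one container entity (not the fatcat API model)."""
    ident: str
    release_count: int
    has_human_edits: bool


def eligible_for_merge(container: ContainerInfo, max_releases: int = 0) -> bool:
    """Return True only if the container passes both conservative checks."""
    if container.release_count > max_releases:
        return False  # too many releases still point at this container
    if container.has_human_edits:
        return False  # leave human-edited entities alone
    return True


# Example: only the first container would be merged under the initial plan.
candidates = [
    ContainerInfo("aaa", release_count=0, has_human_edits=False),
    ContainerInfo("bbb", release_count=12, has_human_edits=False),
    ContainerInfo("ccc", release_count=0, has_human_edits=True),
]
mergeable = [c.ident for c in candidates if eligible_for_merge(c)]
```

The default `max_releases=0` matches the "zero release entities" rule; raising it loosens only the first check.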
## Prep
Using commands from `check_issnl.sh`:

    zcat container_export.json.gz \
        | jq '[.issnl, .ident] | @tsv' -r \
        | sort -S 4G \
        | uniq -D -w 9 \
        > issnl_ident.dupes.tsv

    wc -l issnl_ident.dupes.tsv
    # 3174 issnl_ident.dupes.tsv

    cut -f1 issnl_ident.dupes.tsv | uniq | wc -l
    # 835

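
For reference, a rough Python equivalent of the pipeline above, as a minimal sketch. It assumes the export is JSON lines with `issnl` and `ident` keys (as the `jq` filter implies). `find_issnl_dupes` is a hypothetical helper name, not something from `check_issnl.sh`. Unlike the shell version, it skips records with no ISSN-L outright.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def find_issnl_dupes(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group container idents by ISSN-L, keeping only ISSN-Ls seen more than once."""
    by_issnl = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        issnl = record.get("issnl")
        if not issnl:
            continue  # containers without an ISSN-L cannot be matched this way
        by_issnl[issnl].append(record["ident"])
    return {issnl: idents for issnl, idents in by_issnl.items() if len(idents) > 1}


# Usage (assumes the export file from the shell commands is present):
#   import gzip
#   with gzip.open("container_export.json.gz", "rt") as f:
#       dupes = find_issnl_dupes(f)
#   print(sum(len(v) for v in dupes.values()), len(dupes))
```

With the counts above, 3174 rows across 835 ISSN-Ls works out to roughly 3.8 container entities per duplicated ISSN-L, so merging every group would redirect 3174 - 835 = 2339 entities.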
Run transform script:

    cat issnl_ident.dupes.tsv | ./container_dupe_to_json.py | pv -l > container_issnl_dupes.json

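
A sketch of what a transform like this might do: collapse the sorted `issnl<TAB>ident` rows into one JSON line per ISSN-L. The output field names (`entity_type`, `primary_id`, `duplicate_ids`, `evidence`) are assumptions for illustration; these notes do not show the real output of `container_dupe_to_json.py`, and `dupes_tsv_to_json` is a hypothetical name.

```python
import itertools
import json
from typing import Iterable, Iterator


def dupes_tsv_to_json(lines: Iterable[str]) -> Iterator[str]:
    """Turn sorted `issnl<TAB>ident` rows into one JSON line per ISSN-L group."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # Drop malformed rows and rows with an empty ISSN-L or ident
    rows = (row for row in rows if len(row) == 2 and row[0] and row[1])
    # groupby only merges adjacent rows, so the input must already be sorted by ISSN-L
    for issnl, group in itertools.groupby(rows, key=lambda row: row[0]):
        idents = [ident for _, ident in group]
        if len(idents) < 2:
            continue  # not a duplicate group
        yield json.dumps(
            {
                "entity_type": "container",  # assumed field names throughout
                "primary_id": None,  # left for the merger to choose
                "duplicate_ids": idents,
                "evidence": {"extid": issnl, "extid_type": "issnl"},
            }
        )
```

Fed from `sys.stdin` and printed line by line, this would slot into the same pipe as the command above. The output should then have one line per duplicated ISSN-L, so 835 lines for the counts in the Prep step.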
Create a small random sample:

    shuf -n100 container_issnl_dupes.json > container_issnl_dupes.sample.json

## QA Testing

    git log | head -n1
    # commit e72d61e60c43911b6d77c4842951441235561dcf

    export FATCAT_AUTH_API_TOKEN=[...]

    head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -

Got various errors and patched them:

    AttributeError: 'EntityHistoryEntry' object has no attribute 'editor'

    requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: https://fatcat.wiki/container/%7Bident%7D/stats.json

    fatcat_openapi_client.exceptions.ApiValueError: Missing the required parameter `editgroup_id` when calling `accept_editgroup`

Run again:

    head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -
    # Running in dry-run mode!
    # Counter({'updated-entities': 96, 'skip-container-release-count': 84, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})

Finally! Dry-run mode actually worked. Try the entire sample in dry-run:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -
    # Running in dry-run mode!
    # Counter({'updated-entities': 310, 'skip-container-release-count': 251, 'lines': 100, 'merged': 100, 'skip': 0, 'updated-total': 0})

How about a small `max-container-releases`:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers -
    # Running in dry-run mode!
    # Counter({'updated-entities': 310, 'skip-container-release-count': 251, 'lines': 100, 'merged': 100, 'skip': 0, 'updated-total': 0})

Exact same count... maybe something isn't working? Debugged and fixed it.

    requests.exceptions.HTTPError: 503 Server Error: SERVICE UNAVAILABLE for url: https://fatcat.wiki/container/xn7i2sdijzbypcetz77kttj76y/stats.json

    # Running in dry-run mode!
    # Counter({'updated-entities': 310, 'lines': 100, 'merged': 100, 'skip-container-release-count': 92, 'skip': 0, 'updated-total': 0})

From skimming, it looks like 100 releases is probably a good cut-off. There are
quite a lot of these dupes!
Try some actual merges:

    head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers -
    # Counter({'updated-entities': 96, 'skip-container-release-count': 84, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})

Run immediately again:

    # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'skip-container-release-count': 2, 'merged': 0, 'updated-total': 0})

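
The second run skipping everything is the behavior you would want from an idempotent merger. A toy simulation of that pattern follows, assuming a group is skipped whenever any of its entities is no longer in the `active` state. The state names, the choice of primary, and `merge_groups` itself are assumptions for illustration, not the real `fatcat_tools` logic.

```python
from collections import Counter
from typing import Dict, List


def merge_groups(groups: List[List[str]], states: Dict[str, str]) -> Counter:
    """Simulate one merge pass; `states` maps ident -> state and is mutated in place."""
    counts = Counter()
    for idents in groups:
        counts["lines"] += 1
        if any(states.get(ident) != "active" for ident in idents):
            # Something in this group was already merged (or deleted): leave it alone
            counts["skip"] += 1
            counts["skip-not-active-entity"] += 1
            continue
        # Assumption: an arbitrary but stable choice of primary (lowest ident)
        primary, *duplicates = sorted(idents)
        for ident in duplicates:
            states[ident] = "redirect"
        counts["merged"] += 1
    return counts


states = {"a1": "active", "a2": "active", "b1": "active", "b2": "active"}
groups = [["a1", "a2"], ["b1", "b2"]]
first_run = merge_groups(groups, states)
second_run = merge_groups(groups, states)
```

Here `first_run` merges both groups and `second_run` skips both, mirroring the `merged` and `skip-not-active-entity` counts in the two runs above.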
Run all the samples, with limit of 100 releases:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers - --max-container-releases 100
    # Counter({'updated-entities': 214, 'lines': 100, 'merged': 75, 'skip': 25, 'skip-not-active-entity': 25, 'skip-container-release-count': 15, 'updated-total': 0})

Wow, there are going to be a lot of these containers not merged because they
have so many releases! Will have to do a second, more carefully reviewed (?)
round of merging.
Unfortunately, not seeing any human-edited container entities here to check if
that filter is working.