# Version 1
Includes:
* doi, pmid, pmcid, arxiv
* title-lower exact matches
Title join yields 16B+ matches (16761492658), since we have many generic rows, e.g.
"introduction". 180G compressed, about 53 min for one pass.
```
$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
<(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
```
Filter and sample with `awk`, e.g. via:
```
$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
```
Need to pre-filter before the join, to keep the join output smaller.
Basic inspection of the "exact lower title" set.
* 16B+ candidates
* as the join keys are already sorted, we can run uniq
```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst
real 92m28.442s
user 142m49.627s
sys 46m9.473s
```
Some manual sampling:
Different release, but same references (585):
* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references
There are duplicates in the join, need to filter them out.
```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
```
Left with about 13B uniq.
OCI, example:
* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
* OCI: 646 citations
We have 356 via DOI and PMID and about 112 via title, 468 total; which ones do we miss?
However, we do have all but one of the OCI DOIs in fatcat:
```
$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
```
Example, DOI not in OCI:
* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30
Possible mitigations:
* ignore common titles
* ignore numbers only
Examples: `42` appears 3816 times
Harder cases:
* "41st annual meeting" - too generic, and wrong
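The two simple mitigations could be applied as a pre-filter before the join. The sketch below is only an assumption about how such a filter might look: `keep_title` and `COMMON_TITLES` are hypothetical names, and the small stop list just reuses titles mentioned in these notes.

```python
# Hypothetical stop list; a real one would come from the title counts.
COMMON_TITLES = {"introduction", "preface", "book reviews", "science", "cell"}


def keep_title(title, common_titles=COMMON_TITLES):
    """Return True if a lowercased title looks specific enough to join on."""
    t = title.strip().lower()
    if not t:
        return False
    if t in common_titles:
        return False
    # Drop titles made up of digits and punctuation only, e.g. "42".
    if all(ch.isdigit() or not ch.isalnum() for ch in t):
        return False
    return True


for title in ["42", "Introduction", "41st annual meeting"]:
    print(title, keep_title(title))
```

Note that "41st annual meeting" still passes this filter, which is why it is listed as a harder case.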
Generic DOI lookup from OCI in fatcat:
```
$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
{"doi":"10.1530/erc-16-0228","status":200}
{"doi":"10.1371/journal.pone.0080023","status":200}
{"doi":"10.1074/jbc.m114.566141","status":200}
...
```
Overall:
* 31344136 unique titles
most common join title:
* 11,939,631,644 introduction
* also: "science", "preface", "book reviews", ..., "cell", ...
Filtering:
```
$ zstdcat -T0 title_counts.tsv.zst | \
LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
```
About 7275 titles to filter out, e.g.
```
...
475300 abstracts of papers
20502 ac
13892 aca
7881 academic freedom
...
5047 community policing
157176 community-acquired pneumonia
68222 commutative algebra
5512 comorbidity
5516 compact stars
8865 company
...
7353 facebook
6461 facial pain
8977 facilities
5238 facing the future
5064 fact
11198 fact sheet
...
```
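The `awk` condition above combines a count threshold with a length threshold, where `length($0)` is the length of the whole `uniq -c` line, count prefix included. A rough Python equivalent, with the hypothetical name `is_generic`, might look like this:

```python
def is_generic(line):
    """Mirror the awk filter on one `uniq -c` style line.

    The length is taken over the whole line (count and padding included),
    as with awk's length($0).
    """
    count = int(line.split(None, 1)[0])
    n = len(line)
    return (count > 5000 and n < 30) or (count > 15000 and n < 40)


lines = [
    " 475300 abstracts of papers",             # frequent and short
    " 157176 community-acquired pneumonia",    # very frequent, under 40 chars
    "   4999 rare title",                      # below both count thresholds
    "6000 a moderately long title of thirty chars",  # too long for its count
]
flags = [is_generic(line) for line in lines]
print(flags)  # [True, True, False, False]
```

The two-tier threshold lets longer titles through unless they are extremely frequent.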
Trying fuzzycat clustering with 0.1.13, which allows compressing intermediate
artifacts (`-C`).
```
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
```
Using fuzzycat 0.1.13 with compression; all fine until:
```
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
-l | parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
-m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst
1.58G 6:35:39 [66.5k/s] [ <=> ]
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
real 1013m20.128s
user 2696m14.290s
sys 119m29.419s
```
A run with `--compress` and `--tmpdir` set on parallel worked:
```
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
parallel --compress --tmpdir /fast/tmp -j 4 --block 10M --roundrobin --pipe \
'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
zstd -T0 -c > cluster.ndj.zst
real 1301m26.206s
user 2778m20.635s
sys 140m32.121s
```
* 21h, finds 5850385 clusters (seems too low)
# Sample generation
Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:
* ~114M refs
* ~7M releases
Adjusted `tasks.py` to use a different sha1 and updated settings.ini with
sample file locations.
# First clustering
Key extraction (KE), sorting and clustering took 14h, when the merged dataset
is already there (it takes ~80min to convert refs to releases, plus a bit more
to concatenate the files).
```
$ ./run.sh RefsFatcatClusters
real 841m45.169s
user 2872m35.481s
sys 561m14.231s
```
Resulting file is 154G compressed.
Cluster count and sizes:
```
$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
```
Follow up tasks:
* each cluster will have ref and non-ref items
* we want at least one non-ref item
```
$ skate-cluster -both ...
```
Will keep only those clusters that contain at least one ref and one non-ref
entry.
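As an illustration of that condition, and not of how `skate-cluster` implements it, here is a small Python sketch. It assumes the cluster layout visible in the `jq` expressions in these notes (`.k` for the key, `.v` for the documents, `.extra.skate.status == "ref"` marking a reference); the function names and sample records are made up.

```python
import json


def is_ref(doc):
    """True if a document is marked as a reference via extra.skate.status."""
    skate = (doc.get("extra") or {}).get("skate") or {}
    return skate.get("status") == "ref"


def has_ref_and_non_ref(cluster):
    """True if the cluster holds at least one ref and one non-ref document."""
    flags = [is_ref(doc) for doc in cluster.get("v", [])]
    return any(flags) and not all(flags)


# Toy clusters as JSON lines (identifiers are placeholders).
lines = [
    '{"k": "a", "v": [{"ident": "r1", "extra": {"skate": {"status": "ref"}}}, {"ident": "x1"}]}',
    '{"k": "b", "v": [{"ident": "r2", "extra": {"skate": {"status": "ref"}}}]}',
    '{"k": "c", "v": [{"ident": "x2"}, {"ident": "x3"}]}',
]
kept = [c["k"] for c in map(json.loads, lines) if has_ref_and_non_ref(c)]
print(kept)  # ['a']
```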
Found 40257623 clusters, iteration over the 89GB compressed file takes 28min.
Raw synopsis:
```
$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
```
Some numbers:
* [ ] number of 2-clusters, where not both entries have a doi?
Verification.
* needed a different batch verifier, since we do not need pairwise comparisons
```
$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
8390899 Status.DIFFERENT Reason.YEAR
6191622 Status.EXACT Reason.DOI
5468805 Status.STRONG Reason.JACCARD_AUTHORS
3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
424441 Status.AMBIGUOUS Reason.UNKNOWN
199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
138144 Status.AMBIGUOUS Reason.SHORT_TITLE
92054 Status.DIFFERENT Reason.PAGE_COUNT
25122 Status.AMBIGUOUS Reason.BLACKLISTED
22964 Status.EXACT Reason.WORK_ID
17702 Status.STRONG Reason.VERSIONED_DOI
16236 Status.DIFFERENT Reason.COMPONENT
14462 Status.STRONG Reason.PREPRINT_PUBLISHED
9632 Status.STRONG Reason.PMID_DOI_PAIR
3429 Status.STRONG Reason.ARXIV_VERSION
3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
195 Status.STRONG Reason.FIGSHARE_VERSION
76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
74 Status.DIFFERENT Reason.TITLE_FILENAME
43 Status.DIFFERENT Reason.NUM_DIFF
22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
```
Guessing: Maybe 30% "strong", so maybe ~120M new edges?
----
# Manual sampling and issues
```
https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
```
Grobid output:
```xml
<biblStruct xml:id="b77">
<analytic>
<title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
</author>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
</author>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
</author>
<idno type="DOI">10.1080/02697459.2012.661179></idno>
<idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
<ptr target="<http://dx.doi.org/10.1080/02697459.2012.661179>" />
</analytic>
<monogr>
<title level="j">En: Planning Practice and Research</title>
<imprint>
<biblScope unit="volume">27</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="41" to="61" />
</imprint>
</monogr>
</biblStruct>
```
There are dates, but no explicit, clean 2012.
Another issue:
```
https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
```
Very similar titles:
"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...
* years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)
Intermediate match results:
```
141970958 Status.DIFFERENT Reason.YEAR
106734288 Status.EXACT Reason.DOI
91205561 Status.STRONG Reason.JACCARD_AUTHORS
66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
7449880 Status.AMBIGUOUS Reason.UNKNOWN
3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
1199761 Status.DIFFERENT Reason.PAGE_COUNT
1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
395710 Status.EXACT Reason.WORK_ID
362089 Status.DIFFERENT Reason.COMPONENT
351654 Status.AMBIGUOUS Reason.BLACKLISTED
326730 Status.STRONG Reason.VERSIONED_DOI
239924 Status.STRONG Reason.PREPRINT_PUBLISHED
171594 Status.STRONG Reason.PMID_DOI_PAIR
54646 Status.STRONG Reason.ARXIV_VERSION
49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
5219 Status.DIFFERENT Reason.TITLE_FILENAME
2451 Status.AMBIGUOUS Reason.APPENDIX
1874 Status.STRONG Reason.FIGSHARE_VERSION
1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
774 Status.DIFFERENT Reason.NUM_DIFF
448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```
Another false negative:
* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
```
https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
```
Both docs contain 1972?
```xml
<biblStruct xml:id="b67">
<analytic>
<title level="a" type="main">Variational Wavefunctions for H2 +</title>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
</author>
</analytic>
<monogr>
<title level="j">J. Chem. Phys</title>
<imprint>
<biblScope unit="volume">56</biblScope>
<biblScope unit="page" from="3798" to="3801" />
<date type="published" when="1972" />
</imprint>
</monogr>
</biblStruct>
```
----
Running:
```
$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv
```
resulted in a 69GB tsv file and took 3056m5.322s (~50h), 512033197 comparisons.
Stats:
```
$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
LC_ALL=C sort -S20% | uniq -c | sort -nr
146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
94300998 Status.STRONG Reason.JACCARD_AUTHORS
68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
7746937 Status.AMBIGUOUS Reason.UNKNOWN
3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
1265506 Status.DIFFERENT Reason.PAGE_COUNT
1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
409043 Status.EXACT Reason.WORK_ID
374051 Status.DIFFERENT Reason.COMPONENT
356772 Status.AMBIGUOUS Reason.BLACKLISTED
336588 Status.STRONG Reason.VERSIONED_DOI
249723 Status.STRONG Reason.PREPRINT_PUBLISHED
177547 Status.STRONG Reason.PMID_DOI_PAIR
56445 Status.STRONG Reason.ARXIV_VERSION
51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
5255 Status.DIFFERENT Reason.TITLE_FILENAME
2451 Status.AMBIGUOUS Reason.APPENDIX
1946 Status.STRONG Reason.FIGSHARE_VERSION
1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
798 Status.DIFFERENT Reason.NUM_DIFF
463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```
286M positive links.
```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
286008492
```
Or 175M, if we exclude DOI and work matches.
```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
175547235
```
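The two sums can be cross-checked without the shell pipeline. The sketch below copies the STRONG and EXACT rows from the stats above (plus a few other rows to show the filtering) and adds them up; `positive_total` is a hypothetical helper name. It matches reasons exactly rather than by substring as `grep` does, which gives the same result on this data.

```python
STATS = """\
146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
94300998 Status.STRONG Reason.JACCARD_AUTHORS
68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
7746937 Status.AMBIGUOUS Reason.UNKNOWN
3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
409043 Status.EXACT Reason.WORK_ID
336588 Status.STRONG Reason.VERSIONED_DOI
249723 Status.STRONG Reason.PREPRINT_PUBLISHED
177547 Status.STRONG Reason.PMID_DOI_PAIR
56445 Status.STRONG Reason.ARXIV_VERSION
51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
1946 Status.STRONG Reason.FIGSHARE_VERSION
18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
"""

POSITIVE = {"Status.STRONG", "Status.EXACT"}


def positive_total(lines, exclude_reasons=()):
    """Sum counts of STRONG/EXACT rows, skipping the given reasons."""
    total = 0
    for line in lines:
        count, status, reason = line.split()
        if status in POSITIVE and reason not in exclude_reasons:
            total += int(count)
    return total


rows = STATS.splitlines()
all_positive = positive_total(rows)
without_ids = positive_total(rows, exclude_reasons={"Reason.DOI", "Reason.WORK_ID"})
print(all_positive)  # 286008492
print(without_ids)   # 175547235
```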
----
The final derivation dep tree looks like:
```
$ ./tasks.py -d BiblioRef
\_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
```