# Version 1

Includes:

* doi, pmid, pmcid, arxiv
* title-lower exact matches

Title join yields 16B+ matches (16761492658), since we have many generic rows, e.g.
"introduction". 180G compressed, about 53 min for one pass.

```
$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
    <(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
```

Filter and sample with `awk`, e.g. via:

```
$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
```

Need to pre-filter before join, to keep join smaller.

Basic inspection of the "exact lower title" set.

* 16B+ candidates
* as the join keys are already sorted, we can run uniq

```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst

real    92m28.442s
user    142m49.627s
sys     46m9.473s
```
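
Because equal join keys are adjacent in the sorted output, counting needs only one streaming pass and constant memory. A minimal Python sketch of the same idea as `cut -f 1 | uniq -c`, assuming tab-separated lines with the title in the first column; the name `count_sorted_keys` is made up for illustration:

```python
import itertools


def count_sorted_keys(lines, sep="\t"):
    """Yield (key, count) for each run of equal first-column keys.

    Only adjacent lines are merged, so the input must already be
    sorted (or at least grouped) by key, as the join output is.
    """
    keys = (line.rstrip("\n").split(sep, 1)[0] for line in lines)
    for key, group in itertools.groupby(keys):
        # Count the run without materializing it.
        yield key, sum(1 for _ in group)
```

On unsorted input this would emit the same key several times, which is the same caveat that applies to `uniq`.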

Some manual sampling:

Different release, but same references (585):

* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references

There are duplicates in the join, need to filter them out.

```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
```

Left with about 13B uniq.

OCI, example:

* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
* OCI: 646 citations

We have 356 via doi, pmid, about 112 via title, 468 total; which ones do we miss?

However, we do have all but one of the OCI DOIs in fatcat:

```
$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
```

Example, DOI not in OCI:

* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30

Possible mitigations:

* ignore common titles
* ignore numbers only

Examples: `42` appears 3816 times

Harder cases:

* "41st annual meeting" - too generic, and wrong
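
The two mitigations above could be applied as a pre-filter before the join. A rough Python sketch, assuming titles are already lowercased; the stop list `COMMON_TITLES` and the function `keep_title` are hypothetical and only seeded with a few titles mentioned in these notes:

```python
import re

# Hypothetical stop list; a real one would be derived from the title
# counts rather than hard-coded.
COMMON_TITLES = {"introduction", "preface", "science", "book reviews"}

# Titles made up of digits, whitespace and simple punctuation only.
NUMBERS_ONLY = re.compile(r"^[\d\s.,-]+$")


def keep_title(title, common=COMMON_TITLES):
    """Return True if a title looks specific enough to join on."""
    title = title.strip().lower()
    if not title:
        return False
    if NUMBERS_ONLY.match(title):
        return False  # e.g. "42"
    if title in common:
        return False  # e.g. "introduction"
    return True
```

A filter like this would still let "41st annual meeting" through, so the harder cases need something beyond these two rules.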


Generic DOI lookup from OCI in fatcat:

```
$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
{"doi":"10.1530/erc-16-0228","status":200}
{"doi":"10.1371/journal.pone.0080023","status":200}
{"doi":"10.1074/jbc.m114.566141","status":200}
...
```

Overall:

* 31344136 unique titles

Most common join title:

* 11,939,631,644 introduction
* also: "science", "preface", "book reviews", ..., "cell", ...

Filtering:

```
$ zstdcat -T0 title_counts.tsv.zst | \
    LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
```

About 7275 titles to filter out, e.g.

```
...
 475300 abstracts of papers
  20502 ac
  13892 aca
   7881 academic freedom
...
   5047 community policing
 157176 community-acquired pneumonia
  68222 commutative algebra
   5512 comorbidity
   5516 compact stars
   8865 company
...
   7353 facebook
   6461 facial pain
   8977 facilities
   5238 facing the future
   5064 fact
  11198 fact sheet
...
```
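
For reference, the awk condition from the filtering step can be written out in Python. This is a sketch, assuming `uniq -c` style lines (padded count, space, title) and assuming that awk's `length($0)` counts bytes under `LC_ALL=C`; `is_too_common` is an illustrative name:

```python
def is_too_common(line):
    """Apply the two count/length thresholds to one `uniq -c` line."""
    line = line.rstrip("\n")
    count = int(line.split(None, 1)[0])
    # Length of the whole line, including the padded count column,
    # measured in bytes to mirror awk in the C locale.
    size = len(line.encode("utf-8"))
    return (count > 5000 and size < 30) or (count > 15000 and size < 40)
```

Since the length includes the count column, the effective title length limit is a few characters below 30 and 40.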

Trying fuzzycat clustering with 0.1.13, which can compress intermediate
artifacts (`-C`).

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
    fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
```

Using fuzzycat 0.1.13 with compression; all fine until:

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
    -l | parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
    -m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst

1.58G 6:35:39 [66.5k/s] [                                                                                                                 <=>                                                                                                 ]
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

real    1013m20.128s
user    2696m14.290s
sys     119m29.419s
```

A run with `--compress` and `--tmpdir` set on parallel worked:

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel --compress --tmpdir /fast/tmp -j 4 --block 10M  --roundrobin --pipe \
    'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
    zstd -T0 -c > cluster.ndj.zst

real    1301m26.206s
user    2778m20.635s
sys     140m32.121s
```

* 21h, finds 5850385 clusters (seems too low)

# Sample generation

Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:

* ~114M refs
* ~7M releases

Adjusted `tasks.py` to use a different sha1 and updated settings.ini with
sample file locations.

# First clustering

Key extraction (KE), sorting and clustering took 14h, when the merged dataset
is already there (it takes ~80min to convert refs to releases, plus a bit more
to concatenate the files).

```
$ ./run.sh RefsFatcatClusters

real    841m45.169s
user    2872m35.481s
sys     561m14.231s
```

Resulting file is 154G compressed.

Cluster count and sizes:

```
$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
```

Follow up tasks:

* each cluster will have ref and non-ref items
* we want at least one non-ref item

```
$ skate-cluster -both ...
```

Will keep only those clusters that contain at least one ref and one non-ref
entry.
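
The "both" condition can be sketched in Python as follows. This is not the `skate-cluster` implementation; it only assumes the cluster layout visible in the `jq` expressions in these notes (a list under `v`, with refs marked by `.extra.skate.status == "ref"`), and the function names are made up:

```python
import json


def is_ref(doc):
    """True if a clustered document came from the refs side."""
    skate = (doc.get("extra") or {}).get("skate") or {}
    return skate.get("status") == "ref"


def has_both(cluster):
    """True if the cluster has at least one ref and one non-ref entry."""
    flags = [is_ref(doc) for doc in cluster.get("v", [])]
    return any(flags) and not all(flags)


def filter_clusters(lines):
    """Yield only the JSON lines whose cluster mixes refs and non-refs."""
    for line in lines:
        if has_both(json.loads(line)):
            yield line
```

Clusters that contain only refs or only releases cannot yield a citation edge, which is why they can be dropped before verification.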

Found 40257623 clusters, iteration over the 89GB compressed file takes 28min.

Raw synopsis:

```
$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
```

Some numbers:

* [ ] number of 2-clusters, where not both entries have a doi?

Verification.

* needed a different batch verifier, since we do not need pairwise comparisons:

```
$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
8390899 Status.DIFFERENT Reason.YEAR
6191622 Status.EXACT Reason.DOI
5468805 Status.STRONG Reason.JACCARD_AUTHORS
3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
 424441 Status.AMBIGUOUS Reason.UNKNOWN
 199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
 138144 Status.AMBIGUOUS Reason.SHORT_TITLE
  92054 Status.DIFFERENT Reason.PAGE_COUNT
  25122 Status.AMBIGUOUS Reason.BLACKLISTED
  22964 Status.EXACT Reason.WORK_ID
  17702 Status.STRONG Reason.VERSIONED_DOI
  16236 Status.DIFFERENT Reason.COMPONENT
  14462 Status.STRONG Reason.PREPRINT_PUBLISHED
   9632 Status.STRONG Reason.PMID_DOI_PAIR
   3429 Status.STRONG Reason.ARXIV_VERSION
   3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
    195 Status.STRONG Reason.FIGSHARE_VERSION
     76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
     74 Status.DIFFERENT Reason.TITLE_FILENAME
     43 Status.DIFFERENT Reason.NUM_DIFF
     22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
     11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
      1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
```
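
One reading of "no pairwise comparisons" is that only ref-to-release pairs matter for citation edges, so comparing refs with refs (or releases with releases) can be skipped. The sketch below illustrates that reading only; it is an assumption about the intent, not a description of what `fuzzycat verify_ref` does, and it reuses the cluster layout from the `jq` expressions in these notes:

```python
import itertools


def is_ref(doc):
    """True if a clustered document came from the refs side."""
    skate = (doc.get("extra") or {}).get("skate") or {}
    return skate.get("status") == "ref"


def ref_release_pairs(cluster):
    """Yield (ref, release) pairs instead of all pairs within a cluster."""
    docs = cluster.get("v", [])
    refs = [d for d in docs if is_ref(d)]
    releases = [d for d in docs if not is_ref(d)]
    yield from itertools.product(refs, releases)
```

For a cluster with r refs and n releases this gives r * n comparisons rather than all (r + n) choose 2 pairs.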

Guessing: Maybe 30% "strong", so maybe ~120M new edges?


----

# Manual sampling and issues

```
https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
```

Grobid output:

```xml
<biblStruct xml:id="b77">
        <analytic>
                <title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
                </author>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
                </author>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
                </author>
                <idno type="DOI">10.1080/02697459.2012.661179&gt;</idno>
                <idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
                <ptr target="&lt;http://dx.doi.org/10.1080/02697459.2012.661179&gt;" />
        </analytic>
        <monogr>
                <title level="j">En: Planning Practice and Research</title>
                <imprint>
                        <biblScope unit="volume">27</biblScope>
                        <biblScope unit="issue">1</biblScope>
                        <biblScope unit="page" from="41" to="61" />
                </imprint>
        </monogr>
</biblStruct>
```

There are dates, but no explicit, clean 2012.
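
To illustrate why this case is hard, here is a small Python sketch that pulls year-like numbers out of free text such as the untyped `idno` above. The regular expression and the name `candidate_years` are illustrative assumptions, not part of Grobid or fuzzycat:

```python
import re

# Four-digit numbers from 1500 to 2099, delimited by word boundaries.
YEAR = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")


def candidate_years(text):
    """Return all year-like numbers found in free text, in order."""
    return [int(m) for m in YEAR.findall(text)]
```

Applied to the `idno` text above, this finds both 2012 and the access date 2015, so picking the publication year would still need an extra rule.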

Another issue:

```
https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
```

Very similar titles:

"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...

* years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)

Intermediate match results:

```
141970958 Status.DIFFERENT Reason.YEAR
106734288 Status.EXACT Reason.DOI
 91205561 Status.STRONG Reason.JACCARD_AUTHORS
 66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
  7449880 Status.AMBIGUOUS Reason.UNKNOWN
  3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1199761 Status.DIFFERENT Reason.PAGE_COUNT
  1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
   395710 Status.EXACT Reason.WORK_ID
   362089 Status.DIFFERENT Reason.COMPONENT
   351654 Status.AMBIGUOUS Reason.BLACKLISTED
   326730 Status.STRONG Reason.VERSIONED_DOI
   239924 Status.STRONG Reason.PREPRINT_PUBLISHED
   171594 Status.STRONG Reason.PMID_DOI_PAIR
    54646 Status.STRONG Reason.ARXIV_VERSION
    49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5219 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1874 Status.STRONG Reason.FIGSHARE_VERSION
     1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      774 Status.DIFFERENT Reason.NUM_DIFF
      448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```

Another false negative:

* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji

```
https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
```

Both docs contain 1972?

```xml
<biblStruct xml:id="b67">
        <analytic>
                <title level="a" type="main">Variational Wavefunctions for H2 +</title>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
                </author>
        </analytic>
        <monogr>
                <title level="j">J. Chem. Phys</title>
                <imprint>
                        <biblScope unit="volume">56</biblScope>
                        <biblScope unit="page" from="3798" to="3801" />
                        <date type="published" when="1972" />
                </imprint>
        </monogr>
</biblStruct>
```

----

Running:

```
$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv
```

resulted in a 69GB tsv file and took 3056m5.322s (~51h), with 512033197 comparisons.

Stats:

```
$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
    cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
    LC_ALL=C sort -S20% | uniq -c | sort -nr

146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
 94300998 Status.STRONG Reason.JACCARD_AUTHORS
 68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
  7746937 Status.AMBIGUOUS Reason.UNKNOWN
  3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1265506 Status.DIFFERENT Reason.PAGE_COUNT
  1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
   409043 Status.EXACT Reason.WORK_ID
   374051 Status.DIFFERENT Reason.COMPONENT
   356772 Status.AMBIGUOUS Reason.BLACKLISTED
   336588 Status.STRONG Reason.VERSIONED_DOI
   249723 Status.STRONG Reason.PREPRINT_PUBLISHED
   177547 Status.STRONG Reason.PMID_DOI_PAIR
    56445 Status.STRONG Reason.ARXIV_VERSION
    51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5255 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1946 Status.STRONG Reason.FIGSHARE_VERSION
     1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      798 Status.DIFFERENT Reason.NUM_DIFF
      463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC

```

286M positive links.

```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
286008492
```

Or 175M, if we exclude DOI and work matches.

```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
175547235
```
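
The same tallies can be computed in Python, which makes the status and reason filters explicit. A minimal sketch, assuming stats lines of the form `count Status.X Reason.Y` as shown above; `sum_counts` is an illustrative helper, and unlike the `grep` patterns it matches reasons exactly rather than by substring:

```python
def sum_counts(lines, statuses=("Status.STRONG", "Status.EXACT"), skip_reasons=()):
    """Sum the counts of lines with a wanted status and a reason not skipped."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        count, status, reason = fields
        if status in statuses and reason not in skip_reasons:
            total += int(count)
    return total
```

Calling it with `skip_reasons=("Reason.DOI", "Reason.WORK_ID")` corresponds to the second command, which leaves out the identifier-based matches.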

----

The final derivation dep tree looks like:

```
 $ ./tasks.py -d BiblioRef
 \_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
    \_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
                   \_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                      \_ Input()
    \_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
             \_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
          \_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
             \_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ Input()
```