# Version 1
Includes:
* doi, pmid, pmcid, arxiv
* title-lower exact matches
The title join yields 16B+ matches (16761492658), since many rows carry generic
titles, e.g. "introduction". The result is 180G compressed; one pass over it takes about 53 min.
```
$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
<(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
```
Filter and sample with `awk`, e.g. via:
```
$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
```
Need to pre-filter before join, to keep join smaller.
Basic inspection of the "exact lower title" set.
* 16B+ candidates
* as the join keys are already sorted, we can run uniq
```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst
real 92m28.442s
user 142m49.627s
sys 46m9.473s
```
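The counting step relies on the join output already being sorted by key, so equal titles sit next to each other and can be counted in one streaming pass. A minimal Python sketch of the same idea follows; it assumes tab-separated rows with the title in the first column, and `count_sorted_keys` is a hypothetical helper name, not part of the pipeline.

```python
import itertools
from typing import Iterable, Iterator, Tuple


def count_sorted_keys(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (count, key) per run of equal first columns, like `cut -f 1 | uniq -c`.

    Assumes the input is already sorted (or at least grouped) by the first
    tab-separated column; unsorted input yields one entry per run, not per key.
    """
    keys = (line.rstrip("\n").split("\t", 1)[0] for line in lines)
    for key, group in itertools.groupby(keys):
        yield sum(1 for _ in group), key


# Small usage example with made-up rows.
sample = ["introduction\ta\tb\n", "introduction\tc\td\n", "preface\te\tf\n"]
for count, key in count_sorted_keys(sample):
    print(f"{count}\t{key}")
```

Like `uniq -c`, this keeps only one run in memory at a time, which matters at 16B+ rows.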
Some manual sampling:
Different release, but same references (585):
* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references
There are duplicates in the join, need to filter them out.
```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
```
Left with about 13B unique rows.
OCI, example:
* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
* OCI: 646 citations
We have 356 via DOI and PMID, about 112 via title, 468 total; which ones do we miss?
However, we do have all but one of the OCI DOIs in fatcat:
```
$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
```
Example, DOI not in OCI:
* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30
Possible mitigations:
* ignore common titles
* ignore numbers only
Examples: `42` appears 3816 times
Harder cases:
* "41st annual meeting" - too generic, and wrong
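The two mitigations could be applied as a pre-filter on titles before the join. The following is a rough sketch only: `COMMON_TITLES` is a hypothetical, hand-picked blocklist (a real one would be derived from the title counts), and `is_generic_title` is an illustrative name.

```python
import re

# Hypothetical blocklist; in practice this would come from the title counts.
COMMON_TITLES = frozenset({"introduction", "preface", "science", "book reviews"})

# Titles consisting only of digits, whitespace and simple punctuation.
NUMBERS_ONLY = re.compile(r"^[\d\s.,\-]+$")


def is_generic_title(title: str, common=COMMON_TITLES) -> bool:
    """Return True if a lowercased title should be dropped before the join."""
    t = title.strip().lower()
    if not t:
        return True
    if t in common:
        return True
    return bool(NUMBERS_ONLY.match(t))


for candidate in ["42", "Introduction", "41st annual meeting"]:
    print(candidate, is_generic_title(candidate))
```

Note that a harder case such as "41st annual meeting" passes both checks, so it would need a different rule (or an explicit blocklist entry).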
Generic DOI lookup from OCI in fatcat:
```
$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
{"doi":"10.1530/erc-16-0228","status":200}
{"doi":"10.1371/journal.pone.0080023","status":200}
{"doi":"10.1074/jbc.m114.566141","status":200}
...
```
Overall:
* 31344136 unique titles
most common join title:
* 11,939,631,644 introduction
* also: "science", "preface", "book reviews", ..., "cell", ...
Filtering:
```
$ zstdcat -T0 title_counts.tsv.zst | \
LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
```
About 7275 titles to filter out, e.g.
```
...
475300 abstracts of papers
20502 ac
13892 aca
7881 academic freedom
...
5047 community policing
157176 community-acquired pneumonia
68222 commutative algebra
5512 comorbidity
5516 compact stars
8865 company
...
7353 facebook
6461 facial pain
8977 facilities
5238 facing the future
5064 fact
11198 fact sheet
...
```
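For reference, the `awk` rule above can be restated in Python. This sketch assumes `uniq -c` style lines (padded count, a space, then the title) and, like the `awk` version, measures the length of the whole line including the count; the function names are illustrative.

```python
from typing import Tuple


def parse_count_line(line: str) -> Tuple[int, str, int]:
    """Split a `uniq -c` line into (count, title, length of the raw line)."""
    raw = line.rstrip("\n")
    count, _, title = raw.strip().partition(" ")
    return int(count), title, len(raw)


def should_filter(count: int, line_length: int) -> bool:
    """Mirror of: ($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)."""
    return (count > 5000 and line_length < 30) or (count > 15000 and line_length < 40)


count, title, length = parse_count_line(" 475300 abstracts of papers\n")
print(title, should_filter(count, length))
```

Short titles are dropped at a lower frequency threshold than slightly longer ones, on the assumption that short, frequent titles are the least distinctive.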
Trying fuzzycat clustering with 0.1.13, which can compress intermediate
artifacts (`-C`).
```
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
```
Using fuzzycat 0.1.13 with compression; all fine until:
```
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
-l | parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
-m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst
1.58G 6:35:39 [66.5k/s] [ <=> ]
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
real 1013m20.128s
user 2696m14.290s
sys 119m29.419s
```
A run with `--compress` and `--tmpdir` set on parallel worked:
```
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
parallel --compress --tmpdir /fast/tmp -j 4 --block 10M --roundrobin --pipe \
'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
zstd -T0 -c > cluster.ndj.zst
real 1301m26.206s
user 2778m20.635s
sys 140m32.121s
```
* 21h, finds 5850385 clusters (seems too low)
# Sample generation
Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:
* ~114M refs
* ~7M releases
Adjusted `tasks.py` to use a different sha1 and updated settings.ini with
sample file locations.
# First clustering
Key extraction (KE), sorting and clustering took 14h, when the merged dataset
is already there (it takes ~80min to convert refs to releases, plus a bit more
to concatenate the files).
```
$ ./run.sh RefsFatcatClusters
real 841m45.169s
user 2872m35.481s
sys 561m14.231s
```
Resulting file is 154G compressed.
Cluster count and sizes:
```
$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
```
Follow up tasks:
* each cluster will have ref and non-ref items
* we want at least one non-ref item
```
$ skate-cluster -both ...
```
Will keep only those clusters that contain at least one ref and one non-ref
entry.
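The selection rule can be sketched in Python as follows. This is not the `skate-cluster` implementation; it only assumes the cluster layout visible in the `jq` expressions in these notes (a key under `k`, documents under `v`, and refs marked by `.extra.skate.status == "ref"`).

```python
import json


def is_ref(doc: dict) -> bool:
    """True if the document is marked as a reference (assumed field layout)."""
    extra = doc.get("extra") or {}
    skate = extra.get("skate") or {}
    return skate.get("status") == "ref"


def has_both(cluster: dict) -> bool:
    """True if the cluster holds at least one ref and one non-ref document."""
    docs = cluster.get("v") or []
    return any(is_ref(d) for d in docs) and any(not is_ref(d) for d in docs)


# Made-up one-line cluster, as it might appear in the newline-delimited JSON file.
line = '{"k": "abc", "v": [{"ident": "r1", "extra": {"skate": {"status": "ref"}}}, {"ident": "x1"}]}'
print(has_both(json.loads(line)))
```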
Found 40257623 clusters, iteration over the 89GB compressed file takes 28min.
Raw synopsis:
```
$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
```
Some numbers:
* [ ] number of 2-clusters, where not both entries have a doi?
Verification.
* needed a different batch verifier, since we do not need pairwise comparisons
```
$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
8390899 Status.DIFFERENT Reason.YEAR
6191622 Status.EXACT Reason.DOI
5468805 Status.STRONG Reason.JACCARD_AUTHORS
3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
424441 Status.AMBIGUOUS Reason.UNKNOWN
199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
138144 Status.AMBIGUOUS Reason.SHORT_TITLE
92054 Status.DIFFERENT Reason.PAGE_COUNT
25122 Status.AMBIGUOUS Reason.BLACKLISTED
22964 Status.EXACT Reason.WORK_ID
17702 Status.STRONG Reason.VERSIONED_DOI
16236 Status.DIFFERENT Reason.COMPONENT
14462 Status.STRONG Reason.PREPRINT_PUBLISHED
9632 Status.STRONG Reason.PMID_DOI_PAIR
3429 Status.STRONG Reason.ARXIV_VERSION
3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
195 Status.STRONG Reason.FIGSHARE_VERSION
76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
74 Status.DIFFERENT Reason.TITLE_FILENAME
43 Status.DIFFERENT Reason.NUM_DIFF
22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
```
Guessing: Maybe 30% "strong", so maybe ~120M new edges?
----
# Manual sampling and issues
```
https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
```
Grobid output:
```xml
The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach
LServillo
Van Den
PBroeck
10.1080/02697459.2012.661179>
En línea] 2012 [Fecha de consulta: 21 de agosto 2015
En: Planning Practice and Research
27
1
```
There are dates in the raw string, but no explicit, clean 2012.
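To illustrate why this case is ambiguous, here is a small sketch that pulls year-like tokens out of a raw string. The regular expression and the `year_candidates` helper are assumptions for illustration, not what Grobid or fuzzycat do.

```python
import re
from typing import List

# Four-digit tokens from 1500 to 2099 that are not part of a longer digit run.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def year_candidates(raw: str) -> List[int]:
    """Return all year-like numbers found in a raw reference string, in order."""
    return [int(m) for m in YEAR.findall(raw)]


print(year_candidates("En línea] 2012 [Fecha de consulta: 21 de agosto 2015"))
# prints [2012, 2015]
```

The string yields two candidates (publication year and access date), so a year comparison based on it can go wrong unless one of them is selected deliberately.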
Another issue:
```
https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
```
Very similar titles:
"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...
* years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)
Intermediate match results:
```
141970958 Status.DIFFERENT Reason.YEAR
106734288 Status.EXACT Reason.DOI
91205561 Status.STRONG Reason.JACCARD_AUTHORS
66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
7449880 Status.AMBIGUOUS Reason.UNKNOWN
3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
1199761 Status.DIFFERENT Reason.PAGE_COUNT
1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
395710 Status.EXACT Reason.WORK_ID
362089 Status.DIFFERENT Reason.COMPONENT
351654 Status.AMBIGUOUS Reason.BLACKLISTED
326730 Status.STRONG Reason.VERSIONED_DOI
239924 Status.STRONG Reason.PREPRINT_PUBLISHED
171594 Status.STRONG Reason.PMID_DOI_PAIR
54646 Status.STRONG Reason.ARXIV_VERSION
49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
5219 Status.DIFFERENT Reason.TITLE_FILENAME
2451 Status.AMBIGUOUS Reason.APPENDIX
1874 Status.STRONG Reason.FIGSHARE_VERSION
1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
774 Status.DIFFERENT Reason.NUM_DIFF
448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```
Another false negative:
* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
```
https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
```
Both docs contain 1972?
```xml
Variational Wavefunctions for H2 +
FWeinhold
J. Chem. Phys
56
```
----
Running:
```
$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv
```
resulted in a 69GB tsv file and took 3056m5.322s (~51h), 512033197 comparisons.
Stats:
```
$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
LC_ALL=C sort -S20% | uniq -c | sort -nr
146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
94300998 Status.STRONG Reason.JACCARD_AUTHORS
68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
7746937 Status.AMBIGUOUS Reason.UNKNOWN
3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
1265506 Status.DIFFERENT Reason.PAGE_COUNT
1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
409043 Status.EXACT Reason.WORK_ID
374051 Status.DIFFERENT Reason.COMPONENT
356772 Status.AMBIGUOUS Reason.BLACKLISTED
336588 Status.STRONG Reason.VERSIONED_DOI
249723 Status.STRONG Reason.PREPRINT_PUBLISHED
177547 Status.STRONG Reason.PMID_DOI_PAIR
56445 Status.STRONG Reason.ARXIV_VERSION
51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
5255 Status.DIFFERENT Reason.TITLE_FILENAME
2451 Status.AMBIGUOUS Reason.APPENDIX
1946 Status.STRONG Reason.FIGSHARE_VERSION
1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
798 Status.DIFFERENT Reason.NUM_DIFF
463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```
286M positive links.
```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
286008492
```
Or 175M, if we exclude DOI and work matches.
```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
175547235
```
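The two sums can also be reproduced with a few lines of Python. The sketch assumes the `uniq -c` output format shown in the stats block (count, status, reason, separated by whitespace); `sum_positive` is a hypothetical helper.

```python
from typing import Iterable

POSITIVE = ("Status.STRONG", "Status.EXACT")


def sum_positive(lines: Iterable[str], exclude_reasons: Iterable[str] = ()) -> int:
    """Sum counts of STRONG and EXACT rows, optionally skipping some reasons."""
    excluded = set(exclude_reasons)
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        count, status, reason = fields
        if status in POSITIVE and reason not in excluded:
            total += int(count)
    return total


# A few rows copied from the stats above, not the full table.
rows = [
    "146095427 Status.DIFFERENT Reason.YEAR",
    "110052214 Status.EXACT Reason.DOI",
    "94300998 Status.STRONG Reason.JACCARD_AUTHORS",
    "409043 Status.EXACT Reason.WORK_ID",
]
print(sum_positive(rows))
print(sum_positive(rows, exclude_reasons=("Reason.DOI", "Reason.WORK_ID")))
```

Matching on the exact reason string avoids accidentally excluding rows such as `Reason.PMID_DOI_PAIR`, which a loose substring match on "DOI" would catch.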
----
The final derivation dep tree looks like:
```
$ ./tasks.py -d BiblioRef
\_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
```