# Version 1

Includes:

* doi, pmid, pmcid, arxiv
* title-lower exact matches

Title join yields 16B+ matches (16761492658), since we have many generic rows, e.g.
"introduction". 180G compressed, about 53 min for one pass.

```
$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
    <(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
```
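
A join emits one row per pair of matching rows, so a key that occurs n times on the left and m times on the right contributes n × m rows. That is why a handful of generic titles dominate the output. A minimal Python sketch of that arithmetic, using made-up toy keys (not taken from the actual data; the function name is illustrative):

```python
from collections import Counter


def join_output_rows(left_keys, right_keys):
    """Number of rows an inner join on the key would emit."""
    left, right = Counter(left_keys), Counter(right_keys)
    # Each shared key contributes (left occurrences) * (right occurrences) rows.
    return sum(n * right[key] for key, n in left.items() if key in right)


# Toy data: "introduction" is generic and occurs many times on both sides.
fatcat_titles = ["introduction"] * 3 + ["a very specific title"]
ref_titles = ["introduction"] * 4 + ["a very specific title", "unmatched"]

print(join_output_rows(fatcat_titles, ref_titles))  # 3 * 4 + 1 * 1 = 13
```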

Filter and sample with `awk`, e.g. via:

```
$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
```

Need to pre-filter before join, to keep join smaller.

Basic inspection of the "exact lower title" set.

* 16B+ candidates
* as the join keys are already sorted, we can run uniq

```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst

real    92m28.442s
user    142m49.627s
sys     46m9.473s
```
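
Because the keys are sorted, equal keys are adjacent and can be counted in one streaming pass with constant memory, which is what `uniq -c` does. A rough Python equivalent, assuming one key per input item (the function name and sample keys are illustrative only):

```python
from itertools import groupby


def count_adjacent(keys):
    """Yield (count, key) for each run of equal adjacent keys, like `uniq -c`."""
    for key, run in groupby(keys):
        yield sum(1 for _ in run), key


sorted_keys = ["cell", "introduction", "introduction", "introduction", "preface"]
for count, key in count_adjacent(sorted_keys):
    print(f"{count:>7} {key}")
```

As with `uniq`, unsorted input would yield several entries for the same key.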

Some manual sampling:

Different releases, but the same references (585):

* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references

There are duplicates in the join, need to filter them out.

```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
```

Left with about 13B unique rows.

OCI, example:

* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
* OCI: 646 citations

We have 356 via DOI or PMID, about 112 via title, 468 total; which ones do we miss?

However, we do have all but one of the OCI DOIs in fatcat:

```
$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
```
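
To see which citing DOIs we miss, the OCI result can be compared against the DOIs we already matched. A sketch in Python, assuming the COCI response is a JSON list of objects with a `citing` field (as the `jq` expression above implies); the sample records and the `ours` set are invented for illustration:

```python
import json


def citing_dois(oci_json):
    """Extract normalized (lowercased) citing DOIs from a COCI API response."""
    return {row["citing"].strip().lower() for row in json.loads(oci_json)}


# Invented sample response; real responses carry more fields per record.
sample = json.dumps([
    {"citing": "10.1000/example.1"},
    {"citing": "10.1000/EXAMPLE.2"},
    {"citing": "10.1000/example.3"},
])
ours = {"10.1000/example.1", "10.1000/example.3"}  # DOIs we matched (made up)

missing = sorted(citing_dois(sample) - ours)
print(missing)  # ['10.1000/example.2']
```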

Example, DOI not in OCI:

* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30

Possible mitigations:

* ignore common titles
* ignore numbers only

Examples: `42` appears 3816 times

Harder cases:

* "41st annual meeting" - too generic, and wrong
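
The two simple mitigations could be sketched as a predicate applied before the join. This is an illustrative Python filter, not the one used here; `COMMON_TITLES` is a made-up stand-in for a list derived from the title counts:

```python
COMMON_TITLES = {"introduction", "preface", "book reviews"}  # illustrative only


def keep_title(title):
    """Return False for titles that are too generic to join on."""
    t = title.strip().lower()
    if not t:
        return False
    # Numbers only, e.g. "42"; also covers forms like "1-12" or "3.1".
    if all(c.isdigit() or c in " .,-/" for c in t):
        return False
    return t not in COMMON_TITLES


titles = ["42", "Introduction", "41st annual meeting", "Variational Wavefunctions for H2 +"]
print([t for t in titles if keep_title(t)])
```

Note that a harder case such as "41st annual meeting" still passes this filter.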


Generic DOI lookup from OCI in fatcat:

```
$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
{"doi":"10.1530/erc-16-0228","status":200}
{"doi":"10.1371/journal.pone.0080023","status":200}
{"doi":"10.1074/jbc.m114.566141","status":200}
...
```

Overall:

* 31344136 unique titles

Most common join title:

* 11,939,631,644 introduction
* also: "science", "preface", "book reviews", ..., "cell", ...

Filtering:

```
$ zstdcat -T0 title_counts.tsv.zst | \
    LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
```
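
The same thresholds can be expressed in Python, for example to turn the result into a blocklist. This sketch mirrors the `awk` expression above, including the fact that `length($0)` measures the whole `uniq -c` line (count and padding included), not only the title. The function name is an assumption; the last sample line is invented:

```python
def is_too_common(line):
    """Mirror of the awk filter: high count combined with a short line."""
    line = line.rstrip("\n")
    fields = line.split()
    if not fields or not fields[0].isdigit():
        return False
    count = int(fields[0])
    # length($0) in awk is the length of the whole record, padding included.
    return (count > 5000 and len(line) < 30) or (count > 15000 and len(line) < 40)


lines = [
    " 475300 abstracts of papers",
    "   7881 academic freedom",
    " 157176 community-acquired pneumonia",
    "    120 some rare title",  # invented line, below both thresholds
]
blocklist = [line.split(None, 1)[1] for line in lines if is_too_common(line)]
print(blocklist)
```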

About 7275 titles to filter out, e.g.

```
...
 475300 abstracts of papers
  20502 ac
  13892 aca
   7881 academic freedom
...
   5047 community policing
 157176 community-acquired pneumonia
  68222 commutative algebra
   5512 comorbidity
   5516 compact stars
   8865 company
...
   7353 facebook
   6461 facial pain
   8977 facilities
   5238 facing the future
   5064 fact
  11198 fact sheet
...
```

Trying fuzzycat clustering with 0.1.13, which can compress intermediate
artifacts (`-C`).

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
    fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
```

Using fuzzycat 0.1.13 with compression; all fine until:

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
    -l | parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
    -m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst

1.58G 6:35:39 [66.5k/s] [      <=>      ]
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

real    1013m20.128s
user    2696m14.290s
sys     119m29.419s
```

A run with `--compress` and `--tmpdir` set on parallel worked:

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel --compress --tmpdir /fast/tmp -j 4 --block 10M  --roundrobin --pipe \
    'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
    zstd -T0 -c > cluster.ndj.zst

real    1301m26.206s
user    2778m20.635s
sys     140m32.121s
```

* 21h, finds 5850385 clusters (seems too low)

# Sample generation

Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:

* ~114M refs
* ~7M releases

Adjusted `tasks.py` to use a different sha1 and updated `settings.ini` with
sample file locations.

# First clustering

Key extraction (KE), sorting and clustering took 14h, when the merged dataset
is already there (it takes ~80min to convert refs to releases, plus a bit more
to concatenate the files).

```
$ ./run.sh RefsFatcatClusters

real    841m45.169s
user    2872m35.481s
sys     561m14.231s
```

Resulting file is 154G compressed.

Cluster count and sizes:

```
$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
```

Follow up tasks:

* each cluster will have ref and non-ref items
* we want at least one non-ref item

```
$ skate-cluster -both ...
```

Will keep only those clusters that contain at least one ref and one non-ref
entry.
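
The filter can be sketched in a few lines of Python. This is not the `skate-cluster` implementation; it assumes each line is a JSON document with the cluster members under `v` and refs marked by `.extra.skate.status == "ref"`, which is the shape the `jq` expressions in these notes rely on. The example clusters are invented:

```python
import json


def is_ref(doc):
    """A member counts as a ref if it carries the skate status marker."""
    skate = (doc.get("extra") or {}).get("skate") or {}
    return skate.get("status") == "ref"


def has_both(cluster):
    """True if the cluster has at least one ref and one non-ref member."""
    flags = [is_ref(doc) for doc in cluster.get("v", [])]
    return any(flags) and not all(flags)


# Invented example clusters, one JSON document per line.
lines = [
    '{"k": "a", "v": [{"ident": "x", "extra": {"skate": {"status": "ref"}}}, {"ident": "y"}]}',
    '{"k": "b", "v": [{"ident": "z"}]}',
]
kept = [c["k"] for c in map(json.loads, lines) if has_both(c)]
print(kept)  # ['a']
```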

Found 40257623 clusters; iterating over the 89GB compressed file takes 28min.

Raw synopsis:

```
$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
```

Some numbers:

* [ ] number of 2-clusters, where not both entries have a doi?

Verification.

* needed a different batch verifier, since we do not need pairwise comparisons

```
$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
8390899 Status.DIFFERENT Reason.YEAR
6191622 Status.EXACT Reason.DOI
5468805 Status.STRONG Reason.JACCARD_AUTHORS
3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
 424441 Status.AMBIGUOUS Reason.UNKNOWN
 199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
 138144 Status.AMBIGUOUS Reason.SHORT_TITLE
  92054 Status.DIFFERENT Reason.PAGE_COUNT
  25122 Status.AMBIGUOUS Reason.BLACKLISTED
  22964 Status.EXACT Reason.WORK_ID
  17702 Status.STRONG Reason.VERSIONED_DOI
  16236 Status.DIFFERENT Reason.COMPONENT
  14462 Status.STRONG Reason.PREPRINT_PUBLISHED
   9632 Status.STRONG Reason.PMID_DOI_PAIR
   3429 Status.STRONG Reason.ARXIV_VERSION
   3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
    195 Status.STRONG Reason.FIGSHARE_VERSION
     76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
     74 Status.DIFFERENT Reason.TITLE_FILENAME
     43 Status.DIFFERENT Reason.NUM_DIFF
     22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
     11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
      1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
```

Guessing: Maybe 30% "strong", so maybe ~120M new edges?


----

# Manual sampling and issues

```
https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
```

Grobid output:

```xml
<biblStruct xml:id="b77">
        <analytic>
                <title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
                </author>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
                </author>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
                </author>
                <idno type="DOI">10.1080/02697459.2012.661179&gt;</idno>
                <idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
                <ptr target="&lt;http://dx.doi.org/10.1080/02697459.2012.661179&gt;" />
        </analytic>
        <monogr>
                <title level="j">En: Planning Practice and Research</title>
                <imprint>
                        <biblScope unit="volume">27</biblScope>
                        <biblScope unit="issue">1</biblScope>
                        <biblScope unit="page" from="41" to="61" />
                </imprint>
        </monogr>
</biblStruct>
```

There are dates, but no explicit, clean 2012.
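
One conceivable fallback, sketched here in Python, is to prefer an explicit `<date when="...">` and otherwise look for year-like numbers in untyped `<idno>` text. This is a hypothetical helper, not something Grobid or fuzzycat does; picking the earliest year is an assumption that may well be wrong for other records:

```python
import re
import xml.etree.ElementTree as ET

YEAR = re.compile(r"\b(?:1[5-9]\d{2}|20\d{2})\b")


def local(tag):
    """Strip an XML namespace, if any: '{ns}date' -> 'date'."""
    return tag.rsplit("}", 1)[-1]


def guess_year(bibl_xml):
    """Prefer <date when=...>; else fall back to years in untyped <idno> text."""
    root = ET.fromstring(bibl_xml)
    for el in root.iter():
        if local(el.tag) == "date":
            m = YEAR.search(el.get("when", ""))
            if m:
                return int(m.group())
    years = []
    for el in root.iter():
        if local(el.tag) == "idno" and el.get("type") is None and el.text:
            years.extend(int(y) for y in YEAR.findall(el.text))
    # Assumption: the earliest year is the publication year; later ones are
    # often access dates ("Fecha de consulta: ... 2015").
    return min(years) if years else None


snippet = """<biblStruct>
  <analytic>
    <idno type="DOI">10.1080/02697459.2012.661179&gt;</idno>
    <idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
  </analytic>
</biblStruct>"""
print(guess_year(snippet))  # 2012
```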

Another issue:

```
https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
```

Very similar titles:

"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...

* years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)

Intermediate match results:

```
141970958 Status.DIFFERENT Reason.YEAR
106734288 Status.EXACT Reason.DOI
 91205561 Status.STRONG Reason.JACCARD_AUTHORS
 66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
  7449880 Status.AMBIGUOUS Reason.UNKNOWN
  3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1199761 Status.DIFFERENT Reason.PAGE_COUNT
  1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
   395710 Status.EXACT Reason.WORK_ID
   362089 Status.DIFFERENT Reason.COMPONENT
   351654 Status.AMBIGUOUS Reason.BLACKLISTED
   326730 Status.STRONG Reason.VERSIONED_DOI
   239924 Status.STRONG Reason.PREPRINT_PUBLISHED
   171594 Status.STRONG Reason.PMID_DOI_PAIR
    54646 Status.STRONG Reason.ARXIV_VERSION
    49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5219 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1874 Status.STRONG Reason.FIGSHARE_VERSION
     1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      774 Status.DIFFERENT Reason.NUM_DIFF
      448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```

Another false negative:

* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji

```
https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
```

Both docs contain 1972?

```xml
<biblStruct xml:id="b67">
        <analytic>
                <title level="a" type="main">Variational Wavefunctions for H2 +</title>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
                </author>
        </analytic>
        <monogr>
                <title level="j">J. Chem. Phys</title>
                <imprint>
                        <biblScope unit="volume">56</biblScope>
                        <biblScope unit="page" from="3798" to="3801" />
                        <date type="published" when="1972" />
                </imprint>
        </monogr>
</biblStruct>
```

----

Running:

```
$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv
```

resulted in a 69GB tsv file and took 3056m5.322s (~51h), 512033197 comparisons.

Stats:

```
$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
    cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
    LC_ALL=C sort -S20% | uniq -c | sort -nr

146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
 94300998 Status.STRONG Reason.JACCARD_AUTHORS
 68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
  7746937 Status.AMBIGUOUS Reason.UNKNOWN
  3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1265506 Status.DIFFERENT Reason.PAGE_COUNT
  1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
   409043 Status.EXACT Reason.WORK_ID
   374051 Status.DIFFERENT Reason.COMPONENT
   356772 Status.AMBIGUOUS Reason.BLACKLISTED
   336588 Status.STRONG Reason.VERSIONED_DOI
   249723 Status.STRONG Reason.PREPRINT_PUBLISHED
   177547 Status.STRONG Reason.PMID_DOI_PAIR
    56445 Status.STRONG Reason.ARXIV_VERSION
    51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5255 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1946 Status.STRONG Reason.FIGSHARE_VERSION
     1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      798 Status.DIFFERENT Reason.NUM_DIFF
      463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC

```

286M positive links.

```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
286008492
```

Or 175M, if we exclude DOI and work matches.

```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
175547235
```

----

The final derivation dep tree looks like:

```
 $ ./tasks.py -d BiblioRef
 \_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
    \_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
                   \_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                      \_ Input()
    \_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
             \_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
          \_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
             \_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ Input()
```