1 files changed, 491 insertions, 0 deletions
diff --git a/python/notes/version_1.md b/python/notes/version_1.md
new file mode 100644
index 0000000..50a38cc
--- /dev/null
+++ b/python/notes/version_1.md
@@ -0,0 +1,491 @@
+# Version 1
+
+Includes:
+
+* doi, pmid, pmcid, arxiv
+* title-lower exact matches
+
+The title join yields 16B+ matches (16761492658), since many rows carry generic
+titles, e.g. "introduction". The result is 180G compressed; one pass takes about 53 min.
+
+```
+$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
+    <(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
+```
+
+Filter and sample with `awk`, e.g. via:
+
+```
+$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
+```
+
+We need to pre-filter before the join to keep the join output smaller.
+
+Basic inspection of the "exact lower title" set.
+
+* 16B+ candidates
+* as the join keys are already sorted, we can run uniq
+
+```
+$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst
+
+real    92m28.442s
+user    142m49.627s
+sys     46m9.473s
+```
+
+Some manual sampling:
+
+Different releases, but same references (585):
+
+* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
+* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references
+
+There are duplicates in the join, need to filter them out.
+
+```
+$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
+```
+
+This leaves about 13B unique rows.
+
+OCI, example:
+
+* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
+* OCI: 646 citations
+
+We have 356 via doi and pmid, about 112 via title, 468 total; which ones do we miss?
+
+However, we do have all but one of the OCI DOIs in fatcat:
+
+```
+$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
+```
+
+Example, DOI not in OCI:
+
+* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30
+
+Possible mitigations:
+
+* ignore common titles
+* ignore numbers only
+
+Example: `42` appears 3816 times
+
+Harder cases:
+
+* "41st annual meeting" - too generic, and wrong
+
+
+Generic DOI lookup from OCI in fatcat:
+
+```
+$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
+{"doi":"10.1530/erc-16-0228","status":200}
+{"doi":"10.1371/journal.pone.0080023","status":200}
+{"doi":"10.1074/jbc.m114.566141","status":200}
+...
+```
+
+Overall:
+
+* 31344136 unique titles
+
+Most common join title:
+
+* 11,939,631,644 introduction
+* also: "science", "preface", "book reviews", ..., "cell", ...
+
+Filtering:
+
+```
+$ zstdcat -T0 title_counts.tsv.zst | \
+    LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
+```
+
+About 7275 titles to filter out, e.g.
+
+```
+...
+ 475300 abstracts of papers
+  20502 ac
+  13892 aca
+   7881 academic freedom
+...
+   5047 community policing
+ 157176 community-acquired pneumonia
+  68222 commutative algebra
+   5512 comorbidity
+   5516 compact stars
+   8865 company
+...
+   7353 facebook
+   6461 facial pain
+   8977 facilities
+   5238 facing the future
+   5064 fact
+  11198 fact sheet
+...
+```
+
+Trying fuzzycat clustering with 0.1.13, which can compress intermediate
+artifacts (`-C`).
+
+```
+$ time zstdcat \
+    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
+    fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
+```
+
+Using fuzzycat 0.1.13 with compression; all fine until:
+
+```
+$ time zstdcat \
+    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
+    -l | parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
+    -m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst
+
+1.58G 6:35:39 [66.5k/s] [ <=> ]
+parallel: Error: Output is incomplete.
+parallel: Error: Cannot append to buffer file in /tmp.
+parallel: Error: Is the disk full?
+parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
+
+real    1013m20.128s
+user    2696m14.290s
+sys     119m29.419s
+```
+
+A run with `--compress` and `--tmpdir` set on parallel worked:
+
+```
+$ time zstdcat \
+    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    parallel --compress --tmpdir /fast/tmp -j 4 --block 10M  --roundrobin --pipe \
+    'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
+    zstd -T0 -c > cluster.ndj.zst
+
+real    1301m26.206s
+user    2778m20.635s
+sys     140m32.121s
+```
+
+* 21h, finds 5850385 clusters (seems too low)
+
+# Sample generation
+
+Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:
+
+* ~114M refs
+* ~7M releases
+
+Adjusted `tasks.py` to use a different sha1 and updated settings.ini with
+sample file locations.
+
+# First clustering
+
+Key extraction (KE), sorting and clustering took 14h, when the merged dataset
+is already there (it takes ~80min to convert refs to releases, plus a bit more
+to concatenate the files).
+
+```
+$ ./run.sh RefsFatcatClusters
+
+real    841m45.169s
+user    2872m35.481s
+sys     561m14.231s
+```
+
+Resulting file is 154G compressed.
+
+Cluster count and sizes:
+
+```
+$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
+```
+
+Follow up tasks:
+
+* each cluster will have ref and non-ref items
+* we want at least one non-ref item
+
+```
+$ skate-cluster -both ...
+```
+
+Will keep only those clusters that contain at least one ref and one non-ref
+entry.
+
+Found 40257623 clusters, iteration over the 89GB compressed file takes 28min.
+
+Raw synopsis:
+
+```
+$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
+```
+
+Some numbers:
+
+* [ ] number of 2-clusters, where not both entries have a doi?
+
+Verification.
+
+* needed a different batch verifier, since we do not need pairwise comparisons
+
+```
+$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
+8390899 Status.DIFFERENT Reason.YEAR
+6191622 Status.EXACT Reason.DOI
+5468805 Status.STRONG Reason.JACCARD_AUTHORS
+3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
+ 424441 Status.AMBIGUOUS Reason.UNKNOWN
+ 199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+ 138144 Status.AMBIGUOUS Reason.SHORT_TITLE
+  92054 Status.DIFFERENT Reason.PAGE_COUNT
+  25122 Status.AMBIGUOUS Reason.BLACKLISTED
+  22964 Status.EXACT Reason.WORK_ID
+  17702 Status.STRONG Reason.VERSIONED_DOI
+  16236 Status.DIFFERENT Reason.COMPONENT
+  14462 Status.STRONG Reason.PREPRINT_PUBLISHED
+   9632 Status.STRONG Reason.PMID_DOI_PAIR
+   3429 Status.STRONG Reason.ARXIV_VERSION
+   3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+    729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+    195 Status.STRONG Reason.FIGSHARE_VERSION
+     76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+     74 Status.DIFFERENT Reason.TITLE_FILENAME
+     43 Status.DIFFERENT Reason.NUM_DIFF
+     22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+     11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+      1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+```
+
+Guessing: Maybe 30% "strong", so maybe ~120M new edges?
+
+
+----
+
+# Manual sampling and issues
+
+```
+https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
+```
+
+Grobid output:
+
+```xml
+<biblStruct xml:id="b77">
+        <analytic>
+                <title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
+                <author>
+                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
+                </author>
+                <author>
+                        <persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
+                </author>
+                <author>
+                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
+                </author>
+                <idno type="DOI">10.1080/02697459.2012.661179&gt;</idno>
+                <idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
+                <ptr target="&lt;http://dx.doi.org/10.1080/02697459.2012.661179&gt;" />
+        </analytic>
+        <monogr>
+                <title level="j">En: Planning Practice and Research</title>
+                <imprint>
+                        <biblScope unit="volume">27</biblScope>
+                        <biblScope unit="issue">1</biblScope>
+                        <biblScope unit="page" from="41" to="61" />
+                </imprint>
+        </monogr>
+</biblStruct>
+```
+
+There are dates in the record, but no explicit, clean 2012.
+
+Another issue:
+
+```
+https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+```
+
+Very similar titles:
+
+"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...
+
+* years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)
+
+Intermediate match results:
+
+```
+141970958 Status.DIFFERENT Reason.YEAR
+106734288 Status.EXACT Reason.DOI
+ 91205561 Status.STRONG Reason.JACCARD_AUTHORS
+ 66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+ 53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+ 20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
+  7449880 Status.AMBIGUOUS Reason.UNKNOWN
+  3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+  1199761 Status.DIFFERENT Reason.PAGE_COUNT
+  1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
+   395710 Status.EXACT Reason.WORK_ID
+   362089 Status.DIFFERENT Reason.COMPONENT
+   351654 Status.AMBIGUOUS Reason.BLACKLISTED
+   326730 Status.STRONG Reason.VERSIONED_DOI
+   239924 Status.STRONG Reason.PREPRINT_PUBLISHED
+   171594 Status.STRONG Reason.PMID_DOI_PAIR
+    54646 Status.STRONG Reason.ARXIV_VERSION
+    49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+    17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+     5219 Status.DIFFERENT Reason.TITLE_FILENAME
+     2451 Status.AMBIGUOUS Reason.APPENDIX
+     1874 Status.STRONG Reason.FIGSHARE_VERSION
+     1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+      774 Status.DIFFERENT Reason.NUM_DIFF
+      448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+      123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+       17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+       17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
+        6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
+```
+
+Another false negative:
+
+* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
+* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
+
+```
+https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
+```
+
+Both docs contain 1972?
+
+```xml
+<biblStruct xml:id="b67">
+        <analytic>
+                <title level="a" type="main">Variational Wavefunctions for H2 +</title>
+                <author>
+                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
+                </author>
+        </analytic>
+        <monogr>
+                <title level="j">J. Chem. Phys</title>
+                <imprint>
+                        <biblScope unit="volume">56</biblScope>
+                        <biblScope unit="page" from="3798" to="3801" />
+                        <date type="published" when="1972" />
+                </imprint>
+        </monogr>
+</biblStruct>
+```
+
+----
+
+Running:
+
+```
+$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > \
+    cluster_ref_verify.tsv
+```
+
+This resulted in a 69GB TSV file with 512033197 comparisons and took 3056m5.322s (~51h).
+
+Stats:
+
+```
+$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
+    cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
+    LC_ALL=C sort -S20% | uniq -c | sort -nr
+
+146095427 Status.DIFFERENT Reason.YEAR
+110052214 Status.EXACT Reason.DOI
+ 94300998 Status.STRONG Reason.JACCARD_AUTHORS
+ 68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+ 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+ 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
+  7746937 Status.AMBIGUOUS Reason.UNKNOWN
+  3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+  1265506 Status.DIFFERENT Reason.PAGE_COUNT
+  1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
+   409043 Status.EXACT Reason.WORK_ID
+   374051 Status.DIFFERENT Reason.COMPONENT
+   356772 Status.AMBIGUOUS Reason.BLACKLISTED
+   336588 Status.STRONG Reason.VERSIONED_DOI
+   249723 Status.STRONG Reason.PREPRINT_PUBLISHED
+   177547 Status.STRONG Reason.PMID_DOI_PAIR
+    56445 Status.STRONG Reason.ARXIV_VERSION
+    51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+    17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+     5255 Status.DIFFERENT Reason.TITLE_FILENAME
+     2451 Status.AMBIGUOUS Reason.APPENDIX
+     1946 Status.STRONG Reason.FIGSHARE_VERSION
+     1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+      798 Status.DIFFERENT Reason.NUM_DIFF
+      463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+      125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+       18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+       18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
+        7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
+
+```
+
+286M positive links.
+
+```
+$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
+286008492
+```
+
+Or 175M, if we exclude DOI and work matches.
+
+```
+$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
+175547235
+```
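+
+As a cross-check, both totals can be reproduced by hand from the per-reason
+counts in the stats listing. The symbols, the STRONG/EXACT subtotals and the
+shares of the 512033197 comparisons are derived here for illustration and are
+not part of the original output:
+
+```latex
+\begin{aligned}
+N_{\text{STRONG}} &= 171920522, \qquad N_{\text{EXACT}} = 114087970 \\
+N_{\text{positive}} &= N_{\text{STRONG}} + N_{\text{EXACT}} = 286008492 \\
+N_{\text{positive}} - N_{\text{DOI}} - N_{\text{WORK\_ID}}
+  &= 286008492 - 110052214 - 409043 = 175547235 \\
+\frac{N_{\text{positive}}}{N_{\text{total}}}
+  &= \frac{286008492}{512033197} \approx 0.559, \qquad
+\frac{175547235}{512033197} \approx 0.343
+\end{aligned}
+```
+
+So roughly 56% of all comparisons are positive links, or about 34% once DOI
+and work matches are excluded.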
+
+----
+
+The final derivation dep tree looks like:
+
+```
+ $ ./tasks.py -d BiblioRef
+ \_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+    \_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+       \_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+          \_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+             \_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                   \_ ReleaseExportExpanded()
+                   \_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                      \_ Input()
+    \_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+       \_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+          \_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+             \_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ Input()
+             \_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ ReleaseExportExpanded()
+          \_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+             \_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ ReleaseExportExpanded()
+             \_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ Input()
+          \_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+             \_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ ReleaseExportExpanded()
+             \_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ Input()
+          \_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+             \_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                   \_ ReleaseExportExpanded()
+             \_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                   \_ Input()
+```