Diffstat (limited to 'python/notes/version_1.md')
-rw-r--r-- python/notes/version_1.md 491
1 files changed, 491 insertions, 0 deletions
diff --git a/python/notes/version_1.md b/python/notes/version_1.md
new file mode 100644
index 0000000..50a38cc
--- /dev/null
+++ b/python/notes/version_1.md
@@ -0,0 +1,491 @@
+# Version 1
+
+Includes:
+
+* doi, pmid, pmcid, arxiv
+* title-lower exact matches
+
+Title join yields 16B+ matches (16761492658), since we have many generic rows, e.g.
+"introduction". 180G compressed, about 53 min for one pass.
+
+```
+$ LC_ALL=C time join -t $'\t' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
+ <(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
+```
+
+Filter and sample with `awk`, e.g. via:
+
+```
+$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
+```
+
+Need to pre-filter before join, to keep join smaller.
+
+Basic inspection of the "exact lower title" set.
+
+* 16B+ candidates
+* as the join keys are already sorted, we can run uniq
+
+```
+$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst
+
+real 92m28.442s
+user 142m49.627s
+sys 46m9.473s
+```
+
+Some manual sampling:
+
+Different releases, but the same references (585):
+
+* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
+* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references
+
+There are duplicates in the join, need to filter them out.
+
+```
+$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
+```
+
+Left with about 13B unique rows.
+
+OCI, example:
+
+* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
+* OCI: 646 citations
+
+We have 356 via doi and pmid, about 112 via title, 468 total; which ones do we miss?
+
+However, we do have all but one of the OCI DOIs in fatcat:
+
+```
+$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
+```
+
+Example, DOI not in OCI:
+
+* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30
+
+Possible mitigations:
+
+* ignore common titles
+* ignore numbers only
+
+Example: `42` appears 3816 times
+
+Harder cases:
+
+* "41st annual meeting" - too generic, and wrong
+
+
+Generic DOI lookup from OCI in fatcat:
+
+```
+$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
+{"doi":"10.1530/erc-16-0228","status":200}
+{"doi":"10.1371/journal.pone.0080023","status":200}
+{"doi":"10.1074/jbc.m114.566141","status":200}
+...
+```
+
+Overall:
+
+* 31344136 unique titles
+
+Most common join titles:
+
+* 11,939,631,644 introduction
+* also: "science", "preface", "book reviews", ..., "cell", ...
+
+Filtering:
+
+```
+$ zstdcat -T0 title_counts.tsv.zst | \
+ LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
+```
+
+About 7275 titles to filter out, e.g.
+
+```
+...
+ 475300 abstracts of papers
+ 20502 ac
+ 13892 aca
+ 7881 academic freedom
+...
+ 5047 community policing
+ 157176 community-acquired pneumonia
+ 68222 commutative algebra
+ 5512 comorbidity
+ 5516 compact stars
+ 8865 company
+...
+ 7353 facebook
+ 6461 facial pain
+ 8977 facilities
+ 5238 facing the future
+ 5064 fact
+ 11198 fact sheet
+...
+```
+
+Trying fuzzycat clustering with 0.1.13, which can compress intermediate
+artifacts (`-C`).
+
+```
+$ time zstdcat \
+ RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+ parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
+ fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
+```
+
+Using fuzzycat 0.1.13 with compression; all fine until:
+
+```
+$ time zstdcat \
+ RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
+ -l | parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
+ -m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst
+
+1.58G 6:35:39 [66.5k/s] [ <=> ]
+parallel: Error: Output is incomplete.
+parallel: Error: Cannot append to buffer file in /tmp.
+parallel: Error: Is the disk full?
+parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
+
+real 1013m20.128s
+user 2696m14.290s
+sys 119m29.419s
+```
+
+A run with `--compress` and `--tmpdir` set on parallel worked:
+
+```
+$ time zstdcat \
+ RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+ parallel --compress --tmpdir /fast/tmp -j 4 --block 10M --roundrobin --pipe \
+ 'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
+ zstd -T0 -c > cluster.ndj.zst
+
+real 1301m26.206s
+user 2778m20.635s
+sys 140m32.121s
+```
+
+* 21h, finds 5850385 clusters (seems too low)
+
+# Sample generation
+
+Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:
+
+* ~114M refs
+* ~7M releases
+
+Adjusted `tasks.py` to use a different sha1 and updated settings.ini with
+sample file locations.
+
+# First clustering
+
+Key extraction (KE), sorting and clustering took 14h once the merged dataset
+already exists (it takes ~80min to convert refs to releases, plus a bit more
+to concatenate the files).
+
+```
+$ ./run.sh RefsFatcatClusters
+
+real 841m45.169s
+user 2872m35.481s
+sys 561m14.231s
+```
+
+Resulting file is 154G compressed.
+
+Cluster count and sizes:
+
+```
+$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+ LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
+```
+
+Follow-up tasks:
+
+* each cluster will have ref and non-ref items
+* we want at least one non-ref item
+
+```
+$ skate-cluster -both ...
+```
+
+Will keep only those clusters that contain at least one ref and one non-ref
+entry.
+
+Found 40257623 clusters; iterating over the 89GB compressed file takes 28min.
+
+Raw synopsis:
+
+```
+$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+ jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
+```
+
+Some numbers:
+
+* [ ] number of 2-clusters, where not both entries have a doi?
+
+Verification.
+
+* needed a different batch verifier, since we do not need pairwise comparisons
+
+```
+$ cut -d $'\t' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
+8390899 Status.DIFFERENT Reason.YEAR
+6191622 Status.EXACT Reason.DOI
+5468805 Status.STRONG Reason.JACCARD_AUTHORS
+3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
+ 424441 Status.AMBIGUOUS Reason.UNKNOWN
+ 199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+ 138144 Status.AMBIGUOUS Reason.SHORT_TITLE
+ 92054 Status.DIFFERENT Reason.PAGE_COUNT
+ 25122 Status.AMBIGUOUS Reason.BLACKLISTED
+ 22964 Status.EXACT Reason.WORK_ID
+ 17702 Status.STRONG Reason.VERSIONED_DOI
+ 16236 Status.DIFFERENT Reason.COMPONENT
+ 14462 Status.STRONG Reason.PREPRINT_PUBLISHED
+ 9632 Status.STRONG Reason.PMID_DOI_PAIR
+ 3429 Status.STRONG Reason.ARXIV_VERSION
+ 3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+ 729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+ 195 Status.STRONG Reason.FIGSHARE_VERSION
+ 76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+ 74 Status.DIFFERENT Reason.TITLE_FILENAME
+ 43 Status.DIFFERENT Reason.NUM_DIFF
+ 22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+ 11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+ 1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+```
+
+Guessing: Maybe 30% "strong", so maybe ~120M new edges?
+
+
+----
+
+# Manual sampling and issues
+
+```
+https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
+```
+
+Grobid output:
+
+```xml
+<biblStruct xml:id="b77">
+ <analytic>
+ <title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
+ <author>
+ <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
+ </author>
+ <author>
+ <persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
+ </author>
+ <author>
+ <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
+ </author>
+ <idno type="DOI">10.1080/02697459.2012.661179&gt;</idno>
+ <idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
+ <ptr target="&lt;http://dx.doi.org/10.1080/02697459.2012.661179&gt;" />
+ </analytic>
+ <monogr>
+ <title level="j">En: Planning Practice and Research</title>
+ <imprint>
+ <biblScope unit="volume">27</biblScope>
+ <biblScope unit="issue">1</biblScope>
+ <biblScope unit="page" from="41" to="61" />
+ </imprint>
+ </monogr>
+</biblStruct>
+```
+
+There are dates, but no explicit, clean 2012.
+
+Another issue:
+
+```
+https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+```
+
+Very similar titles:
+
+"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...
+
+* years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)
+
+Intermediate match results:
+
+```
+141970958 Status.DIFFERENT Reason.YEAR
+106734288 Status.EXACT Reason.DOI
+ 91205561 Status.STRONG Reason.JACCARD_AUTHORS
+ 66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+ 53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+ 20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
+ 7449880 Status.AMBIGUOUS Reason.UNKNOWN
+ 3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+ 1199761 Status.DIFFERENT Reason.PAGE_COUNT
+ 1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
+ 395710 Status.EXACT Reason.WORK_ID
+ 362089 Status.DIFFERENT Reason.COMPONENT
+ 351654 Status.AMBIGUOUS Reason.BLACKLISTED
+ 326730 Status.STRONG Reason.VERSIONED_DOI
+ 239924 Status.STRONG Reason.PREPRINT_PUBLISHED
+ 171594 Status.STRONG Reason.PMID_DOI_PAIR
+ 54646 Status.STRONG Reason.ARXIV_VERSION
+ 49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+ 17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+ 5219 Status.DIFFERENT Reason.TITLE_FILENAME
+ 2451 Status.AMBIGUOUS Reason.APPENDIX
+ 1874 Status.STRONG Reason.FIGSHARE_VERSION
+ 1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+ 774 Status.DIFFERENT Reason.NUM_DIFF
+ 448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+ 123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+ 17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+ 17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
+ 6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
+```
+
+Another false negative:
+
+* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
+* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
+
+```
+https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
+```
+
+Both docs contain 1972?
+
+```xml
+<biblStruct xml:id="b67">
+ <analytic>
+ <title level="a" type="main">Variational Wavefunctions for H2 +</title>
+ <author>
+ <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
+ </author>
+ </analytic>
+ <monogr>
+ <title level="j">J. Chem. Phys</title>
+ <imprint>
+ <biblScope unit="volume">56</biblScope>
+ <biblScope unit="page" from="3798" to="3801" />
+ <date type="published" when="1972" />
+ </imprint>
+ </monogr>
+</biblStruct>
+```
+
+----
+
+Running:
+
+```
+$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv
+```
+
+This resulted in a 69GB TSV file with 512033197 comparisons and took 3056m5.322s (~51h).
+
+Stats:
+
+```
+$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
+ cluster_ref_verify_2021_02_16.tsv.zst | cut -d $'\t' -f 3-4 | TMPDIR=/bigger/tmp \
+ LC_ALL=C sort -S20% | uniq -c | sort -nr
+
+146095427 Status.DIFFERENT Reason.YEAR
+110052214 Status.EXACT Reason.DOI
+ 94300998 Status.STRONG Reason.JACCARD_AUTHORS
+ 68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+ 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+ 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
+ 7746937 Status.AMBIGUOUS Reason.UNKNOWN
+ 3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+ 1265506 Status.DIFFERENT Reason.PAGE_COUNT
+ 1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
+ 409043 Status.EXACT Reason.WORK_ID
+ 374051 Status.DIFFERENT Reason.COMPONENT
+ 356772 Status.AMBIGUOUS Reason.BLACKLISTED
+ 336588 Status.STRONG Reason.VERSIONED_DOI
+ 249723 Status.STRONG Reason.PREPRINT_PUBLISHED
+ 177547 Status.STRONG Reason.PMID_DOI_PAIR
+ 56445 Status.STRONG Reason.ARXIV_VERSION
+ 51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+ 17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+ 5255 Status.DIFFERENT Reason.TITLE_FILENAME
+ 2451 Status.AMBIGUOUS Reason.APPENDIX
+ 1946 Status.STRONG Reason.FIGSHARE_VERSION
+ 1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+ 798 Status.DIFFERENT Reason.NUM_DIFF
+ 463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+ 125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+ 18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+ 18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
+ 7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
+
+```
+
+286M positive links.
+
+```
+$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
+286008492
+```
+
+Or 175M, if we exclude DOI and work matches.
+
+```
+$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
+175547235
+```
+
+----
+
+The final derivation dependency tree looks like this:
+
+```
+ $ ./tasks.py -d BiblioRef
+ \_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ ReleaseExportExpanded()
+ \_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ Input()
+ \_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ Input()
+ \_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ ReleaseExportExpanded()
+ \_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ ReleaseExportExpanded()
+ \_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ Input()
+ \_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ ReleaseExportExpanded()
+ \_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ Input()
+ \_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ ReleaseExportExpanded()
+ \_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+ \_ Input()
+```