# Version 1

Includes:

* doi, pmid, pmcid, arxiv
* title-lower exact matches

Title join yields 16B+ matches (16761492658), since we have many generic rows,
e.g. "introduction". 180G compressed, about 53 min for one pass.

```
$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
    <(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
```

Filter and sample with `awk`, e.g. via:

```
$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
```

Need to pre-filter before join, to keep join smaller.

Basic inspection of the "exact lower title" set.

* 16B+ candidates
* as the join keys are already sorted, we can run uniq

```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst

real    92m28.442s
user    142m49.627s
sys     46m9.473s
```

Some manual sampling:

Different release, but same references (585):

* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references

There are duplicates in the join, need to filter them out.

```
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
```

Left with about 13B uniq.

OCI, example:

* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
* OCI: 646 citations

We have 356 via doi, pmid, about 112 via title, 468 total; which ones do we miss?

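One way to find the missing citations is a plain set difference over normalized DOIs. A minimal sketch, assuming two hypothetical inputs (the citing DOIs from the OCI response and the citing DOIs matched on our side); the helper names are made up for illustration and the values are toy data, not taken from the actual response.

```python
def normalize_doi(doi):
    """Lowercase and strip whitespace, so DOIs compare case-insensitively."""
    return doi.strip().lower()


def missing_dois(oci_citing, matched_citing):
    """Return the sorted DOIs that OCI lists as citing but we did not match."""
    ours = {normalize_doi(d) for d in matched_citing}
    return sorted({normalize_doi(d) for d in oci_citing} - ours)


# Toy values only (assumption), to show the call shape.
oci = ["10.1000/A", "10.1000/b", "10.1000/c"]
ours = ["10.1000/a", "10.1000/C "]
print(missing_dois(oci, ours))
```

The same comparison run in the other direction would list citations we have that OCI lacks.
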
However, we do have all but one of the OCI DOIs in fatcat:

```
$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
```

Example, DOI not in OCI:

* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30

Possible mitigations:

* ignore common titles
* ignore numbers only

Examples: `42` appears 3816 times

Harder cases:

* "41st annual meeting" - too generic, and wrong

Generic DOI lookup from OCI in fatcat:

```
$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
{"doi":"10.1530/erc-16-0228","status":200}
{"doi":"10.1371/journal.pone.0080023","status":200}
{"doi":"10.1074/jbc.m114.566141","status":200}
...
```

Overall:

* 31344136 unique titles

Most common join title:

* 11,939,631,644 introduction
* also: "science", "preface", "book reviews", ..., "cell", ...

Filtering:

```
$ zstdcat -T0 title_counts.tsv.zst | \
    LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
```

About 7275 titles to filter out, e.g.

```
...
 475300 abstracts of papers
  20502 ac
  13892 aca
   7881 academic freedom
...
   5047 community policing
 157176 community-acquired pneumonia
  68222 commutative algebra
   5512 comorbidity
   5516 compact stars
   8865 company
...
   7353 facebook
   6461 facial pain
   8977 facilities
   5238 facing the future
   5064 fact
  11198 fact sheet
...
```

Trying fuzzycat clustering, with 0.1.13, which allows compressing intermediate artifacts (`-C`).

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
    fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
```

Using fuzzycat 0.1.13 with compression; all fine until:

```
$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
    -l | parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
    -m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst
1.58G 6:35:39 [66.5k/s] [ <=> ]
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

real    1013m20.128s
user    2696m14.290s
sys     119m29.419s
```

A run with `--compress` and `--tmpdir` set on parallel worked:

```
$ time zstdcat RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst |
    parallel --compress --tmpdir /fast/tmp -j 4 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' |
    zstd -T0 -c > cluster.ndj.zst

real    1301m26.206s
user    2778m20.635s
sys     140m32.121s
```

* 21h, finds 5850385 clusters (seems too low)

# Sample generation

Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:

* ~114M refs
* ~7M releases

Adjusted `tasks.py` to use a different sha1 and updated settings.ini with sample file locations.

# First clustering

Key extraction (KE), sorting and clustering took 14h, when the merged dataset
is already there (it takes ~80min to convert refs to releases, plus a bit more
to concatenate the files).

```
$ ./run.sh RefsFatcatClusters

real    841m45.169s
user    2872m35.481s
sys     561m14.231s
```

Resulting file is 154G compressed.

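The key extraction, sort and group steps can be illustrated on a toy scale. A rough sketch, assuming a simplified title key (lowercase, alphanumerics only) rather than fuzzycat's actual `tsandcrawler` key, and assuming each document is a dict with a `title` field. The names `title_key` and `cluster_by_key` are illustrative, and dropping singleton clusters is an assumption as well.

```python
import itertools
import re


def title_key(doc):
    """Simplified key (assumption): lowercase title, only a-z and 0-9 kept."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def cluster_by_key(docs, key=title_key):
    """Sort docs by key, then group adjacent docs that share a key."""
    keyed = sorted(((key(d), d) for d in docs), key=lambda kv: kv[0])
    for k, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        members = [d for _, d in group]
        # Skip empty keys and singletons; only real clusters are emitted.
        if k and len(members) > 1:
            yield {"k": k, "v": members}


docs = [
    {"title": "Deep Learning", "ident": "a"},
    {"title": "deep learning.", "ident": "b"},
    {"title": "Other", "ident": "c"},
]
clusters = list(cluster_by_key(docs))
print(clusters)
```

At the real data size the sort has to happen on disk (hence the `TMPDIR` settings above); the in-memory `sorted` here only works for samples.
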
Cluster count and sizes:

```
$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
```

Follow up tasks:

* each cluster will have ref and non-ref items
* we want at least one non-ref item

```
$ skate-cluster -both ...
```

Will keep only those clusters that contain at least one ref and one non-ref entry.

Found 40257623 clusters, iteration over the 89GB compressed file takes 28min.

Raw synopsis:

```
$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
```

Some numbers:

* [ ] number of 2-clusters, where not both entries have a doi?

Verification.

* needed a different batch verifier, since we do not need pairwise comparisons;

```
$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
8390899 Status.DIFFERENT Reason.YEAR
6191622 Status.EXACT Reason.DOI
5468805 Status.STRONG Reason.JACCARD_AUTHORS
3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
 424441 Status.AMBIGUOUS Reason.UNKNOWN
 199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
 138144 Status.AMBIGUOUS Reason.SHORT_TITLE
  92054 Status.DIFFERENT Reason.PAGE_COUNT
  25122 Status.AMBIGUOUS Reason.BLACKLISTED
  22964 Status.EXACT Reason.WORK_ID
  17702 Status.STRONG Reason.VERSIONED_DOI
  16236 Status.DIFFERENT Reason.COMPONENT
  14462 Status.STRONG Reason.PREPRINT_PUBLISHED
   9632 Status.STRONG Reason.PMID_DOI_PAIR
   3429 Status.STRONG Reason.ARXIV_VERSION
   3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
    195 Status.STRONG Reason.FIGSHARE_VERSION
     76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
     74 Status.DIFFERENT Reason.TITLE_FILENAME
     43 Status.DIFFERENT Reason.NUM_DIFF
     22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
     11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
      1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
```

Guessing: Maybe 30% "strong", so maybe ~120M new edges?

----

# Manual sampling and issues

```
https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy
https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi
Status.DIFFERENT Reason.YEAR
```

Grobid output:

```xml
The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach LServillo Van Den PBroeck 10.1080/02697459.2012.661179> En línea] 2012 [Fecha de consulta: 21 de agosto 2015 En: Planning Practice and Research 27 1
```

There are dates, but no clean, explicit 2012.

Another issue:

```
https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru
https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u
Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
```

Very similar titles:

"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...

* years do not match, but fuzzycat does not check for that (1995, vs 2004 in the refs)

Intermediate match results:

```
141970958 Status.DIFFERENT Reason.YEAR
106734288 Status.EXACT Reason.DOI
 91205561 Status.STRONG Reason.JACCARD_AUTHORS
 66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
  7449880 Status.AMBIGUOUS Reason.UNKNOWN
  3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1199761 Status.DIFFERENT Reason.PAGE_COUNT
  1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
   395710 Status.EXACT Reason.WORK_ID
   362089 Status.DIFFERENT Reason.COMPONENT
   351654 Status.AMBIGUOUS Reason.BLACKLISTED
   326730 Status.STRONG Reason.VERSIONED_DOI
   239924 Status.STRONG Reason.PREPRINT_PUBLISHED
   171594 Status.STRONG Reason.PMID_DOI_PAIR
    54646 Status.STRONG Reason.ARXIV_VERSION
    49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5219 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1874 Status.STRONG Reason.FIGSHARE_VERSION
     1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      774 Status.DIFFERENT Reason.NUM_DIFF
      448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```

Another false negative:

* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji

```
https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
Status.DIFFERENT Reason.YEAR
```

Both docs contain 1972?

```xml
Variational Wavefunctions for H2 + FWeinhold J. Chem. Phys 56
```

----

Running:

```
$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst |
    parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv
```

resulted in a 69GB tsv file and took 3056m5.322s (~50h), 512033197 comparisons.

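On a smaller sample, the status and reason tally can also be done in Python, in the spirit of `cut | sort | uniq -c | sort -nr`. A sketch assuming tab-separated rows with status and reason in the third and fourth columns; the function names and the toy rows are illustrative only.

```python
import collections


def tally_status_reason(lines, sep="\t"):
    """Count (status, reason) pairs taken from columns 3 and 4 of each row."""
    counter = collections.Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 4:
            continue  # skip malformed rows
        counter[(fields[2], fields[3])] += 1
    return counter


def positive_links(counter):
    """Sum the counts whose status is STRONG or EXACT."""
    return sum(
        n for (status, _), n in counter.items()
        if status in ("Status.STRONG", "Status.EXACT")
    )


# Toy rows (assumption): ident a, ident b, status, reason.
rows = [
    "a1\tb1\tStatus.EXACT\tReason.DOI\n",
    "a2\tb2\tStatus.DIFFERENT\tReason.YEAR\n",
    "a3\tb3\tStatus.EXACT\tReason.DOI\n",
    "a4\tb4\tStatus.STRONG\tReason.JACCARD_AUTHORS\n",
]
counts = tally_status_reason(rows)
for (status, reason), n in counts.most_common():
    print(n, status, reason)
print(positive_links(counts))
```

For the full 69GB file the shell pipeline is likely the better tool; the Python version is mainly convenient for ad hoc filtering on samples.
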
Stats:

```
$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 cluster_ref_verify_2021_02_16.tsv.zst |
    cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp LC_ALL=C sort -S20% | uniq -c | sort -nr
146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
 94300998 Status.STRONG Reason.JACCARD_AUTHORS
 68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
  7746937 Status.AMBIGUOUS Reason.UNKNOWN
  3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1265506 Status.DIFFERENT Reason.PAGE_COUNT
  1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
   409043 Status.EXACT Reason.WORK_ID
   374051 Status.DIFFERENT Reason.COMPONENT
   356772 Status.AMBIGUOUS Reason.BLACKLISTED
   336588 Status.STRONG Reason.VERSIONED_DOI
   249723 Status.STRONG Reason.PREPRINT_PUBLISHED
   177547 Status.STRONG Reason.PMID_DOI_PAIR
    56445 Status.STRONG Reason.ARXIV_VERSION
    51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5255 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1946 Status.STRONG Reason.FIGSHARE_VERSION
     1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      798 Status.DIFFERENT Reason.NUM_DIFF
      463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
```

286M positive links.

```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
286008492
```

Or 175M, if we exclude DOI and work matches.

```
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
175547235
```

----

The final derivation dep tree looks like:

```
$ ./tasks.py -d BiblioRef
 \_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
    \_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
                   \_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                      \_ Input()
    \_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
             \_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
          \_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
             \_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ Input()
```
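
A listing like the dep tree above can be produced by a recursive walk over task dependencies. A minimal sketch, assuming a hypothetical `Task` class with a list of required tasks; this is not the actual `tasks.py` code, and the dependency edges in the toy example are assumed from the task names.

```python
class Task:
    """Hypothetical stand-in for a pipeline task with dependencies."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)


def render_deps(task, depth=0):
    """Return the lines of an indented dependency tree, one task per line."""
    lines = [" " * (3 * depth) + " \\_ " + task.name]
    for dep in task.requires:
        lines.extend(render_deps(dep, depth + 1))
    return lines


# Toy subset; edges are an assumption, not read from tasks.py.
tree = Task("RefsFatcatDOIJoin", [
    Task("FatcatDOIsLower", [Task("FatcatDOIs")]),
    Task("RefsDOIsLower", [Task("RefsDOIs")]),
])
print("\n".join(render_deps(tree)))
```

Shared dependencies are printed once per parent in this sketch, so a task required by several others appears several times.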