Diffstat (limited to 'python/notes/version_1.md')
-rw-r--r-- | python/notes/version_1.md | 491 |
1 files changed, 491 insertions, 0 deletions
diff --git a/python/notes/version_1.md b/python/notes/version_1.md
new file mode 100644
index 0000000..50a38cc
--- /dev/null
+++ b/python/notes/version_1.md
@@ -0,0 +1,491 @@
+# Version 1
+
+Includes:
+
+* doi, pmid, pmcid, arxiv
+* title-lower exact matches
+
+Title join yields 16B+ matches (16761492658), since we have many generic rows, e.g.
+"introduction". 180G compressed, about 53 min for one pass.
+
+```
+$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
+    <(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
+```
+
+Filter and sample with `awk`, e.g. via:
+
+```
+$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
+```
+
+Need to pre-filter before the join, to keep the join smaller.
+
+Basic inspection of the "exact lower title" set.
+
+* 16B+ candidates
+* as the join keys are already sorted, we can run uniq
+
+```
+$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst
+
+real 92m28.442s
+user 142m49.627s
+sys 46m9.473s
+```
+
+Some manual sampling:
+
+Different releases, but same references (585):
+
+* https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
+* https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references
+
+There are duplicates in the join; we need to filter them out.
+
+```
+$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
+```
+
+Left with about 13B unique rows.
+
+OCI, example:
+
+* https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
+* OCI: 646 citations
+
+We have 356 via doi, pmid, about 112 via title, 468 total; which ones do we miss?
+
+However, we do have all but one of the OCI DOIs in fatcat:
+
+```
+$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
+```
+
+Example, DOI not in OCI:
+
+* https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30
+
+Possible mitigations:
+
+* ignore common titles
+* ignore numbers only
+
+Example: `42` appears 3816 times
+
+Harder cases:
+
+* "41st annual meeting" - too generic, and wrong
+
+
+Generic DOI lookup from OCI in fatcat:
+
+```
+$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
+{"doi":"10.1530/erc-16-0228","status":200}
+{"doi":"10.1371/journal.pone.0080023","status":200}
+{"doi":"10.1074/jbc.m114.566141","status":200}
+...
+```
+
+Overall:
+
+* 31344136 unique titles
+
+Most common join title:
+
+* 11,939,631,644 introduction
+* also: "science", "preface", "book reviews", ..., "cell", ...
+
+Filtering:
+
+```
+$ zstdcat -T0 title_counts.tsv.zst | \
+    LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
+```
+
+About 7275 titles to filter out, e.g.
+
+```
+...
+ 475300 abstracts of papers
+  20502 ac
+  13892 aca
+   7881 academic freedom
+...
+   5047 community policing
+ 157176 community-acquired pneumonia
+  68222 commutative algebra
+   5512 comorbidity
+   5516 compact stars
+   8865 company
+...
+   7353 facebook
+   6461 facial pain
+   8977 facilities
+   5238 facing the future
+   5064 fact
+  11198 fact sheet
+...
+```
+
+Trying fuzzycat clustering with 0.1.13, which allows compressing intermediate
+artifacts via `-C`.
+
+```
+$ time zstdcat \
+    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
+    fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
+```
+
+Using fuzzycat 0.1.13 with compression; all fine until:
+
+```
+$ time zstdcat \
+    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
+    -l | parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
+    -m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst
+
+1.58G 6:35:39 [66.5k/s] [ <=> ]
+parallel: Error: Output is incomplete.
+parallel: Error: Cannot append to buffer file in /tmp.
+parallel: Error: Is the disk full?
+parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
+
+real 1013m20.128s
+user 2696m14.290s
+sys 119m29.419s
+```
+
+A run with `--compress` and `--tmpdir` set on parallel worked:
+
+```
+$ time zstdcat \
+    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    parallel --compress --tmpdir /fast/tmp -j 4 --block 10M --roundrobin --pipe \
+    'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
+    zstd -T0 -c > cluster.ndj.zst
+
+real 1301m26.206s
+user 2778m20.635s
+sys 140m32.121s
+```
+
+* 21h, finds 5850385 clusters (seems too low)
+
+# Sample generation
+
+Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:
+
+* ~114M refs
+* ~7M releases
+
+Adjusted `tasks.py` to use a different sha1 and updated settings.ini with
+sample file locations.
+
+# First clustering
+
+Key extraction (KE), sorting and clustering took 14h, when the merged dataset
+is already there (it takes ~80min to convert refs to releases, plus a bit more
+to concatenate the files).
+
+```
+$ ./run.sh RefsFatcatClusters
+
+real 841m45.169s
+user 2872m35.481s
+sys 561m14.231s
+```
+
+Resulting file is 154G compressed.
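The key extraction, sort and group sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `title_key` is a hypothetical stand-in (lowercased title with non-alphanumerics dropped), not fuzzycat's `tsandcrawler` key, and the `k`/`v` output shape is assumed from the `jq` expressions these notes run against the cluster file.

```python
import itertools
import re


def title_key(doc):
    """Hypothetical key function: lowercase the title, keep only [a-z0-9]."""
    return re.sub(r"[^a-z0-9]", "", (doc.get("title") or "").lower())


def cluster(docs):
    """Yield {"k": key, "v": [docs]} for every key shared by 2+ docs."""
    # Step 1: key extraction, pairing each doc with its key.
    keyed = [(title_key(doc), doc) for doc in docs]
    # Step 2: sort by key only (stable, so input order is kept within a key).
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: group adjacent rows that share a key.
    for key, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
        members = [doc for _, doc in group]
        if key and len(members) > 1:
            yield {"k": key, "v": members}


if __name__ == "__main__":
    sample = [
        {"ident": "a", "title": "Deep Learning"},
        {"ident": "b", "title": "deep learning."},
        {"ident": "c", "title": "Other"},
    ]
    for c in cluster(sample):
        print(c["k"], [d["ident"] for d in c["v"]])
```

At the scale in these notes the sort would be external (e.g. `sort -S`), not in memory; the sketch only shows the shape of the three steps.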
+
+Cluster count and sizes:
+
+```
+$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
+```
+
+Follow up tasks:
+
+* each cluster will have ref and non-ref items
+* we want at least one non-ref item
+
+```
+$ skate-cluster -both ...
+```
+
+Will keep only those clusters that contain at least one ref and one non-ref
+entry.
+
+Found 40257623 clusters; iterating over the 89GB compressed file takes 28min.
+
+Raw synopsis:
+
+```
+$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
+    jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
+```
+
+Some numbers:
+
+* [ ] number of 2-clusters, where not both entries have a doi?
+
+Verification.
+
+* needed a different batch verifier, since we do not need pairwise comparisons.
+
+```
+$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
+8390899 Status.DIFFERENT Reason.YEAR
+6191622 Status.EXACT Reason.DOI
+5468805 Status.STRONG Reason.JACCARD_AUTHORS
+3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
+ 424441 Status.AMBIGUOUS Reason.UNKNOWN
+ 199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+ 138144 Status.AMBIGUOUS Reason.SHORT_TITLE
+  92054 Status.DIFFERENT Reason.PAGE_COUNT
+  25122 Status.AMBIGUOUS Reason.BLACKLISTED
+  22964 Status.EXACT Reason.WORK_ID
+  17702 Status.STRONG Reason.VERSIONED_DOI
+  16236 Status.DIFFERENT Reason.COMPONENT
+  14462 Status.STRONG Reason.PREPRINT_PUBLISHED
+   9632 Status.STRONG Reason.PMID_DOI_PAIR
+   3429 Status.STRONG Reason.ARXIV_VERSION
+   3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+    729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+    195 Status.STRONG Reason.FIGSHARE_VERSION
+     76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+     74 Status.DIFFERENT Reason.TITLE_FILENAME
+     43 Status.DIFFERENT Reason.NUM_DIFF
+     22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+     11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+      1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+```
+
+Guessing: Maybe 30% "strong", so maybe ~120M new edges?
+
+
+----
+
+# Manual sampling and issues
+
+```
+https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
+```
+
+Grobid output:
+
+```xml
+<biblStruct xml:id="b77">
+    <analytic>
+        <title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
+        <author>
+            <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
+        </author>
+        <author>
+            <persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
+        </author>
+        <author>
+            <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
+        </author>
+        <idno type="DOI">10.1080/02697459.2012.661179></idno>
+        <idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
+        <ptr target="<http://dx.doi.org/10.1080/02697459.2012.661179>" />
+    </analytic>
+    <monogr>
+        <title level="j">En: Planning Practice and Research</title>
+        <imprint>
+            <biblScope unit="volume">27</biblScope>
+            <biblScope unit="issue">1</biblScope>
+            <biblScope unit="page" from="41" to="61" />
+        </imprint>
+    </monogr>
+</biblStruct>
+```
+
+There are dates, but no explicit, clean 2012.
+
+Another issue:
+
+```
+https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+```
+
+Very similar titles:
+
+"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...
+
+* years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)
+
+Intermediate match results:
+
+```
+141970958 Status.DIFFERENT Reason.YEAR
+106734288 Status.EXACT Reason.DOI
+ 91205561 Status.STRONG Reason.JACCARD_AUTHORS
+ 66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+ 53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+ 20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
+  7449880 Status.AMBIGUOUS Reason.UNKNOWN
+  3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+  1199761 Status.DIFFERENT Reason.PAGE_COUNT
+  1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
+   395710 Status.EXACT Reason.WORK_ID
+   362089 Status.DIFFERENT Reason.COMPONENT
+   351654 Status.AMBIGUOUS Reason.BLACKLISTED
+   326730 Status.STRONG Reason.VERSIONED_DOI
+   239924 Status.STRONG Reason.PREPRINT_PUBLISHED
+   171594 Status.STRONG Reason.PMID_DOI_PAIR
+    54646 Status.STRONG Reason.ARXIV_VERSION
+    49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+    17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+     5219 Status.DIFFERENT Reason.TITLE_FILENAME
+     2451 Status.AMBIGUOUS Reason.APPENDIX
+     1874 Status.STRONG Reason.FIGSHARE_VERSION
+     1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+      774 Status.DIFFERENT Reason.NUM_DIFF
+      448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+      123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+       17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+       17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
+        6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
+```
+
+Another false negative:
+
+* https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
+* http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
+
+```
+https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
+```
+
+Both docs contain 1972?
+
+```xml
+<biblStruct xml:id="b67">
+    <analytic>
+        <title level="a" type="main">Variational Wavefunctions for H2 +</title>
+        <author>
+            <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
+        </author>
+    </analytic>
+    <monogr>
+        <title level="j">J. Chem. Phys</title>
+        <imprint>
+            <biblScope unit="volume">56</biblScope>
+            <biblScope unit="page" from="3798" to="3801" />
+            <date type="published" when="1972" />
+        </imprint>
+    </monogr>
+</biblStruct>
+```
+
+----
+
+Running:
+
+```
+$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > \
+    cluster_ref_verify.tsv
+```
+
+resulted in a 69GB tsv file and took 3056m5.322s (~51h), 512033197 comparisons.
+
+Stats:
+
+```
+$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
+    cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
+    LC_ALL=C sort -S20% | uniq -c | sort -nr
+
+146095427 Status.DIFFERENT Reason.YEAR
+110052214 Status.EXACT Reason.DOI
+ 94300998 Status.STRONG Reason.JACCARD_AUTHORS
+ 68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
+ 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
+ 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
+  7746937 Status.AMBIGUOUS Reason.UNKNOWN
+  3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
+  1265506 Status.DIFFERENT Reason.PAGE_COUNT
+  1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
+   409043 Status.EXACT Reason.WORK_ID
+   374051 Status.DIFFERENT Reason.COMPONENT
+   356772 Status.AMBIGUOUS Reason.BLACKLISTED
+   336588 Status.STRONG Reason.VERSIONED_DOI
+   249723 Status.STRONG Reason.PREPRINT_PUBLISHED
+   177547 Status.STRONG Reason.PMID_DOI_PAIR
+    56445 Status.STRONG Reason.ARXIV_VERSION
+    51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
+    17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
+     5255 Status.DIFFERENT Reason.TITLE_FILENAME
+     2451 Status.AMBIGUOUS Reason.APPENDIX
+     1946 Status.STRONG Reason.FIGSHARE_VERSION
+     1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
+      798 Status.DIFFERENT Reason.NUM_DIFF
+      463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
+      125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
+       18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
+       18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
+        7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
+
+```
+
+286M positive links.
+
+```
+$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
+286008492
+```
+
+Or 175M, if we exclude DOI and work matches.
+
+```
+$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
+175547235
+```
+
+----
+
+The final derivation dep tree looks like:
+
+```
+ $ ./tasks.py -d BiblioRef
+  \_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+    \_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+      \_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+        \_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+          \_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+              \_ ReleaseExportExpanded()
+              \_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+                \_ Input()
+    \_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+      \_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+        \_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+          \_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ Input()
+          \_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ ReleaseExportExpanded()
+        \_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+          \_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ ReleaseExportExpanded()
+          \_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ Input()
+        \_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+          \_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ ReleaseExportExpanded()
+          \_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ Input()
+        \_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+          \_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+              \_ ReleaseExportExpanded()
+          \_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+            \_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
+              \_ Input()
+```
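The positive-link totals quoted in these notes (286008492, and 175547235 without DOI and work-id matches) can be cross-checked without the shell pipeline. The sketch below is a convenience re-computation in Python, not part of the original workflow; the counts are copied from the final stats listing, and the function name `total_links` is made up for illustration.

```python
# Counts copied from the final `uniq -c` listing in the notes.
STATS = """
146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
94300998 Status.STRONG Reason.JACCARD_AUTHORS
68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
7746937 Status.AMBIGUOUS Reason.UNKNOWN
3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
1265506 Status.DIFFERENT Reason.PAGE_COUNT
1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
409043 Status.EXACT Reason.WORK_ID
374051 Status.DIFFERENT Reason.COMPONENT
356772 Status.AMBIGUOUS Reason.BLACKLISTED
336588 Status.STRONG Reason.VERSIONED_DOI
249723 Status.STRONG Reason.PREPRINT_PUBLISHED
177547 Status.STRONG Reason.PMID_DOI_PAIR
56445 Status.STRONG Reason.ARXIV_VERSION
51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
5255 Status.DIFFERENT Reason.TITLE_FILENAME
2451 Status.AMBIGUOUS Reason.APPENDIX
1946 Status.STRONG Reason.FIGSHARE_VERSION
1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
798 Status.DIFFERENT Reason.NUM_DIFF
463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
"""


def parse(stats):
    """Yield (count, status, reason) with the enum prefixes stripped."""
    for line in stats.strip().splitlines():
        count, status, reason = line.split()
        yield int(count), status.split(".", 1)[1], reason.split(".", 1)[1]


def total_links(stats, statuses=("STRONG", "EXACT"), exclude_reasons=()):
    """Sum counts for the given statuses, skipping excluded reasons."""
    return sum(
        count
        for count, status, reason in parse(stats)
        if status in statuses and reason not in exclude_reasons
    )


if __name__ == "__main__":
    print(total_links(STATS))
    print(total_links(STATS, exclude_reasons=("DOI", "WORK_ID")))
    print(sum(count for count, _, _ in parse(STATS)))
```

The three printed values correspond to the positive links, the positive links without DOI and work-id matches, and the overall number of comparisons.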