
Version 1

Includes:

  • doi, pmid, pmcid, arxiv
  • title-lower exact matches

Title join yields 16B+ matches (16761492658), since we have many generic rows, e.g. "introduction". The result is 180G compressed; one pass over it takes about 53 min.

$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
    <(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
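
Why a few generic titles dominate the output: for every key, an inner join emits the cross product of the matching rows, so a title occurring m times on one side and n times on the other yields m × n lines. A minimal sketch with made-up toy data; the function name `join_output_size` is illustrative and not part of any tool used here.

```python
from collections import Counter


def join_output_size(left_keys, right_keys):
    """Return the number of rows an inner join would emit, per key.

    For each shared key this is the cross product of matching rows:
    (occurrences on the left) * (occurrences on the right).
    """
    left, right = Counter(left_keys), Counter(right_keys)
    return {key: left[key] * right[key] for key in left.keys() & right.keys()}


# Toy data (hypothetical): one generic title, one specific title.
fatcat_titles = ["introduction"] * 3 + ["variational wavefunctions for h2 +"]
refs_titles = ["introduction"] * 4 + ["variational wavefunctions for h2 +"]

sizes = join_output_size(fatcat_titles, refs_titles)
total_rows = sum(sizes.values())
```

Here the single generic title accounts for 12 of the 13 joined rows.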

Filter and sample with awk, e.g. via:

$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'

Need to pre-filter before join, to keep join smaller.
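
A possible pre-filter, sketched in Python. It assumes the same two criteria as in the sampling command (starts with an ASCII alphanumeric character, longer than 30 characters) would be applied to the whole title before the join; the threshold and the function names are assumptions, not settled choices.

```python
import re

# Assumption: ASCII alphanumeric start, similar to grep -E '^[[:alnum:]]' under LC_ALL=C.
STARTS_ALNUM = re.compile(r"^[A-Za-z0-9]")


def keep_title(title, min_length=30):
    """Return True if a lowercased title looks specific enough to join on."""
    return bool(STARTS_ALNUM.match(title)) and len(title) > min_length


def prefilter(titles, min_length=30):
    """Lazily yield only the titles that pass keep_title."""
    return (t for t in titles if keep_title(t, min_length))


candidates = [
    "introduction",
    "42",
    "(untitled)",
    "the social construction of planning systems",
]
kept = list(prefilter(candidates))
```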

Basic inspection of the "exact lower title" set.

  • 16B+ candidates
  • as the join keys are already sorted, we can run uniq

$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst

real    92m28.442s
user    142m49.627s
sys     46m9.473s
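
The counting step works in constant memory because the keys arrive sorted, so equal keys are adjacent. A small Python sketch of the same idea using `itertools.groupby`; the sample keys are made up.

```python
from itertools import groupby


def count_sorted_keys(keys):
    """Yield (key, count) for runs of equal adjacent keys, like `uniq -c`.

    Only correct if equal keys are adjacent (e.g. sorted input); only the
    current run is held in memory.
    """
    for key, run in groupby(keys):
        yield key, sum(1 for _ in run)


sorted_keys = ["abstracts", "introduction", "introduction", "introduction", "preface"]
counts = list(count_sorted_keys(sorted_keys))
```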

Some manual sampling:

Different release, but same references (585):

  • https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
  • https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references

There are duplicates in the join, need to filter them out.

$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst

About 13B unique rows remain.

OCI, example:

  • https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
  • OCI: 646 citations

We have 356 via doi, pmid and about 112 via title, 468 in total; which ones are we missing?

However, we do have all but one of the OCI DOIs in fatcat:

$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json

Example, DOI not in OCI:

  • https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30

Possible mitigations:

  • ignore common titles
  • ignore numbers only

Example: "42" appears 3816 times

Harder cases:

  • "41st annual meeting" - too generic, and wrong
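
The two simple mitigations could look roughly like this. The blocklist contents and the function name are hypothetical; a real list would be derived from the title counts. Note that the harder case is not caught by either rule.

```python
def is_too_generic(title, common_titles):
    """Return True for titles that should not be used as an exact-match key.

    Implements the two simple rules: numbers-only titles and known common titles.
    """
    normalized = title.strip().lower()
    if normalized.replace(" ", "").isdigit():
        return True
    return normalized in common_titles


# Hypothetical blocklist for illustration only.
COMMON = {"introduction", "preface", "book reviews"}

titles = ["42", "Introduction", "compact stars", "41st annual meeting"]
flags = {t: is_too_generic(t, COMMON) for t in titles}
```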

Generic DOI lookup from OCI in fatcat:

$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
{"doi":"10.1530/erc-16-0228","status":200}
{"doi":"10.1371/journal.pone.0080023","status":200}
{"doi":"10.1074/jbc.m114.566141","status":200}
...
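
To see which DOIs are missing, the lookup output can be tallied by status. A sketch that reads JSON lines of the shape shown above; it assumes a non-200 status means the DOI was not found, and the last sample row is made up for illustration.

```python
import json
from collections import Counter


def tally_statuses(lines):
    """Count status codes in JSON lines shaped like {"doi": ..., "status": ...}.

    Returns the counts and the list of DOIs with a non-200 status.
    """
    counts = Counter()
    missing = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        counts[doc["status"]] += 1
        if doc["status"] != 200:
            missing.append(doc["doi"])
    return counts, missing


sample = [
    '{"doi":"10.1530/erc-16-0228","status":200}',
    '{"doi":"10.1371/journal.pone.0080023","status":200}',
    '{"doi":"10.1074/jbc.m114.566141","status":200}',
    '{"doi":"10.9999/hypothetical-missing","status":404}',  # made-up row
]
counts, missing = tally_statuses(sample)
```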

Overall:

  • 31344136 unique titles

Most common join title:

  • 11,939,631,644 introduction
  • also: "science", "preface", "book reviews", ..., "cell", ...

Filtering:

$ zstdcat -T0 title_counts.tsv.zst | \
    LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'

About 7275 titles to filter out, e.g.

...
 475300 abstracts of papers
  20502 ac
  13892 aca
   7881 academic freedom
...
   5047 community policing
 157176 community-acquired pneumonia
  68222 commutative algebra
   5512 comorbidity
   5516 compact stars
   8865 company
...
   7353 facebook
   6461 facial pain
   8977 facilities
   5238 facing the future
   5064 fact
  11198 fact sheet
...
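
The two-tier rule from the awk expression, restated in Python for readability. As in awk, the length is taken over the whole line, including the count column and its padding; the function name is illustrative and the last two sample lines are made up.

```python
def is_common_title(line):
    """Apply the two-tier rule to one `uniq -c` style line ("  COUNT title").

    Short lines are dropped at a lower count threshold than longer ones.
    """
    line = line.rstrip("\n")
    fields = line.split()
    if not fields or not fields[0].isdigit():
        return False
    count = int(fields[0])
    return (count > 5000 and len(line) < 30) or (count > 15000 and len(line) < 40)


lines = [
    " 157176 community-acquired pneumonia",
    "   7881 academic freedom",
    "   4000 rare title",
    "   6000 a considerably longer title that is kept",
]
decisions = [is_common_title(line) for line in lines]
```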

Trying fuzzycat clustering with 0.1.13, which can compress intermediate artifacts (-C).

$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
    fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst

Using fuzzycat 0.1.13 with compression; all fine until:

$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
    -l | parallel -j 16 --block 10M  --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
    -m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst

1.58G 6:35:39 [66.5k/s] [ <=> ]
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

real    1013m20.128s
user    2696m14.290s
sys     119m29.419s

A run with --compress and --tmpdir set on parallel worked:

$ time zstdcat \
    RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel --compress --tmpdir /fast/tmp -j 4 --block 10M  --roundrobin --pipe \
    'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
    zstd -T0 -c > cluster.ndj.zst

real    1301m26.206s
user    2778m20.635s
sys     140m32.121s
  • 21h, finds 5850385 clusters (seems too low)

Sample generation

Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:

  • ~114M refs
  • ~7M releases
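
A sketch of the year filter over JSON lines. The field name `release_year` is an assumption about the document schema, and the sample documents are made up; this is not the script that produced the samples.

```python
import json

SAMPLE_YEARS = {1895, 1955, 1995, 2015}


def in_sample(doc, year_field="release_year"):
    """Return True if the document's year is one of the sample years.

    The field name is an assumption; adjust it to the actual schema.
    Missing or non-numeric years are treated as not in the sample.
    """
    try:
        return int(doc.get(year_field)) in SAMPLE_YEARS
    except (TypeError, ValueError):
        return False


def sample_lines(lines, year_field="release_year"):
    """Yield the JSON lines whose year falls into the sample."""
    for line in lines:
        if in_sample(json.loads(line), year_field):
            yield line


docs = [
    '{"ident": "a", "release_year": 1995}',
    '{"ident": "b", "release_year": 2004}',
    '{"ident": "c", "release_year": "2015"}',
    '{"ident": "d"}',
]
kept = [json.loads(line)["ident"] for line in sample_lines(docs)]
```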

Adjusted tasks.py to use a different sha1 and updated settings.ini with sample file locations.

First clustering

Key extraction (KE), sorting and clustering took 14h once the merged dataset already exists (converting refs to releases takes ~80min, plus a bit more to concatenate the files).

$ ./run.sh RefsFatcatClusters

real    841m45.169s
user    2872m35.481s
sys     561m14.231s

Resulting file is 154G compressed.

Cluster count and sizes:

$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
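
The same size extraction in Python, plus a histogram of cluster sizes. It assumes cluster documents with a key `k` and a list of items `v`, as the jq expression suggests; the sample clusters are made up.

```python
import json
from collections import Counter


def cluster_sizes(lines):
    """Yield (size, key) per cluster document, mirroring [(.v|length), .k]."""
    for line in lines:
        doc = json.loads(line)
        yield len(doc["v"]), doc["k"]


clusters = [
    '{"k": "key-a", "v": [{"ident": "r1"}, {"ident": "r2"}]}',
    '{"k": "key-b", "v": [{"ident": "r3"}]}',
    '{"k": "key-c", "v": [{"ident": "r4"}, {"ident": "r5"}]}',
]
# Tab-separated rows, as written to sizes.tsv.
rows = [f"{size}\t{key}" for size, key in cluster_sizes(clusters)]
# How many clusters exist per size.
histogram = Counter(size for size, _ in cluster_sizes(clusters))
```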

Follow up tasks:

  • each cluster will have ref and non-ref items
  • we want at least one non-ref item

$ skate-cluster -both ...

Will keep only those clusters that contain at least one ref and one non-ref entry.
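
The filter condition, sketched in Python. It assumes an item is a reference when `extra.skate.status` equals "ref", which is the field used in the synopsis query in these notes; this is an illustration, not the skate-cluster implementation.

```python
def is_ref(item):
    """Return True if an item is marked as a reference (extra.skate.status == "ref")."""
    extra = item.get("extra") or {}
    skate = extra.get("skate") or {}
    return skate.get("status") == "ref"


def has_both(cluster):
    """Return True if the cluster has at least one ref and one non-ref item."""
    flags = [is_ref(item) for item in cluster.get("v", [])]
    return any(flags) and not all(flags)


# Made-up items for illustration.
ref = {"ident": "r1", "extra": {"skate": {"status": "ref"}}}
release = {"ident": "x1", "extra": None}

clusters = [
    {"k": "a", "v": [ref, release]},
    {"k": "b", "v": [ref, ref]},
    {"k": "c", "v": [release]},
]
kept = [c["k"] for c in clusters if has_both(c)]
```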

Found 40257623 clusters, iteration over the 89GB compressed file takes 28min.

Raw synopsis:

$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r

Some numbers:

  • [ ] number of 2-clusters, where not both entries have a doi?

Verification.

  • needed a different batch verifier, since we do not need pairwise comparisons

$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
8390899 Status.DIFFERENT Reason.YEAR
6191622 Status.EXACT Reason.DOI
5468805 Status.STRONG Reason.JACCARD_AUTHORS
3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
 424441 Status.AMBIGUOUS Reason.UNKNOWN
 199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
 138144 Status.AMBIGUOUS Reason.SHORT_TITLE
  92054 Status.DIFFERENT Reason.PAGE_COUNT
  25122 Status.AMBIGUOUS Reason.BLACKLISTED
  22964 Status.EXACT Reason.WORK_ID
  17702 Status.STRONG Reason.VERSIONED_DOI
  16236 Status.DIFFERENT Reason.COMPONENT
  14462 Status.STRONG Reason.PREPRINT_PUBLISHED
   9632 Status.STRONG Reason.PMID_DOI_PAIR
   3429 Status.STRONG Reason.ARXIV_VERSION
   3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
    195 Status.STRONG Reason.FIGSHARE_VERSION
     76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
     74 Status.DIFFERENT Reason.TITLE_FILENAME
     43 Status.DIFFERENT Reason.NUM_DIFF
     22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
     11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
      1 Status.STRONG Reason.CUSTOM_BSI_UNDATED

Guessing: Maybe 30% "strong", so maybe ~120M new edges?


Manual sampling and issues

https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR

Grobid output:

<biblStruct xml:id="b77">
        <analytic>
                <title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
                </author>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
                </author>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
                </author>
                <idno type="DOI">10.1080/02697459.2012.661179&gt;</idno>
                <idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
                <ptr target="&lt;http://dx.doi.org/10.1080/02697459.2012.661179&gt;" />
        </analytic>
        <monogr>
                <title level="j">En: Planning Practice and Research</title>
                <imprint>
                        <biblScope unit="volume">27</biblScope>
                        <biblScope unit="issue">1</biblScope>
                        <biblScope unit="page" from="41" to="61" />
                </imprint>
        </monogr>
</biblStruct>

The record contains dates, but no explicit, cleanly parsed 2012.
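
To illustrate why the year is ambiguous here: a naive scan for year-like tokens in the untyped idno string finds two candidates, the publication year and the access date. The regex and function name are illustrative only; this is not what Grobid or fuzzycat does.

```python
import re

# Four-digit tokens between 1500 and 2099, delimited by word boundaries.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def year_candidates(text):
    """Return all year-like tokens in order of appearance."""
    return [int(y) for y in YEAR.findall(text)]


raw = "En línea] 2012 [Fecha de consulta: 21 de agosto 2015"
candidates = year_candidates(raw)
```

With two candidates and no type information, picking the publication year would need an extra heuristic.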

Another issue:

https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH

Very similar titles:

"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...

  • years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)

Intermediate match results:

141970958 Status.DIFFERENT Reason.YEAR
106734288 Status.EXACT Reason.DOI
 91205561 Status.STRONG Reason.JACCARD_AUTHORS
 66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
  7449880 Status.AMBIGUOUS Reason.UNKNOWN
  3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1199761 Status.DIFFERENT Reason.PAGE_COUNT
  1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
   395710 Status.EXACT Reason.WORK_ID
   362089 Status.DIFFERENT Reason.COMPONENT
   351654 Status.AMBIGUOUS Reason.BLACKLISTED
   326730 Status.STRONG Reason.VERSIONED_DOI
   239924 Status.STRONG Reason.PREPRINT_PUBLISHED
   171594 Status.STRONG Reason.PMID_DOI_PAIR
    54646 Status.STRONG Reason.ARXIV_VERSION
    49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5219 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1874 Status.STRONG Reason.FIGSHARE_VERSION
     1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      774 Status.DIFFERENT Reason.NUM_DIFF
      448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC

Another false negative:

  • https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
  • http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR

Both docs contain 1972?

<biblStruct xml:id="b67">
        <analytic>
                <title level="a" type="main">Variational Wavefunctions for H2 +</title>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
                </author>
        </analytic>
        <monogr>
                <title level="j">J. Chem. Phys</title>
                <imprint>
                        <biblScope unit="volume">56</biblScope>
                        <biblScope unit="page" from="3798" to="3801" />
                        <date type="published" when="1972" />
                </imprint>
        </monogr>
</biblStruct>

Running:

$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
    parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv

This resulted in a 69GB tsv file with 512033197 comparisons and took 3056m5.322s (~51h).

Stats:

$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
    cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
    LC_ALL=C sort -S20% | uniq -c | sort -nr

146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
 94300998 Status.STRONG Reason.JACCARD_AUTHORS
 68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
  7746937 Status.AMBIGUOUS Reason.UNKNOWN
  3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
  1265506 Status.DIFFERENT Reason.PAGE_COUNT
  1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
   409043 Status.EXACT Reason.WORK_ID
   374051 Status.DIFFERENT Reason.COMPONENT
   356772 Status.AMBIGUOUS Reason.BLACKLISTED
   336588 Status.STRONG Reason.VERSIONED_DOI
   249723 Status.STRONG Reason.PREPRINT_PUBLISHED
   177547 Status.STRONG Reason.PMID_DOI_PAIR
    56445 Status.STRONG Reason.ARXIV_VERSION
    51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
    17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
     5255 Status.DIFFERENT Reason.TITLE_FILENAME
     2451 Status.AMBIGUOUS Reason.APPENDIX
     1946 Status.STRONG Reason.FIGSHARE_VERSION
     1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
      798 Status.DIFFERENT Reason.NUM_DIFF
      463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
      125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
       18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
       18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
        7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC

286M positive links.

$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
286008492

Or 175M, if we exclude DOI and work matches.

$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
175547235
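
The same two sums, recomputed in Python from the counts listed above. The table is abridged to all STRONG and EXACT rows plus two other rows for contrast; the function name is illustrative.

```python
# Abridged copy of the stats: every STRONG/EXACT row, plus two rows
# with other statuses to show that they are ignored.
STATS = """\
146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
 94300998 Status.STRONG Reason.JACCARD_AUTHORS
 55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
 21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
  7746937 Status.AMBIGUOUS Reason.UNKNOWN
  3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
   409043 Status.EXACT Reason.WORK_ID
   336588 Status.STRONG Reason.VERSIONED_DOI
   249723 Status.STRONG Reason.PREPRINT_PUBLISHED
   177547 Status.STRONG Reason.PMID_DOI_PAIR
    56445 Status.STRONG Reason.ARXIV_VERSION
    51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
     1946 Status.STRONG Reason.FIGSHARE_VERSION
       18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
        7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
"""


def sum_counts(stats, statuses, excluded_reasons=()):
    """Sum the count column for rows with a matching status.

    Rows whose reason is in excluded_reasons are skipped.
    """
    total = 0
    for line in stats.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        count, status, reason = fields
        if status in statuses and reason not in excluded_reasons:
            total += int(count)
    return total


POSITIVE = {"Status.STRONG", "Status.EXACT"}
all_positive = sum_counts(STATS, POSITIVE)
without_ids = sum_counts(STATS, POSITIVE, {"Reason.DOI", "Reason.WORK_ID"})
```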

The final task dependency tree looks like:

 $ ./tasks.py -d BiblioRef
 \_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
    \_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
                   \_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                      \_ Input()
    \_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
       \_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
          \_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
             \_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
          \_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ ReleaseExportExpanded()
             \_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ Input()
          \_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
             \_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ ReleaseExportExpanded()
             \_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                \_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
                   \_ Input()