Version 1
Includes:
- doi, pmid, pmcid, arxiv
- title-lower exact matches
Title join yields 16B+ matches (16761492658), since we have many generic rows, e.g. "introduction". 180G compressed, about 53 min for one pass.
$ LC_ALL=C time join -t ' ' -1 2 -2 2 <(zstdcat FatcatTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) \
<(zstdcat RefsTitlesLower/sha1-ef1756a5856085807742966f48d95b4cb00299a0.tsv.zst) | zstd -c > title.tsv.zst
Filter and sample with awk, e.g. via:
$ zstdcat -T0 title.tsv.zst | LC_ALL=C grep -E '^[[:alnum:]]' | awk 'length($1) > 30' | awk 'NR%1000==0'
Need to pre-filter before join, to keep join smaller.
Basic inspection of the "exact lower title" set.
- 16B+ candidates
- as the join keys are already sorted, we can run uniq
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C cut -f 1 | LC_ALL=C pv -l | LC_ALL=C uniq -c | zstd -c > title_counts.tsv.zst
real 92m28.442s
user 142m49.627s
sys 46m9.473s
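The counting step relies on equal keys being adjacent, which holds because the join output is sorted by key. A minimal Python sketch of the same idea as `uniq -c` follows; the function name is made up for illustration and this is not part of the pipeline above.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def count_sorted_keys(keys: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (count, key) for each run of equal adjacent keys, like `uniq -c`."""
    for key, run in groupby(keys):
        # groupby only merges neighbours, so the input must already be sorted
        # for the counts to be per-key totals.
        yield sum(1 for _ in run), key


if __name__ == "__main__":
    sample = ["cell", "introduction", "introduction", "introduction", "preface"]
    for count, key in count_sorted_keys(sample):
        print(f"{count}\t{key}")
```

Like `uniq`, this streams over the input and keeps only one run in memory at a time.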
Some manual sampling:
Different releases, but same references (585):
- https://fatcat.wiki/release/zvd5r6grcvd6tnmeovijvx4soq/references
- https://fatcat.wiki/release/4zutv5pmhjgs7nfvqy2zws6icm/references
There are duplicates in the join, need to filter them out.
$ time zstdcat -T0 title.tsv.zst | LC_ALL=C uniq | LC_ALL=C pv -l | zstd -T0 -c > title_uniq.tsv.zst
Left with about 13B uniq.
OCI, example:
- https://opencitations.net/index/coci/api/v1/citations/10.1056/nejmoa1606220
- OCI: 646 citations
We have 356 via doi, pmid, about 112 via title, 468 total; which ones do we miss?
However, we do have all but one of the OCI DOIs in fatcat:
$ jq -r '.[].citing' oci_v1_10_1056_nejmoa1606220.json | tigris-doi > oci_v1_10_1056_nejmoa1606220_lookup.json
Example, DOI not in OCI:
- https://opencitations.net/index/coci/api/v1/citations/10.14236/ewic/eva2014.30
Possible mitigations:
- ignore common titles
- ignore numbers only
Example: "42" appears 3816 times.
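The two mitigations can be sketched as a small predicate applied to each lowercased title before the join. This is an illustrative sketch only: the function name and the `stoplist` argument are assumptions, and the stoplist itself would have to come from a separate frequency analysis.

```python
from typing import FrozenSet


def is_generic_title(title: str, stoplist: FrozenSet[str] = frozenset()) -> bool:
    """Return True if a lowercased title should be skipped before the join."""
    t = title.strip()
    if not t:
        return True
    # "ignore numbers only": a title such as "42" carries no identifying signal.
    if t.replace(" ", "").isdigit():
        return True
    # "ignore common titles": membership in a precomputed set of generic titles.
    return t in stoplist


if __name__ == "__main__":
    common = frozenset({"introduction", "preface"})  # hypothetical stoplist
    for title in ["42", "introduction", "variational wavefunctions for h2 +"]:
        print(title, is_generic_title(title, common))
```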
Harder cases:
- "41st annual meeting" - too generic, and wrong
Generic DOI lookup from OCI in fatcat:
$ curl -sL https://opencitations.net/index/coci/api/v1/citations/10.1016/j.cell.2010.03.012 | jq -rc '.[].citing' | tigris-doi -w 256 | jq -rc .
{"doi":"10.1530/erc-16-0228","status":200}
{"doi":"10.1371/journal.pone.0080023","status":200}
{"doi":"10.1074/jbc.m114.566141","status":200}
...
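To find which citing DOIs are absent, the JSON lines can be filtered by status. The sketch below assumes that `status` is the HTTP status of the lookup and that 200 means the DOI was found; the notes show only the output format, so this is an interpretation rather than documented behaviour of `tigris-doi`.

```python
import json
from typing import Iterable, List


def missing_dois(lines: Iterable[str], ok_status: int = 200) -> List[str]:
    """Return the DOIs whose lookup status differs from ok_status."""
    missing = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # Assumption: any status other than ok_status counts as "not found".
        if doc.get("status") != ok_status:
            missing.append(doc["doi"])
    return missing
```

Fed with the lookup output, an empty result would mean every citing DOI resolved.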
Overall:
- 31344136 unique titles
most common join title:
- 11,939,631,644 introduction
- also: "science", "preface", "book reviews", ..., "cell", ...
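The join emits one row per pair of matching rows, so a key that occurs m times on one side and n times on the other contributes m * n rows. This is why a single generic title can account for most of the output. A small sketch with toy data (the function name is illustrative):

```python
from collections import Counter
from typing import Iterable


def join_size(left_keys: Iterable[str], right_keys: Iterable[str]) -> int:
    """Number of rows an inner join on the key would emit."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each key contributes (occurrences left) * (occurrences right) rows.
    return sum(n * right[key] for key, n in left.items() if key in right)


if __name__ == "__main__":
    fatcat = ["introduction"] * 3 + ["compact stars"]
    refs = ["introduction"] * 4 + ["facial pain"]
    # 3 * 4 = 12 rows for "introduction", none for the unmatched keys.
    print(join_size(fatcat, refs))
```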
Filtering:
$ zstdcat -T0 title_counts.tsv.zst | \
LC_ALL=C awk '($1 > 5000 && length($0) < 30) || ($1 > 15000 && length($0) < 40)'
About 7275 titles to filter out, e.g.
...
475300 abstracts of papers
20502 ac
13892 aca
7881 academic freedom
...
5047 community policing
157176 community-acquired pneumonia
68222 commutative algebra
5512 comorbidity
5516 compact stars
8865 company
...
7353 facebook
6461 facial pain
8977 facilities
5238 facing the future
5064 fact
11198 fact sheet
...
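For reference, the awk filter above can be written in Python roughly as follows. As in awk's `length($0)`, the length is taken over the whole `uniq -c` line including the count column and its padding, not over the title alone. Python counts characters where awk under `LC_ALL=C` counts bytes, so results may differ for non-ASCII titles; the function name is illustrative.

```python
from typing import Iterable, Iterator


def frequent_short_titles(lines: Iterable[str]) -> Iterator[str]:
    """Yield titles from `uniq -c` lines that match the awk thresholds."""
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip separators such as "..." or malformed lines
        count, title = int(fields[0]), fields[1]
        # Full line length, mirroring awk's length($0).
        n = len(line)
        if (count > 5000 and n < 30) or (count > 15000 and n < 40):
            yield title
```

The second clause lets somewhat longer lines through when the title is very frequent.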
Trying fuzzycat clustering with 0.1.13, which allows compressing intermediate
artifacts (-C).
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python -m \
fuzzycat cluster -t tsandcrawler -C' | pv -l | zstd -T0 -c > cluster.ndj.zst
Using fuzzycat 0.1.13 with compression; all fine until:
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | pv \
-l | parallel -j 16 --block 10M --roundrobin --pipe 'TMPDIR=/bigger/tmp python \
-m fuzzycat cluster -t tsandcrawler -C' | zstd -T0 -c > cluster.ndj.zst
1.58G 6:35:39 [66.5k/s] [ <=> ]
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
real 1013m20.128s
user 2696m14.290s
sys 119m29.419s
A run with --compress and --tmpdir set on parallel worked:
$ time zstdcat \
RefsReleasesMerged/sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
parallel --compress --tmpdir /fast/tmp -j 4 --block 10M --roundrobin --pipe \
'TMPDIR=/bigger/tmp python -m fuzzycat cluster -t tsandcrawler -C' | \
zstd -T0 -c > cluster.ndj.zst
real 1301m26.206s
user 2778m20.635s
sys 140m32.121s
- 21h, finds 5850385 clusters (seems too low)
Sample generation
Created samples, filtered by years (1895, 1955, 1995, 2015) for refs and releases:
- ~114M refs
- ~7M releases
Adjusted tasks.py to use a different sha1 and updated settings.ini with
sample file locations.
First clustering
Key extraction (KE), sorting and clustering took 14h, when the merged dataset is already there (it takes ~80min to convert refs to releases, plus a bit more to concatenate the files).
$ ./run.sh RefsFatcatClusters
real 841m45.169s
user 2872m35.481s
sys 561m14.231s
Resulting file is 154G compressed.
Cluster count and sizes:
$ zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
LC_ALL=C pv -l | LC_ALL=C jq -rc '[(.v|length), .k] | @tsv' > sizes.tsv
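The same size listing can be produced in Python, which also makes it easy to aggregate into a histogram. This sketch assumes only what the jq expression shows: each line is a JSON document with a key `k` and a list of members `v`. The function names are illustrative.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Tuple


def cluster_sizes(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (number of members, cluster key) per JSON line."""
    for line in lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        yield len(doc["v"]), doc["k"]


def size_histogram(lines: Iterable[str]) -> Counter:
    """Count how many clusters exist for each cluster size."""
    return Counter(size for size, _ in cluster_sizes(lines))
```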
Follow up tasks:
- each cluster will have ref and non-ref items
- we want at least one non-ref item
$ skate-cluster -both ...
Will keep only those clusters that contain at least one ref and one non-ref entry.
Found 40257623 clusters, iteration over the 89GB compressed file takes 28min.
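The selection criterion can be sketched in Python as below. This is not the skate-cluster implementation; it assumes that reference entries are marked with `extra.skate.status == "ref"`, as suggested by the jq expression in these notes, and the function names are illustrative.

```python
from typing import Any, Dict


def is_ref(doc: Dict[str, Any]) -> bool:
    """True if the entry is marked as a reference (extra.skate.status == "ref")."""
    return ((doc.get("extra") or {}).get("skate") or {}).get("status") == "ref"


def has_ref_and_release(cluster: Dict[str, Any]) -> bool:
    """True if the cluster holds at least one ref and one non-ref entry."""
    flags = [is_ref(doc) for doc in cluster.get("v", [])]
    # any(): at least one ref; not all(): at least one non-ref.
    return any(flags) and not all(flags)
```

Clusters consisting only of refs, or only of releases, cannot yield a new citation edge and are dropped.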
Raw synopsis:
$ zstdcat sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | \
jq -c -C 'select(.v|length == 2) | [(.v[] | [.ext_ids.doi[0:2], .title[0:10], .ident, .extra.skate.status == "ref"])]' | less -r
Some numbers:
- [ ] number of 2-clusters, where not both entries have a doi?
Verification.
- needed a different batch verifier, since we do not need pairwise comparisons:
$ cut -d ' ' -f 3-4 cluster_ref_verify.tsv | LC_ALL=C sort -S20% | uniq -c | sort -nr
8390899 Status.DIFFERENT Reason.YEAR
6191622 Status.EXACT Reason.DOI
5468805 Status.STRONG Reason.JACCARD_AUTHORS
3848964 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
3306728 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
1263329 Status.STRONG Reason.TOKENIZED_AUTHORS
424441 Status.AMBIGUOUS Reason.UNKNOWN
199157 Status.EXACT Reason.TITLE_AUTHOR_MATCH
138144 Status.AMBIGUOUS Reason.SHORT_TITLE
92054 Status.DIFFERENT Reason.PAGE_COUNT
25122 Status.AMBIGUOUS Reason.BLACKLISTED
22964 Status.EXACT Reason.WORK_ID
17702 Status.STRONG Reason.VERSIONED_DOI
16236 Status.DIFFERENT Reason.COMPONENT
14462 Status.STRONG Reason.PREPRINT_PUBLISHED
9632 Status.STRONG Reason.PMID_DOI_PAIR
3429 Status.STRONG Reason.ARXIV_VERSION
3288 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
729 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
195 Status.STRONG Reason.FIGSHARE_VERSION
76 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
74 Status.DIFFERENT Reason.TITLE_FILENAME
43 Status.DIFFERENT Reason.NUM_DIFF
22 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
11 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
1 Status.STRONG Reason.CUSTOM_BSI_UNDATED
Guessing: Maybe 30% "strong", so maybe ~120M new edges?
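To check such a guess against the verification output, the status and reason columns can be tallied and a status share computed. The sketch assumes whitespace-separated rows with the status in column 3 and the reason in column 4, matching the `cut -f 3-4` call above; the function names are illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_verdicts(lines: Iterable[str]) -> Counter:
    """Count (status, reason) pairs, like `cut -f 3-4 | sort | uniq -c`."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[(fields[2], fields[3])] += 1
    return counts


def status_share(counts: Counter, status: str) -> float:
    """Fraction of all rows that carry the given status."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for (s, _reason), n in counts.items() if s == status)
    return hits / total
```

`tally_verdicts(...).most_common()` gives the same descending order as the final `sort -nr`.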
Manual sampling and issues
https://fatcat.wiki/release/tiqp3w67sjhzdorc6whizpnbyy https://fatcat.wiki/release/lbmqfamyoveldeyvv5xktq5ayi Status.DIFFERENT Reason.YEAR
Grobid output:
<biblStruct xml:id="b77">
<analytic>
<title level="a" type="main">The Social Construction of Planning Systems: A Strategic-Relational Institutionalist Approach</title>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">L</forename><surname>Servillo</surname></persName>
</author>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Van Den</surname></persName>
</author>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">P</forename><surname>Broeck</surname></persName>
</author>
<idno type="DOI">10.1080/02697459.2012.661179></idno>
<idno>En línea] 2012 [Fecha de consulta: 21 de agosto 2015</idno>
<ptr target="<http://dx.doi.org/10.1080/02697459.2012.661179>" />
</analytic>
<monogr>
<title level="j">En: Planning Practice and Research</title>
<imprint>
<biblScope unit="volume">27</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="41" to="61" />
</imprint>
</monogr>
</biblStruct>
There are dates, but no explicit, clean 2012.
Another issue:
https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
Very similar titles:
"... nephrotic syndrome in childhood" vs "... nephrotic syndrome in childred" ...
- years do not match, but fuzzycat does not check for that (1995 vs. 2004 in the refs)
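One possible guard for this case is a year comparison with a small tolerance, applied before accepting a title and author match. This is a hypothetical sketch and not part of fuzzycat; the function name and the default tolerance of one year are assumptions.

```python
from typing import Optional


def years_compatible(a: Optional[int], b: Optional[int], tolerance: int = 1) -> bool:
    """Return False only if both years are known and differ by more than tolerance."""
    if a is None or b is None:
        # A missing year gives no evidence either way.
        return True
    return abs(a - b) <= tolerance


if __name__ == "__main__":
    print(years_compatible(1995, 2004))  # the pair from the example above
```

Whether a missing year should pass or fail is a design choice; passing keeps recall high for refs without a parsed date.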
Intermediate match results:
141970958 Status.DIFFERENT Reason.YEAR
106734288 Status.EXACT Reason.DOI
91205561 Status.STRONG Reason.JACCARD_AUTHORS
66894403 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
53693804 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
20889423 Status.STRONG Reason.TOKENIZED_AUTHORS
7449880 Status.AMBIGUOUS Reason.UNKNOWN
3507120 Status.EXACT Reason.TITLE_AUTHOR_MATCH
1199761 Status.DIFFERENT Reason.PAGE_COUNT
1121611 Status.AMBIGUOUS Reason.SHORT_TITLE
395710 Status.EXACT Reason.WORK_ID
362089 Status.DIFFERENT Reason.COMPONENT
351654 Status.AMBIGUOUS Reason.BLACKLISTED
326730 Status.STRONG Reason.VERSIONED_DOI
239924 Status.STRONG Reason.PREPRINT_PUBLISHED
171594 Status.STRONG Reason.PMID_DOI_PAIR
54646 Status.STRONG Reason.ARXIV_VERSION
49248 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
17135 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
5219 Status.DIFFERENT Reason.TITLE_FILENAME
2451 Status.AMBIGUOUS Reason.APPENDIX
1874 Status.STRONG Reason.FIGSHARE_VERSION
1231 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
774 Status.DIFFERENT Reason.NUM_DIFF
448 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
123 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
17 Status.STRONG Reason.CUSTOM_BSI_UNDATED
17 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
6 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
Another false negative:
- https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a
- http://real.mtak.hu/78943/1/acs.jctc.8b00072.pdf, https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji
https://fatcat.wiki/release/sqrld55t4zdrhf23oq75azo67a https://fatcat.wiki/release/gx7owpu4gbcglfwlyzdh5qlfji Status.DIFFERENT Reason.YEAR
Both docs contain 1972?
<biblStruct xml:id="b67">
<analytic>
<title level="a" type="main">Variational Wavefunctions for H2 +</title>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">F</forename><surname>Weinhold</surname></persName>
</author>
</analytic>
<monogr>
<title level="j">J. Chem. Phys</title>
<imprint>
<biblScope unit="volume">56</biblScope>
<biblScope unit="page" from="3798" to="3801" />
<date type="published" when="1972" />
</imprint>
</monogr>
</biblStruct>
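To confirm what year the parsed reference carries, the `when` attribute of the published date can be read from the fragment. A minimal sketch using the standard library, assuming well-formed XML; element names are matched by local name so that it works with or without the TEI namespace. The function name is illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def published_year(bibl_struct_xml: str) -> Optional[str]:
    """Return the year from the first <date type="published" when="...">, if any."""
    root = ET.fromstring(bibl_struct_xml)
    for el in root.iter():
        # Strip a "{namespace}" prefix, if present, to compare local names.
        if el.tag.rsplit("}", 1)[-1] != "date":
            continue
        when = el.get("when")
        if el.get("type") == "published" and when:
            return when[:4]
    return None


if __name__ == "__main__":
    # Trimmed from the Grobid output quoted above.
    fragment = (
        '<biblStruct xml:id="b67"><monogr><imprint>'
        '<date type="published" when="1972" />'
        "</imprint></monogr></biblStruct>"
    )
    print(published_year(fragment))  # prints 1972
```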
Running:
$ time zstdcat -T0 sha1-ef1756a5856085807742966f48d95b4cb00299a0.json.zst | parallel --tmpdir /bigger/tmp --blocksize 4M --pipe -j 16 'python -m fuzzycat verify_ref' > cluster_ref_verify.tsv
This resulted in a 69GB TSV file and took 3056m5.322s (~51h), for 512033197 comparisons.
Stats:
$ TMPDIR=/bigger/tmp LC_ALL=C time zstdcat -T0 \
cluster_ref_verify_2021_02_16.tsv.zst | cut -d ' ' -f 3-4 | TMPDIR=/bigger/tmp \
LC_ALL=C sort -S20% | uniq -c | sort -nr
146095427 Status.DIFFERENT Reason.YEAR
110052214 Status.EXACT Reason.DOI
94300998 Status.STRONG Reason.JACCARD_AUTHORS
68986574 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY
55199653 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH
21545821 Status.STRONG Reason.TOKENIZED_AUTHORS
7746937 Status.AMBIGUOUS Reason.UNKNOWN
3626713 Status.EXACT Reason.TITLE_AUTHOR_MATCH
1265506 Status.DIFFERENT Reason.PAGE_COUNT
1171178 Status.AMBIGUOUS Reason.SHORT_TITLE
409043 Status.EXACT Reason.WORK_ID
374051 Status.DIFFERENT Reason.COMPONENT
356772 Status.AMBIGUOUS Reason.BLACKLISTED
336588 Status.STRONG Reason.VERSIONED_DOI
249723 Status.STRONG Reason.PREPRINT_PUBLISHED
177547 Status.STRONG Reason.PMID_DOI_PAIR
56445 Status.STRONG Reason.ARXIV_VERSION
51776 Status.STRONG Reason.CUSTOM_IEEE_ARXIV
17887 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
5255 Status.DIFFERENT Reason.TITLE_FILENAME
2451 Status.AMBIGUOUS Reason.APPENDIX
1946 Status.STRONG Reason.FIGSHARE_VERSION
1263 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN
798 Status.DIFFERENT Reason.NUM_DIFF
463 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916
125 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT
18 Status.STRONG Reason.CUSTOM_BSI_UNDATED
18 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288
7 Status.STRONG Reason.CUSTOM_BSI_SUBDOC
286M positive links.
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | awk '{print $1}' | paste -sd+ | bc
286008492
Or 175M, if we exclude DOI and work matches.
$ grep -E "Status.STRONG|Status.EXACT" version_1_fuzzy_stats.txt | grep -Ev "Reason.DOI|Reason.WORK_ID" | awk '{print $1}' | paste -sd+ | bc
175547235
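The two sums can also be reproduced in Python from the stats lines (count, status, reason). The sketch below matches reasons exactly instead of using a regular expression; the function name and the `exclude_reasons` parameter are illustrative.

```python
from typing import FrozenSet, Iterable

POSITIVE = ("Status.STRONG", "Status.EXACT")


def positive_links(
    lines: Iterable[str], exclude_reasons: FrozenSet[str] = frozenset()
) -> int:
    """Sum the counts of STRONG and EXACT rows, optionally skipping some reasons."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        count, status, reason = fields
        if status in POSITIVE and reason not in exclude_reasons:
            total += int(count)
    return total
```

Calling it on the stats lines with `exclude_reasons=frozenset({"Reason.DOI", "Reason.WORK_ID"})` corresponds to the second command.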
The final derivation dep tree looks like:
$ ./tasks.py -d BiblioRef
\_ BiblioRef(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ BiblioRefFuzzy(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatClusterVerify(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatClusters(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatSortedKeys(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsReleasesMerged(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsToRelease(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ BiblioRefFromJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatGroupJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsFatcatPMCIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ FatcatPMCID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsFatcatArxivJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsArxiv(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ RefsFatcatPMIDJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsPMID(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()
\_ RefsFatcatDOIJoin(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ FatcatDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ ReleaseExportExpanded()
\_ RefsDOIsLower(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ RefsDOIs(sha1=ef1756a5856085807742966f48d95b4cb00299a0)
\_ Input()