where should the code for this live? fatcat-scholar, sandcrawler, or fatcat?
preferably the fatcat repo, I guess.
today (2020-09-04), 29.7 million releases have some refs in fatcat, and an
additional 20 million have fulltext in fatcat (roughly 50 million total). there
are 969,541,971 total references already in fatcat, so expect something on the
order of 1.6 billion references in the output.
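back-of-envelope for that 1.6 billion figure (assumes the ~20 million fulltext-only releases will average about as many refs as the current refs-bearing ones; rough arithmetic, not measured):

    echo "969541971 / 29700000" | bc -l
    # => ~32.6 refs per refs-bearing release
    echo "32.6 * 50000000" | bc -l
    # => ~1.63 billion refs across ~50 million releases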
only about 3.75 million references coming from wikipedia (en) with a persistent
identifier.
first version of the tool cruises along in a single thread at roughly 300-330 docs/sec, or about 1 million docs/hour:
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_00.json.gz | head -n1000000 | pv -l | python -m fatcat_scholar.transform run_refs > /bigger/scholar/fatcat_scholar_work_fulltext.split_00.1m.refs.json
1M 0:53:13 [ 313 /s]
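rough projection at that single-thread rate (back-of-envelope only; assumes the ~50 million works estimated above and the measured 313 docs/sec):

    echo "50000000 / 313 / 3600" | bc -l
    # => ~44 hours single-threaded; the 8-way parallel run further down brings this well under a day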
wc -l /bigger/scholar/fatcat_scholar_work_fulltext.split_00.1m.refs.json
=> 9,242,758
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .release_ident -r | sort -u | wc -l
=> 282,568
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .ref_source -r | sort | uniq -c | sort -nr
4760872 crossref
2847014 grobid
735459 pubmed
683909 datacite
215504 fatcat (probably GROBID, jstage, etc)
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | rg wikipedia.org > wikipedia_urls.tsv
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | rg wikipedia.org | wc -l
=> 523
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | rg archive.org > archive_urls.tsv
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | rg archive.org | wc -l
=> 122
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.pmid -r | rg -v '^null$' | wc -l
=> 500036
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.doi -r | rg -v '^null$' | wc -l
=> 3636175
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | rg -v doi.org/ | wc -l
cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | cut -f3 -d/ | sort | uniq -c | sort -nr | head -n50
35233 doi.org
26263 dx.doi.org
1539 www.ncbi.nlm.nih.gov
1518
843 arxiv.org
821 www.who.int
670 www.scielo.br
642 goo.gl
463 ec.europa.eu
449 bit.ly
434 www.cdc.gov
431 www.jstor.org
425 www.sciencedirect.com
372 www.planalto.gov.br
356 en.wikipedia.org
331 www.w3.org
308 creativecommons.org
306 www.youtube.com
295 www.nytimes.com
278 ssrn.com
[...]
TODO:
x year/date of the *citing* document
x 'unstructured' in biblio
x contrib_raw names Optional (to save on key storage space)
x basic URL cleanup (hedged sketch after this list)
x GROBID *and* fatcat for all lines (?)
x more GROBID refs fields (grobid2json)
    biblStruct
    analytic
    x arXiv:nucl-th/0007068
    x 10.5354/0719-3769.1979.16458
    x PMC3112331
    x 16330524
    ISBN 0-674- 21298-3
    x
    imprint
    x Cambridge University Press
    x
    x 18346083
    arXiv preprint
    x
- debug pubmed refs
- title seems to not come through from fatcat
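for the "basic URL cleanup" item above, a minimal sketch of the sort of normalization presumably involved (hypothetical rules, not necessarily what the transform actually implements): drop embedded whitespace, strip trailing punctuation, keep only http(s) URLs:

    cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | sed -E 's/[[:space:]]+//g; s/[").,;]+$//' | rg '^https?://' | head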
resources:
- https://git.archive.org/webgroup/fatcat/-/blob/bnewbold-citation-graph-brainstorm/proposals/202008_bulk_citation_graph.md
- https://docs.citationstyles.org/en/1.0.1/specification.html#appendix-iv-variables
- https://guide.fatcat.wiki/entity_release.html
- https://analytics.wikimedia.org/published/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
open questions:
- how many citations? how large is this corpus on-disk?
    1mil => 2.6gb (uncompressed)
    150mil => 390gb (uncompressed)
- what fraction...
    have an external identifier (quick match)
    look like junk
    have a URL
- how many references to wikipedia? assuming via URL
- how many references to IA services?
    => archive.org, web.archive.org, archive-it, openlibrary.org, etc
    => top resources
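a quick way to get at the IA-services question from the same 1M-doc sample, following the same jq/rg pattern as the wikipedia and archive.org counts above (web.archive.org is covered by the archive.org pattern; archive-it.org and openlibrary.org did not show up in the top-50 hosts, so expect small numbers):

    cat fatcat_scholar_work_fulltext.split_00.1m.refs.json | jq .biblio.url -r | rg -v '^null$' | rg '(archive\.org|archive-it\.org|openlibrary\.org)' | cut -f3 -d/ | sort | uniq -c | sort -nr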
----------
running larger batch:
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_01.json.gz | pv -l | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_refs > /bigger/scholar/fatcat_scholar_work_fulltext.split_01.refs.json
=> 24.7M 5:33:27 [1.23k/s]
123G Sep 14 12:56 fatcat_scholar_work_fulltext.split_01.refs.json
pigz fatcat_scholar_work_fulltext.split_01.refs.json
du -sh fatcat_scholar_work_fulltext.split_01.refs.json.gz
24G fatcat_scholar_work_fulltext.split_01.refs.json.gz
zcat fatcat_scholar_work_fulltext.split_01.refs.json.gz | wc -l
285,551,233
Expecting a bit below 2 billion references in total, which is above the earlier
1.6 billion estimate; presumably at least partly because of duplication (the
same underlying reference can show up from more than one source for a given
work).
This really blows up in size, presumably because things like release+work
idents don't compress well and are duplicated for each reference. JSON overhead
should almost entirely compress away.
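per-record back-of-envelope from the split_01 numbers above (treating the reported sizes as GiB; only approximate):

    echo "123 * 1024^3 / 285551233" | bc
    # => ~462 bytes per reference row uncompressed
    echo "24 * 1024^3 / 285551233" | bc
    # => ~90 bytes per reference row compressed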
Let's do the rest of these so we can upload the whole thing as a corpus (7
splits at roughly 24 GByte each => an estimated 168 GByte compressed).
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_00.json.gz | pv -l | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_refs | gzip > /bigger/scholar/fatcat_scholar_work_fulltext.split_00.refs.json.gz
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_02.json.gz | pv -l | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_refs | gzip > /bigger/scholar/fatcat_scholar_work_fulltext.split_02.refs.json.gz
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_03.json.gz | pv -l | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_refs | gzip > /bigger/scholar/fatcat_scholar_work_fulltext.split_03.refs.json.gz
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_04.json.gz | pv -l | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_refs | gzip > /bigger/scholar/fatcat_scholar_work_fulltext.split_04.refs.json.gz
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_05.json.gz | pv -l | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_refs | gzip > /bigger/scholar/fatcat_scholar_work_fulltext.split_05.refs.json.gz
zcat /grande/snapshots/fatcat_scholar_work_fulltext.split_06.json.gz | pv -l | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_refs | gzip > /bigger/scholar/fatcat_scholar_work_fulltext.split_06.refs.json.gz
Something went wrong part-way through split_04:
5.55M 1:12:30 [1.12k/s]
[...]
json.decoder.JSONDecodeError: Expecting value: line 1 column 853 (char 852)
Disk corruption or something? This same input parsed fine with a very similar
command recently. Hrm. Re-ran it and the second attempt completed successfully.
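if it recurs, a quick way to rule disk/stream corruption in or out (plain gzip integrity check; this doesn't pinpoint which JSON line is bad):

    gzip -t /grande/snapshots/fatcat_scholar_work_fulltext.split_04.json.gz && echo "gzip stream OK"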