# Notes on wikipedia_citations_2020-07-14

* https://archive.org/details/wikipedia_citations_2020-07-14
* https://zenodo.org/record/3940692
* https://github.com/Harshdeep1996/cite-classifications-wiki

```
.
├── [6.6G]  citations_from_wikipedia.zip
├── [819M]  lookup_data.zip
├── [1.4G]  minimal_dataset.zip
├── [ 91K]  wikipedia_citations_2020-07-14_archive.torrent
├── [2.0K]  wikipedia_citations_2020-07-14_files.xml
├── [ 20K]  wikipedia_citations_2020-07-14_meta.sqlite
└── [1.3K]  wikipedia_citations_2020-07-14_meta.xml
```

Converted the Parquet files to JSON with `parquet-tools cat --json`
(https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line).

1442176 DOI references, 1027006 of them unique. Most referenced on Wikipedia:

```
4393 10.1073/pnas.242603899
3182 10.1101/gr.2596504
2307 10.24436/2
2079 10.1038/ng1285
1447 10.1007/BF00171763
1357 10.1051/0004-6361:20078357
1346 10.1038/nature04209
1293 10.1016/0378-1119(94)90802-8
1246 10.1016/S0378-1119(97)00411-3
 927 10.1111/j.1096-3642.2005.00153.x
 738 10.1016/j.cell.2006.09.026
 657 10.1101/gr.4039406
 631 10.1101/gr.6.9.791
 607 10.1038/msb4100134
 602 10.1101/gr.143000
 591 10.5194/hess-11-1633-2007
 531 10.1101/gr.GR1547R
 492 10.1080/002229300299282
 480 10.1101/gr.2576704
 460 10.1093/nar/gkj139
```

* https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md
* `source_wikipedia_article: Optional[str]`

About 29M citations. `ID_list` uses various ID types:

```
$ cat minimal_dataset.json | jq -rc 'select(.ID_list != null) | .ID_list' | tr ',' '\n' | tr -d '{}' | sed -e 's@^ *@@' | cut -d '=' -f 1 | sort | uniq -c | sort -nr
```

Counts per ID type, excluding artifacts:

```
2160818 ISBN
1442176 DOI
 825970 PMID
 353425 ISSN
 279369 PMC
 185742 OCLC
 181375 BIBCODE
 110920 JSTOR
  47601 ARXIV
  15202 LCCN
  12878 MR
   8270 ASIN
   6293 OL
   3790 SSRN
   3013 ZBL
    413 OSTI
    357 JFM
    277 USENETID
     85 RFC
     78 ISMN
```
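
The `jq` pipeline above can also be sketched in Python, which makes it easy to get the ID-type counts and the total/unique DOI counts in one pass. This is a rough sketch, not the method used for the numbers above. It assumes that `minimal_dataset.json` holds one JSON object per line and that `ID_list` is a string of the form `{TYPE=value, TYPE=value}`, which is what the `tr`/`cut` steps suggest. The function names `parse_id_list` and `count_ids` and the sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Tuple


def parse_id_list(id_list: str) -> Iterator[Tuple[str, str]]:
    """Yield (id_type, value) pairs from a string like '{DOI=10.1/x, PMID=42}'.

    Assumption: entries are comma-separated and wrapped in one pair of braces.
    Unlike the shell pipeline, fragments without '=' are skipped here.
    """
    for part in id_list.strip().strip("{}").split(","):
        key, sep, value = part.strip().partition("=")
        if sep and key:
            yield key, value.strip()


def count_ids(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (count per ID type, count per DOI) over JSON lines."""
    type_counts: Counter = Counter()
    doi_counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        id_list = json.loads(line).get("ID_list")
        if not id_list:  # mirrors jq's select(.ID_list != null)
            continue
        for id_type, value in parse_id_list(id_list):
            type_counts[id_type] += 1
            if id_type == "DOI":
                doi_counts[value] += 1
    return type_counts, doi_counts


if __name__ == "__main__":
    # Made-up sample records, not taken from the dataset.
    sample = [
        '{"ID_list": "{DOI=10.0000/example-a, PMID=1}"}',
        '{"ID_list": "{DOI=10.0000/example-a}"}',
        '{"ID_list": null}',
    ]
    types, dois = count_ids(sample)
    print(types.most_common())            # ID types, most frequent first
    print(sum(dois.values()), len(dois))  # total DOIs, unique DOIs
    print(dois.most_common(20))           # most referenced DOIs
```

For the real file, `count_ids(open("minimal_dataset.json"))` would stream it line by line. Values that themselves contain a comma would be split incorrectly, the same as with `tr ',' '\n'`, which is one likely source of the artifacts.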