# Notes on wikipedia_citations_2020-07-14
* https://archive.org/details/wikipedia_citations_2020-07-14
* https://zenodo.org/record/3940692
* https://github.com/Harshdeep1996/cite-classifications-wiki
```
.
├── [6.6G] citations_from_wikipedia.zip
├── [819M] lookup_data.zip
├── [1.4G] minimal_dataset.zip
├── [ 91K] wikipedia_citations_2020-07-14_archive.torrent
├── [2.0K] wikipedia_citations_2020-07-14_files.xml
├── [ 20K] wikipedia_citations_2020-07-14_meta.sqlite
└── [1.3K] wikipedia_citations_2020-07-14_meta.xml
```
Converted the Parquet files to JSON with `parquet-tools cat --json`
(see https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line).
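A loop along these lines would convert a whole directory of Parquet files in one go. It is only a sketch under assumptions: `parquet_dir_to_json` and the `PARQUET_TOOLS` override are hypothetical names, the layout of the unpacked zip files is not described in these notes, and `parquet-tools cat --json` is assumed to write its records to stdout.

```shell
# Hypothetical helper: run `parquet-tools cat --json` on every *.parquet file
# directly inside a directory ($1) and collect the output in one file ($2).
# PARQUET_TOOLS is an assumed override for the executable name; it is left
# unquoted on purpose so a wrapper command or a test stub can be substituted.
parquet_dir_to_json() {
  for f in "$1"/*.parquet; do
    # Skip the unexpanded pattern when the directory has no Parquet files.
    [ -e "$f" ] || continue
    ${PARQUET_TOOLS:-parquet-tools} cat --json "$f"
  done > "$2"
}
```

Usage might look like `parquet_dir_to_json ./minimal_dataset minimal_dataset.json`, where the directory name is a guess at how the archive unpacks.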
1442176 DOIs in total, 1027006 of them unique.
Most referenced DOIs on Wikipedia (WP):
```
4393 10.1073/pnas.242603899
3182 10.1101/gr.2596504
2307 10.24436/2
2079 10.1038/ng1285
1447 10.1007/BF00171763
1357 10.1051/0004-6361:20078357
1346 10.1038/nature04209
1293 10.1016/0378-1119(94)90802-8
1246 10.1016/S0378-1119(97)00411-3
927 10.1111/j.1096-3642.2005.00153.x
738 10.1016/j.cell.2006.09.026
657 10.1101/gr.4039406
631 10.1101/gr.6.9.791
607 10.1038/msb4100134
602 10.1101/gr.143000
591 10.5194/hess-11-1633-2007
531 10.1101/gr.GR1547R
492 10.1080/002229300299282
480 10.1101/gr.2576704
460 10.1093/nar/gkj139
```
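The totals and the ranking above could be reproduced with a pipeline along these lines. This is a minimal sketch, not necessarily the commands used for these notes: `extract_dois`, `doi_totals` and `top_dois` are hypothetical helpers, and they assume each `ID_list` value looks like `{DOI=10.1000/x, PMID=123}`, which is inferred from the `tr`/`cut` steps used for the identifier types in these notes.

```shell
# Hypothetical helpers. Input on stdin: one ID_list string per line,
# assumed to look like {KEY=value, KEY=value}. Values that themselves
# contain a comma are split incorrectly.

# Print one DOI per line.
extract_dois() {
  tr -d '{}' |
    tr ',' '\n' |
    sed -e 's/^ *//' |
    grep '^DOI=' |
    cut -d '=' -f 2-
}

# Print "<total> <unique>" DOI counts.
doi_totals() {
  extract_dois |
    LC_ALL=C sort |
    uniq -c |
    awk '{total += $1} END {print total + 0, NR}'
}

# Print the N most frequent DOIs (default 20) as "<count> <doi>".
top_dois() {
  extract_dois |
    LC_ALL=C sort |
    uniq -c |
    awk '{print $1, $2}' |
    LC_ALL=C sort -k1,1nr -k2,2 |
    head -n "${1:-20}"
}
```

For example, `jq -rc 'select(.ID_list != null) | .ID_list' minimal_dataset.json | top_dois 20` would feed the helpers from the converted dataset.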
* https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md
* `source_wikipedia_article: Optional[str]`
About 29M citations. The `ID_list` field uses various identifier types:
```
$ jq -rc 'select(.ID_list != null) | .ID_list' minimal_dataset.json |
    tr ',' '\n' |
    tr -d '{}' |
    sed -e 's@^ *@@' |
    cut -d '=' -f 1 |
    sort |
    uniq -c |
    sort -nr
```
Result, excluding artifacts:
```
2160818 ISBN
1442176 DOI
825970 PMID
353425 ISSN
279369 PMC
185742 OCLC
181375 BIBCODE
110920 JSTOR
47601 ARXIV
15202 LCCN
12878 MR
8270 ASIN
6293 OL
3790 SSRN
3013 ZBL
413 OSTI
357 JFM
277 USENETID
85 RFC
78 ISMN
```
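What counts as an artifact is not spelled out in these notes. One plausible filter, shown here as a sketch: keep only keys that are followed by `=` and consist of uppercase ASCII letters, which drops fragments produced when a value itself contains a comma. `id_type_counts` is a hypothetical helper, and the uppercase-only rule is an assumption, not necessarily how the table above was produced.

```shell
# Hypothetical helper. Input on stdin: one ID_list string per line, assumed
# to look like {KEY=value, KEY=value}. Output: "<count> <KEY>" per
# identifier type, most frequent first.
id_type_counts() {
  tr -d '{}' |
    tr ',' '\n' |
    sed -e 's/^ *//' |
    # -s drops fragments that contain no '=' at all.
    cut -s -d '=' -f 1 |
    # Assumed artifact filter: keep keys made of uppercase letters only.
    LC_ALL=C grep -E '^[A-Z]+$' |
    LC_ALL=C sort |
    uniq -c |
    awk '{print $1, $2}' |
    LC_ALL=C sort -k1,1nr -k2,2
}
```

It reads the same `jq` output as the pipeline above, so the `jq` step can be piped straight into `id_type_counts`.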