# Notes on wikipedia_citations_2020-07-14

* https://archive.org/details/wikipedia_citations_2020-07-14
* https://zenodo.org/record/3940692
* https://github.com/Harshdeep1996/cite-classifications-wiki

```
.
├── [6.6G]  citations_from_wikipedia.zip
├── [819M]  lookup_data.zip
├── [1.4G]  minimal_dataset.zip
├── [ 91K]  wikipedia_citations_2020-07-14_archive.torrent
├── [2.0K]  wikipedia_citations_2020-07-14_files.xml
├── [ 20K]  wikipedia_citations_2020-07-14_meta.sqlite
└── [1.3K]  wikipedia_citations_2020-07-14_meta.xml
```

Using `parquet-tools cat --json`
(https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line)
to convert the Parquet files to JSON.

1442176 DOIs in total, 1027006 of them unique.

Most referenced DOIs on Wikipedia (WP), with reference counts:

```
   4393 10.1073/pnas.242603899
   3182 10.1101/gr.2596504
   2307 10.24436/2
   2079 10.1038/ng1285
   1447 10.1007/BF00171763
   1357 10.1051/0004-6361:20078357
   1346 10.1038/nature04209
   1293 10.1016/0378-1119(94)90802-8
   1246 10.1016/S0378-1119(97)00411-3
    927 10.1111/j.1096-3642.2005.00153.x
    738 10.1016/j.cell.2006.09.026
    657 10.1101/gr.4039406
    631 10.1101/gr.6.9.791
    607 10.1038/msb4100134
    602 10.1101/gr.143000
    591 10.5194/hess-11-1633-2007
    531 10.1101/gr.GR1547R
    492 10.1080/002229300299282
    480 10.1101/gr.2576704
    460 10.1093/nar/gkj139
```

* https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md
* `source_wikipedia_article: Optional[str]`


About 29M citations. The `ID_list` field uses various identifier types:

```
$ cat minimal_dataset.json |
    jq -rc 'select(.ID_list != null) | .ID_list' |
    tr ',' '\n' |
    tr -d '{}' |
    sed -e 's@^ *@@' |
    cut -d '=' -f 1 |
    sort |
    uniq -c |
    sort -nr
```

Counts per identifier type, with artifacts omitted:

```
2160818 ISBN
1442176 DOI
 825970 PMID
 353425 ISSN
 279369 PMC
 185742 OCLC
 181375 BIBCODE
 110920 JSTOR
  47601 ARXIV
  15202 LCCN
  12878 MR
   8270 ASIN
   6293 OL
   3790 SSRN
   3013 ZBL
    413 OSTI
    357 JFM
    277 USENETID
     85 RFC
     78 ISMN
```