# Notes on wikipedia_citations_2020-07-14

* https://archive.org/details/wikipedia_citations_2020-07-14
* https://zenodo.org/record/3940692
* https://github.com/Harshdeep1996/cite-classifications-wiki

```
.
├── [6.6G]  citations_from_wikipedia.zip
├── [819M]  lookup_data.zip
├── [1.4G]  minimal_dataset.zip
├── [ 91K]  wikipedia_citations_2020-07-14_archive.torrent
├── [2.0K]  wikipedia_citations_2020-07-14_files.xml
├── [ 20K]  wikipedia_citations_2020-07-14_meta.sqlite
└── [1.3K]  wikipedia_citations_2020-07-14_meta.xml
```

Used `parquet-tools cat --json`
(see https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line)
to convert the parquet files to JSON.

About 1442176 citations with a DOI, 1027006 of them unique DOIs.

Most referenced DOIs on Wikipedia (citation count, DOI):

```
   4393 10.1073/pnas.242603899
   3182 10.1101/gr.2596504
   2307 10.24436/2
   2079 10.1038/ng1285
   1447 10.1007/BF00171763
   1357 10.1051/0004-6361:20078357
   1346 10.1038/nature04209
   1293 10.1016/0378-1119(94)90802-8
   1246 10.1016/S0378-1119(97)00411-3
    927 10.1111/j.1096-3642.2005.00153.x
    738 10.1016/j.cell.2006.09.026
    657 10.1101/gr.4039406
    631 10.1101/gr.6.9.791
    607 10.1038/msb4100134
    602 10.1101/gr.143000
    591 10.5194/hess-11-1633-2007
    531 10.1101/gr.GR1547R
    492 10.1080/002229300299282
    480 10.1101/gr.2576704
    460 10.1093/nar/gkj139
```
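
A minimal sketch of how totals and a ranking like the above could be computed from the JSON-lines output. It assumes each record exposes its DOI under a top-level `doi` key. That key name is a placeholder, not taken from the dataset, and the helper names `clean_doi` and `count_dois` are illustrative too.

```python
import json
from collections import Counter
from typing import Iterable, Optional


def clean_doi(raw: object) -> Optional[str]:
    """Trim whitespace from a DOI value; return None if it is missing or empty."""
    if not isinstance(raw, str):
        return None
    doi = raw.strip()
    return doi or None


def count_dois(lines: Iterable[str], key: str = "doi") -> Counter:
    """Count DOI occurrences in an iterable of JSON-lines strings.

    `key` is an assumed field name; adjust it to the real schema.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        doi = clean_doi(record.get(key))
        if doi is not None:
            counts[doi] += 1
    return counts
```

With `counts = count_dois(open("citations.json"))` (file name hypothetical), `sum(counts.values())` gives the total number of DOI citations, `len(counts)` the number of unique DOIs, and `counts.most_common(20)` the ranking. DOIs are compared case-sensitively here, so differently cased spellings of the same DOI are counted separately.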

* https://git.archive.org/webgroup/fatcat/-/blob/10eb30251f89806cb7a0f147f427c5ea7e5f9941/proposals/2021-01-29_citation_api.md
* `source_wikipedia_article: Optional[str]`