# COCI Notes

* [https://opencitations.net/download](https://opencitations.net/download)
* [https://figshare.com/articles/dataset/Crossref_Open_Citation_Index_CSV_dataset_of_all_the_citation_data/6741422/9](https://figshare.com/articles/dataset/Crossref_Open_Citation_Index_CSV_dataset_of_all_the_citation_data/6741422/9)

> 6741422v9.zip [19G]

> Dump created on 2020-12-07. This dump includes information on:

* 60,778,357 bibliographic resources;
* 759,516,507 citation links.


```
extracted/2020-06-13T18_18_05_1-2.zip
extracted/2020-08-20T18_12_28_1-2.zip
extracted/2020-04-25T04_48_36_1-5.zip
extracted/2020-11-22T17_48_01_1-3.zip
extracted/2020-01-13T19_31_19_1-4.zip
extracted/2019-10-21T22_41_20_1-63.zip
```

* These archives extract to 79 CSV files.

Raw data example:

```
oci,citing,cited,creation,timespan,journal_sc,author_sc
02003080406360106010101060909370200010237070005020502-02001000106361937231430122422370200000837000737000200,10.3846/16111699.2012.705252,10.1016/j.neucom.2008.07.020,2012-10-04,P3Y0M,no,no
02003080406360106010101060909370200010237070005020502-0200308040636010601016301060909370200000837093701080963010908,10.3846/16111699.2012.705252,10.3846/1611-1699.2008.9.189-198,2012-10-04,P4Y0M4D,yes,no
02003080406360106010101060909370200010237070005020502-02001000106361937102818141224370200000737000237000003,10.3846/16111699.2012.705252,10.1016/j.asieco.2007.02.003,2012-10-04,P5Y6M,no,no
02003080406360106010101060909370200010237070005020502-02003080406360106010101060909370200010137050505030808,10.3846/16111699.2012.705252,10.3846/16111699.2011.555388,2012-10-04,P1Y5M22D,yes,no
...
```
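To compare against other match lists, only the `citing` and `cited` columns are needed. A minimal sketch of pulling DOI pairs out of the raw CSV, using two rows copied from the example above. The helper name `iter_doi_pairs` is hypothetical, and lowercasing the DOIs is an assumption made here for easier comparison, not something the dump requires.

```python
import csv
import io

# Header and two rows copied from the raw data example above.
SAMPLE = """oci,citing,cited,creation,timespan,journal_sc,author_sc
02003080406360106010101060909370200010237070005020502-02001000106361937231430122422370200000837000737000200,10.3846/16111699.2012.705252,10.1016/j.neucom.2008.07.020,2012-10-04,P3Y0M,no,no
02003080406360106010101060909370200010237070005020502-0200308040636010601016301060909370200000837093701080963010908,10.3846/16111699.2012.705252,10.3846/1611-1699.2008.9.189-198,2012-10-04,P4Y0M4D,yes,no
"""


def iter_doi_pairs(lines):
    """Yield (citing, cited) DOI pairs from an iterable of COCI CSV lines.

    Assumption: DOIs are lowercased so pairs compare equal regardless of case.
    """
    for row in csv.DictReader(lines):
        yield row["citing"].strip().lower(), row["cited"].strip().lower()


pairs = list(iter_doi_pairs(io.StringIO(SAMPLE)))
for citing, cited in pairs:
    print(f"{citing},{cited}")
```

For the full dump, the same function should work on any text stream of CSV lines, for example one decompressed from the unified file.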

For comparison with COCI, we also need our own matches as a DOI-to-DOI list.

Example approach:

* extract (source, target) release ident pairs, sorted by source ident
* from the fatcat db dump, extract release idents and their ext ids, sorted by ident
* "zip together" the two sorted lists (a merge join on the ident)
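The "zip together" step can be sketched as a merge join over two inputs sorted by release ident. This is only an illustration under assumptions: the function names `merge_join` and `doi_pairs` are hypothetical, each ident is assumed to map to at most one DOI, and the in-memory `sorted()` calls stand in for whatever external sort the real data volume would need. The sample idents and DOIs are taken from the example further down in these notes.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(pairs, idents):
    """Join two iterables that are both sorted by their first element.

    pairs:  (ident, payload) tuples, ident may repeat
    idents: (ident, doi) tuples, ident assumed unique
    Yields (doi, payload) for every pair whose ident has a DOI.
    """
    idents = iter(idents)
    current = next(idents, None)
    for ident, group in groupby(pairs, key=itemgetter(0)):
        # Advance the ident list until it catches up with the pair list.
        while current is not None and current[0] < ident:
            current = next(idents, None)
        if current is None:
            break
        if current[0] == ident:
            for _, payload in group:
                yield current[1], payload


def doi_pairs(refs, idents):
    """Resolve (source_ident, target_ident) pairs to (source_doi, target_doi)."""
    idents = sorted(idents)
    # Pass 1: resolve the source ident, giving (source_doi, target_ident).
    half = merge_join(sorted(refs), idents)
    # Pass 2: re-key on the target ident, sort again, resolve the target.
    by_target = sorted((target, source_doi) for source_doi, target in half)
    for target_doi, source_doi in merge_join(by_target, idents):
        yield source_doi, target_doi


refs = [("52znjflg2bdd5h2q2icu3zjhki", "mz6dkakhknd47h3skd7ttomwga")]
idents = [
    ("52znjflg2bdd5h2q2icu3zjhki", "10.3846/16111699.2012.720591"),
    ("mz6dkakhknd47h3skd7ttomwga", "10.1016/0024-6301(96)00041-6"),
]
result = list(doi_pairs(refs, idents))
print(result)
```

Two passes are needed because the reference pairs can only be sorted by one ident at a time. Pairs whose source or target has no DOI are silently dropped in this sketch.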

Unify the CSV files into a single compressed file and count lines:

```
$ zstdcat -T0 6741422v9.csv.zst | wc -l
759516506
```
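The notes do not record how the files were unified. One possible way, shown as a sketch only: concatenate the CSV files and keep the header line once. The function name `concat_csv` and the `extracted/*.csv` glob are assumptions, as is the expectation that every file starts with the same header.

```python
import glob
import sys


def concat_csv(paths, out):
    """Write the rows of all files in paths to out, keeping one header.

    Returns the number of data rows written (header excluded).
    """
    header = None
    rows = 0
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            first = f.readline()
            if not first:
                continue  # skip empty files
            if not first.endswith("\n"):
                first += "\n"
            if header is None:
                header = first
                out.write(header)
            elif first != header:
                raise ValueError(f"unexpected header in {path}: {first!r}")
            for line in f:
                # Guard against a missing newline at the end of a file.
                if not line.endswith("\n"):
                    line += "\n"
                out.write(line)
                rows += 1
    return rows


if __name__ == "__main__":
    # Hypothetical location of the extracted CSV files.
    count = concat_csv(sorted(glob.glob("extracted/*.csv")), sys.stdout)
    print(f"{count} rows", file=sys.stderr)
```

The output could then be piped through `zstd` to produce a file like the one counted above.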

Nomenclature:

* citing = source
* cited = target

Example:

```
10.3846/16111699.2012.720591,10.1016/0024-6301(96)00041-6
```

> citing: 10.3846/16111699.2012.720591, https://fatcat.wiki/release/52znjflg2bdd5h2q2icu3zjhki
>
> cited: 10.1016/0024-6301(96)00041-6, https://fatcat.wiki/release/mz6dkakhknd47h3skd7ttomwga

```
$ curl -s "localhost:9200/fatcat_ref_v02_20210716/_search?q=source_release_ident:52znjflg2bdd5h2q2icu3zjhki+AND+target_release_ident:mz6dkakhknd47h3skd7ttomwga" | jq .
{
  "took": 259,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 32.16953,
    "hits": [
      {
        "_index": "fatcat_ref_v02_20210716",
        "_type": "_doc",
        "_id": "52znjflg2bdd5h2q2icu3zjhki_2",
        "_score": 32.16953,
        "_source": {
          "indexed_ts": "2021-07-10T12:04:57Z",
          "match_provenance": "crossref",
          "match_reason": "doi",
          "match_status": "exact",
          "ref_index": 2,
          "ref_key": "cit0005",
          "source_release_ident": "52znjflg2bdd5h2q2icu3zjhki",
          "source_work_ident": "76yenkekovfh5bnvuxwvtvxy5q",
          "source_year": "2014",
          "target_release_ident": "mz6dkakhknd47h3skd7ttomwga",
          "target_work_ident": "um37w3kdcnhqvnp5jeh3mvhumy"
        }
      }
    ]
  }
}
```
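To run this lookup for many pairs, the query and the response check could be scripted. A rough sketch, with assumptions: the helper names `search_url` and `summarize` are hypothetical, the host and index defaults are copied from the curl call above, and the sample response is a trimmed copy of the output shown. No request is sent here; the URL would be fetched with any HTTP client.

```python
import json
from urllib.parse import urlencode

INDEX = "fatcat_ref_v02_20210716"  # index name from the curl call above


def search_url(source, target, host="http://localhost:9200", index=INDEX):
    """Build the _search URL for one (source, target) release ident pair."""
    q = f"source_release_ident:{source} AND target_release_ident:{target}"
    return f"{host}/{index}/_search?" + urlencode({"q": q})


def summarize(response):
    """Return (total hits, [(match_status, match_reason, match_provenance)])."""
    hits = response["hits"]
    matches = [
        (
            h["_source"]["match_status"],
            h["_source"]["match_reason"],
            h["_source"]["match_provenance"],
        )
        for h in hits["hits"]
    ]
    return hits["total"]["value"], matches


# Trimmed copy of the response shown above.
RESPONSE = json.loads("""
{
  "hits": {
    "total": {"value": 1, "relation": "eq"},
    "hits": [
      {
        "_id": "52znjflg2bdd5h2q2icu3zjhki_2",
        "_source": {
          "match_provenance": "crossref",
          "match_reason": "doi",
          "match_status": "exact",
          "source_release_ident": "52znjflg2bdd5h2q2icu3zjhki",
          "target_release_ident": "mz6dkakhknd47h3skd7ttomwga"
        }
      }
    ]
  }
}
""")

url = search_url("52znjflg2bdd5h2q2icu3zjhki", "mz6dkakhknd47h3skd7ttomwga")
total, matches = summarize(RESPONSE)
print(url)
print(total, matches)
```

A total of zero for a pair that appears in COCI would mark a citation link missing from the index.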