path: root/notes/2022_01_10_refcat_update.md
# Refcat update

* new refs export, about 10% more (2.7B)
* new fatcat export

New Wikipedia extraction, with DOI counts for the previous (2020-07-14) and the new (enwiki-20211201) dataset:

```
martin@ia601101:/magna/data/wikipedia_citations_2020-07-14 $ LC_ALL=C grep ID_list minimal_dataset.json | grep -c DOI
1442189

$ jq -rc '.refs[] | select(.ID_list != null) | {"URL": .URL, "Title": .title, "ID_list": .ID_list}' enwiki-20211201-pages-articles.citations.json | pv -l  > minimal.json
$ grep -c DOI minimal.json
1932578
```
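For reference, a minimal Python sketch of what the `jq` filter above does: keep only refs with a non-null `ID_list` and project them to `URL`, `Title`, `ID_list`. The function names and the sample document (including the shape of the `ID_list` value) are made up for illustration; only the field names come from the command above.

```python
import json


def minimal_refs(doc):
    """Yield minimal records for one parsed document with a "refs" list.

    Mirrors: .refs[] | select(.ID_list != null) | {URL, Title, ID_list}
    Note: jq fails on a missing "refs" key; this sketch just yields nothing.
    """
    for ref in doc.get("refs", []):
        if ref.get("ID_list") is not None:
            yield {
                "URL": ref.get("URL"),      # missing key becomes None (jq: null)
                "Title": ref.get("title"),  # input key is lowercase "title"
                "ID_list": ref["ID_list"],
            }


def count_doi(records):
    """Count serialized records containing "DOI", like `grep -c DOI`."""
    return sum(1 for rec in records if "DOI" in json.dumps(rec))


# Hypothetical sample document; the ID_list values are illustrative only.
sample = {
    "refs": [
        {"URL": "https://example.org/a", "title": "A", "ID_list": "{DOI=10.1000/example}"},
        {"URL": "https://example.org/b", "title": "B"},
        {"title": "C", "ID_list": "{PMID=123}"},
    ]
}

records = list(minimal_refs(sample))
for rec in records:
    print(json.dumps(rec))
print("with DOI:", count_doi(records))
```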

Convert the new extraction to the existing minimal format, as input for the "BrefZipWikiDOI" task.

First result, bref combined.

Previous version:

```
$ time zstdcat -T0 date-2021-07-28.json.zst |pv -l|wc -lc
2.08G 0:45:56 [ 753k/s] [ <=> ]
2077597833 981406745860
```

Current:

```
$ zstdcat -T0 date-2022-01-03.json.zst | pv -l | wc -lc
2.28G 0:37:55 [1.00M/s] [ <=> ]
2282864413 1077436490574
```

* 2,282,864,413 edges (matched and unmatched)
* 1,077,436,490,574 bytes uncompressed, about 1T

About 11G more compressed, about 96G more uncompressed (1,077,436,490,574 vs.
981,406,745,860 bytes). Estimated from a 100M sample: 1.483B matches (match
ratio 0.65).
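A rough sketch of the extrapolation, assuming the estimate is simply the match ratio observed in the sample multiplied by the total edge count. The function name is made up, and the ratio is the rounded 0.65 from the note, so the result only approximately reproduces 1.483B.

```python
def estimate_matches(total_edges, sample_ratio):
    """Extrapolate the number of matched edges from a sample match ratio."""
    return round(total_edges * sample_ratio)


total_edges = 2_282_864_413  # line count of date-2022-01-03.json.zst
sample_ratio = 0.65          # rounded ratio from the 100M sample

estimate = estimate_matches(total_edges, sample_ratio)
print(f"estimated matches: {estimate / 1e9:.3f}B")

# Uncompressed size difference between the two exports, in bytes.
size_diff = 1_077_436_490_574 - 981_406_745_860
print(f"additional uncompressed data: {size_diff / 1e9:.0f}G")
```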

Previous (v1):

* 1,323,423,672 matches; estimate for v2 based on file size: 1.439B matches.

Current (v2):

* 1,481,079,426 matches (76,235,927 strong, 1,404,843,499 exact; fuzzy matches still about 5% of the total)

Diff:

* about 12% increase in number of matched edges (about 10% in total edges)
* latest (v12) OCI: 1,235,170,583 (so refcat is about 20% larger with 1,481,079,426)
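
The percentages in this list can be recomputed from the counts given earlier in the note. A small sketch; the helper name `pct_increase` is made up, the numbers are taken from the note.

```python
def pct_increase(new, old):
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100


edges_v1, edges_v2 = 2_077_597_833, 2_282_864_413      # all edges
matches_v1, matches_v2 = 1_323_423_672, 1_481_079_426  # matched edges
oci_v12 = 1_235_170_583                                # latest OCI

print(f"all edges:      +{pct_increase(edges_v2, edges_v1):.1f}%")
print(f"matched edges:  +{pct_increase(matches_v2, matches_v1):.1f}%")
print(f"refcat vs. OCI: +{pct_increase(matches_v2, oci_v12):.1f}%")

# The strong and exact counts add up to the total number of matches.
assert 76_235_927 + 1_404_843_499 == matches_v2
```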