# Version 2 (2021-02-18)

As the target document, per `proposals/2021-01-29_citation_api.md`, we want the following:

```
BiblioRef ("bibliographic reference")
    _key: Optional[str] elasticsearch doc key
        ("release", source_release_ident, ref_index)
        ("wikipedia", source_wikipedia_article, ref_index)
    update_ts: Optional[datetime] elasticsearch doc timestamp

    # metadata about source of reference
    source_release_ident: Optional[str]
    source_work_ident: Optional[str]
    source_wikipedia_article: Optional[str]
        with lang prefix like "en:Superglue"
    # skipped: source_openlibrary_work
    # skipped: source_url_surt
    source_release_stage: Optional[str]
    source_year: Optional[int]

    # context of the reference itself
    ref_index: int
        1-indexed, not 0-indexed
    ref_key: Optional[str]
        eg, "Lee86", "BIB23"
    ref_locator: Optional[str]
        eg, page number

    # target of reference (identifiers)
    target_release_ident: Optional[str]
    target_work_ident: Optional[str]
    target_openlibrary_work: Optional[str]
    target_url_surt: Optional[str]
    target_url: Optional[str]
        would not be stored in elasticsearch, but would be auto-generated
        by all "get" methods from the SURT, so calling code does not need
        to do SURT transform
    # skipped: target_wikipedia_article

    match_provenance: str
        crossref, pubmed, grobid, etc
    match_status: Optional[str]
        strong, weak, etc
        TODO: "match_strength"?
    match_reason: Optional[str]
        "doi", "isbn", "fuzzy title, author", etc
        maybe "fuzzy-title-author"?

    target_unstructured: str (only if no release_ident link/match)
    target_csl: free-form JSON (only if no release_ident link/match)
        CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
        generated from unstructured by a GROBID parse, if needed
```
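
To make the schema concrete, here is a rough Python sketch of it as a dataclass; field names follow the proposal above, but the exact types and defaults are only a guess:

```
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional

@dataclass
class BiblioRef:
    """Bibliographic reference, cf. proposals/2021-01-29_citation_api.md."""
    # elasticsearch doc key and timestamp
    _key: Optional[str] = None
    update_ts: Optional[datetime] = None
    # metadata about the source of the reference
    source_release_ident: Optional[str] = None
    source_work_ident: Optional[str] = None
    source_wikipedia_article: Optional[str] = None  # lang prefix, e.g. "en:Superglue"
    source_release_stage: Optional[str] = None
    source_year: Optional[int] = None
    # context of the reference itself
    ref_index: int = 1  # 1-indexed, not 0-indexed
    ref_key: Optional[str] = None  # e.g. "Lee86", "BIB23"
    ref_locator: Optional[str] = None  # e.g. page number
    # target of the reference (identifiers)
    target_release_ident: Optional[str] = None
    target_work_ident: Optional[str] = None
    target_openlibrary_work: Optional[str] = None
    target_url_surt: Optional[str] = None
    target_url: Optional[str] = None  # not stored in ES; derived from the SURT
    # match metadata
    match_provenance: str = ""  # crossref, pubmed, grobid, etc.
    match_status: Optional[str] = None  # strong, weak, etc.
    match_reason: Optional[str] = None  # "doi", "isbn", "fuzzy title, author", etc.
    # fallback fields, only when no release_ident link/match exists
    target_unstructured: Optional[str] = None
    target_csl: Optional[Dict[str, Any]] = None  # CSL-JSON
```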

The resulting docs/index will be generated from various pipelines:

* various identifier joins (doi, pmid, pmcid, arxiv, ...)
* a fuzzy matching pipeline
* a Wikipedia "scan" over publications, by DOI, title, or direct link
* an Open Library "scan", matching e.g. ISBNs or book titles against the catalog
* relating a source document to all its referenced web pages (as `target_url`)

The raw inputs:

* release export (expanded or minimized)
* an aggregated list of references
* Wikipedia dumps, e.g. en, de, fr, es, ...
* an Open Library dump
* auxiliary data structures, e.g. a journal name lookup database (abbreviations), etc.
* MAG, BASE, AMiner, and other datasets to run comparisons against

# Setup and deployment

* [-] clone this repo
* [x] copy "zipapp"
* [x] setup raw inputs in settings.ini
* [x] run task

We use shiv to create a single-file deployment, driven by a single config file,
with a handle to list and inspect files. Keep it minimal; external tools live in skate.
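
As a sketch of the single-config-file idea (the section and key names, and the paths, are hypothetical, not the actual settings.ini layout):

```
import configparser

# Hypothetical settings.ini layout, for illustration only; the real
# section and key names may differ.
config = configparser.ConfigParser()
config.read_string("""
[core]
release-export = /bigger/data/release_export_expanded.json.zst
refs = /bigger/data/refs.json.zst
""")

print(config["core"]["release-export"])
```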

----

# Match with more complete data

* [x] more sensible switching between inputs (e.g. sample, full, etc.)

For joins.

* [x] reduce release entities to minimum (ReleaseEntityReduced)

Reduced 120G to 48G, a big win (stripping files, refs, and container extras); 154,203,375 docs (12min to count)

* [ ] extract not to (ident, value), but to (ident, value, doc) or the like
* [ ] the joined row should contain both metadata blobs, to generate the fuller schema

## Zipped Merge

We need:

* refs to releases, derive key, sort
* reduced releases, derive key, sort

* [ ] sort fatcat and refs by key
* [ ] zipped iteration over both docs (and run verify); see the sketch below
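
A minimal sketch of this zipped iteration in Python (the actual implementation belongs in skate, in Go); it assumes both inputs are key-sorted JSON lines carrying a "key" field, as in the generic (id, key, doc) format described further below:

```
import itertools
import json

def read_groups(f):
    # Yield (key, [rows]) groups from a key-sorted JSON lines stream.
    rows = (json.loads(line) for line in f)
    for key, group in itertools.groupby(rows, key=lambda row: row["key"]):
        yield key, list(group)

def zipped(fa, fb, verify):
    # comm(1)-style iteration over two key-sorted streams; "verify" runs on
    # every pair of groups sharing a key. Keys present in only one stream
    # are skipped.
    ga, gb = read_groups(fa), read_groups(fb)
    a, b = next(ga, None), next(gb, None)
    while a is not None and b is not None:
        if a[0] < b[0]:
            a = next(ga, None)  # key only in the first stream
        elif a[0] > b[0]:
            b = next(gb, None)  # key only in the second stream
        else:
            verify(a[1], b[1])
            a, b = next(ga, None), next(gb, None)
```

Because each row carries the whole doc, the verify step has everything it needs to emit a fuller merged result (e.g. a BiblioRef) without a separate lookup.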

----

# Other datasets

* [ ] https://archive.org/details/enwiki-20210120, example: https://archive.org/download/enwiki-20210120/enwiki-20210120-pages-articles-multistream11.xml-p6899367p7054859.bz2

----

## Zipped Verification

* besides a one-blob-per-line model, we can run a "comm"-like procedure to verify groups (or run any other routine on groups), as in the sketch above

Advantages of zip mode:

* we only need to generate sorted datasets; we can skip the separate "group by" transform
* easier to carry the whole doc around, which is what we want, to generate a
  more complete result document

```
$ skate-verify -m zip -R <(zstdcat -T0 /bigger/.cache/refcat/FatcatSortedKeys/dataset-full-date-2021-02-20.json.zst) \
    -F <(zstdcat -T0 /bigger/.cache/refcat/RefsSortedKeys/dataset-full-date-2021-02-20.json.zst)
```

A basic framework in Go for doing zipped iteration.

* we need the generic (id, key, doc) format; maybe just a jq tweak (see the sketch below)
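
For example, deriving that format from a release export could look like the following sketch (normalizing on the DOI as key is an assumption here; any identifier works the same way):

```
import json
import sys

# Read release entities from stdin, emit (ident, key, doc) rows on stdout.
for line in sys.stdin:
    doc = json.loads(line)
    doi = (doc.get("ext_ids") or {}).get("doi")
    if not doi:
        continue
    print(json.dumps({"ident": doc["ident"], "key": doi.lower(), "doc": doc}))
```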

----

Example of the size increase from carrying the full doc into the key-matching step: about 10x (3G to 30G compressed).

----

Putting pieces together:

* 620,626,126 DOI "join"
* 23,280,469 fuzzy
* 76,382,408 pmid
* 49,479 pmcid
* 3,011,747 arxiv

COCI/crossref currently has:

* 759,516,507 citation links.
* we have ~723,350,228 (roughly the sum of the pieces above; quick check below)
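
As a quick arithmetic check, the per-pipeline counts above sum to:

```
>>> 620_626_126 + 23_280_469 + 76_382_408 + 49_479 + 3_011_747
723350229
```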

```
$ zstdcat -T0 /bigger/.cache/refcat/BiblioRefV1/dataset-full-date-2021-02-20.json.zst|LC_ALL=C wc
717435777 717462400 281422956549
```

----

Some notes on unparsed data:

```
    "unstructured": "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"

$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | \
    jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured' | \
    head -1000000 | grep -c -E ' [0-9]{1,3}-[0-9]{1,3}'
```

* 4400/100000 (~4.4%); 5% of 500M would still be 25M?

* pattern matching? see the sketch after the example below

```
$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured'
```
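
A sketch of what simple pattern matching over these unstructured strings could look like; the patterns here (a page range and a year in parentheses) are illustrative heuristics, not a final rule set:

```
import re

# Illustrative heuristics over unstructured reference strings.
PAGE_RANGE = re.compile(r" [0-9]{1,3}-[0-9]{1,3}")  # same pattern as the grep above
YEAR = re.compile(r"\((1[89]|20)[0-9]{2}\)")        # e.g. "(1978)"

ref = "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"
print(bool(PAGE_RANGE.search(ref)), bool(YEAR.search(ref)))  # False True
```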

Data lineage for "v2":

```
$ refcat.pyz deps BiblioRefV2
 \_ BiblioRefV2(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyPMID(dataset=full, date=2021-02-20)
       \_ FatcatPMID(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
       \_ RefsPMID(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefFromFuzzyClusters(dataset=full, date=2021-02-20)
       \_ RefsFatcatClusters(dataset=full, date=2021-02-20)
          \_ RefsFatcatSortedKeys(dataset=full, date=2021-02-20)
             \_ RefsReleasesMerged(dataset=full, date=2021-02-20)
                \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
                   \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
                \_ RefsToRelease(dataset=full, date=2021-02-20)
                   \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyPMCID(dataset=full, date=2021-02-20)
       \_ RefsPMCID(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
       \_ FatcatPMCID(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyDOI(dataset=full, date=2021-02-20)
       \_ FatcatDOI(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
       \_ RefsDOI(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyArxiv(dataset=full, date=2021-02-20)
       \_ RefsArxiv(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
       \_ FatcatArxiv(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
```