# V3

V2 plus:

* [ ] wikipedia
* [ ] some unstructured refs
* [ ] OL
* [ ] weblinks

## Unstructured

* about 300M w/o title, etc.
* some docs mention a "doi" in "unstructured"

Possible extractable information:

* page ranges with regex
* doi, isbn, issn
* author names with some NER?
* journal abbreviation
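
A minimal sketch of the regex part of this list, in Python. The patterns and the `extract_hints` helper are assumptions for illustration, not code from skate. An ISSN looks like a page range, so matched identifiers are removed before looking for pages.

```python
import re

# Heuristic patterns (assumptions, not taken from skate or fatcat).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_RE = re.compile(r"\barxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)
# Require an "ISSN" label, since a bare NNNN-NNNN could also be a page range.
ISSN_RE = re.compile(r"\bissn[:\s]*(\d{4}-\d{3}[\dXx])\b", re.IGNORECASE)
# Two numbers joined by a hyphen or en dash, e.g. "45-58" or "45 \u2013 58".
PAGES_RE = re.compile(r"\b(\d{1,5})\s*[-\u2013]\s*(\d{1,5})\b")


def extract_hints(unstructured: str) -> dict:
    """Return whatever identifiers and page ranges the patterns can find."""
    hints = {}
    rest = unstructured

    m = DOI_RE.search(rest)
    if m:
        # Trailing punctuation is usually sentence residue, not part of the DOI.
        hints["doi"] = m.group(0).rstrip(".,;)")
        rest = rest.replace(m.group(0), " ")

    m = ARXIV_RE.search(rest)
    if m:
        hints["arxiv_id"] = m.group(1)
        rest = rest.replace(m.group(0), " ")

    m = ISSN_RE.search(rest)
    if m:
        hints["issn"] = m.group(1).upper()
        rest = rest.replace(m.group(0), " ")

    # Only look for pages in what is left, to avoid matching inside identifiers.
    m = PAGES_RE.search(rest)
    if m:
        hints["pages"] = f"{m.group(1)}-{m.group(2)}"
    return hints
```

Year ranges such as "1990-1995" would still be mistaken for pages, so the output should be treated as a hint only.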

Numbers:

$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | LC_ALL=C pv -l | LC_ALL=C grep -c -i "doi"
2772622

Sometimes, the key contains an ISBN:

```
"key":"9781108604222#EMT-rl-1_BIBe-r-213"
```

Key with DOI:

```
"index":63,"key":"10.1002/9781118960608.gbm01177-BIB6970|gbm01177-cit-6970","locator":"7
```

ISBN format:

* 978-9279113639
* 9781566773362
* 978-80-7357-299-0
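
A sketch of how these variants could be normalized to a bare 13-digit form, both from free text and from a key. The helper names are hypothetical. The check digit test is the standard ISBN-13 rule (weights 1 and 3 alternating). A checksum failure may just mean a typo in the source reference.

```python
import re

# ISBN-13 candidate: 978/979 prefix plus ten more digits, with optional
# hyphens or spaces between digits.
ISBN13_RE = re.compile(r"\b97[89](?:[- ]?\d){10}\b")


def normalize_isbn13(raw: str) -> str:
    """Drop hyphens and spaces, e.g. '978-80-7357-299-0' -> '9788073572990'."""
    return re.sub(r"[- ]", "", raw)


def isbn13_checksum_ok(isbn: str) -> bool:
    """Validate the ISBN-13 check digit of a normalized, 13-digit string."""
    if not re.fullmatch(r"[0-9]{13}", isbn):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(isbn[:12]))
    return (10 - total % 10) % 10 == int(isbn[12])


def find_isbn13(text: str) -> list:
    """Return all normalized ISBN-13 candidates found in text (or in a key)."""
    return [normalize_isbn13(m.group(0)) for m in ISBN13_RE.finditer(text)]
```

For example, `find_isbn13("9781108604222#EMT-rl-1_BIBe-r-213")` picks the ISBN out of a key like the one shown earlier.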

URLs may be broken:

```
http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf
```
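
A possible repair for this particular breakage, assuming that whitespace directly after the scheme or after a leading `www.` is an extraction artifact. `repair_url` is a hypothetical helper. Other kinds of damage (spaces inside the path, line breaks) are not handled.

```python
import re

# Whitespace right after "http(s)://" or "http(s)://www." followed by more text.
BROKEN_HOST_RE = re.compile(r"(https?://(?:www\.)?)\s+(?=\S)", re.IGNORECASE)


def repair_url(text: str) -> str:
    """Remove stray whitespace between the scheme (or 'www.') and the host."""
    return BROKEN_HOST_RE.sub(r"\1", text)
```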

* 2030021 DOI
* 36376 arxiv

Some cases only contain authors and year, e.g.

```
{
  "biblio": {
    "contrib_raw_names": [
      "W H Hartmann",
      "B H Hahn",
      "H Abbey",
      "L E Shulman"
    ],
    "unstructured": "Hartmann, W. H., Hahn, B. H., Abbey, H., and Shulman, L. E., Lancer, 1965, 1, 123.",
    "year": 1965
  },
```

Here, we could run a query, e.g.
https://fatcat.wiki/release/search?q=hahn+shulman+abbey+hartmann, and check for
result set size, year, etc.

Another example:

* https://fatcat.wiki/release/search?q=Goudie+Anderson+Gray+boyle+buchanan+year%3A1965

```
{
  "biblio": {
    "contrib_raw_names": [
      "R B Goudie",
      "J R Anderson",
      "K G Gray",
      "J A Boyle",
      "W W Buchanar"
    ],
    "unstructured": "Goudie, R. B., Anderson, J. R., Gray, K. G., Boyle, J. A., and Buchanar, W. W., ibid., 1965, 1, 322.",
    "year": 1965
  },
```
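
Such search URLs could be built from the `biblio` part of a ref. A sketch: `search_url` and `surname` are hypothetical helpers, and it assumes the family name is the last token of each raw name. It only builds the URL and does not run the query.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://fatcat.wiki/release/search"


def surname(raw_name: str) -> str:
    """'W H Hartmann' -> 'Hartmann' (assumes the family name comes last)."""
    parts = raw_name.split()
    return parts[-1] if parts else ""


def search_url(biblio: dict) -> str:
    """Build a release search URL from author surnames and, if present, the year."""
    terms = [surname(name) for name in biblio.get("contrib_raw_names") or []]
    terms = [t for t in terms if t]
    if biblio.get("year"):
        terms.append(f"year:{biblio['year']}")
    return SEARCH_URL + "?" + urlencode({"q": " ".join(terms)})
```

The result set size and the year of the hits would then have to be checked in a separate step. Misspelled names in the ref (e.g. "Buchanar") will carry over into the query.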

----

With `skate-from-unstructured` we get some more doi and arxiv identifiers from
unstructured refs (unstructured, key). How many?

```
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | pv -l | \
    skate-from-unstructured | jq -rc 'select(.biblio.doi != null or .biblio.arxiv_id != null)' | wc -l
```
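
The `jq` filter and `wc -l` could also be expressed in Python, e.g. to extend the count later (per-identifier counts and so on). A sketch with hypothetical function names. It expects one JSON document per line, as emitted by the pipeline above.

```python
import json


def has_identifier(doc: dict) -> bool:
    """Mirror of the jq filter: .biblio.doi != null or .biblio.arxiv_id != null."""
    biblio = doc.get("biblio") or {}
    return biblio.get("doi") is not None or biblio.get("arxiv_id") is not None


def count_with_identifier(lines) -> int:
    """Count JSON lines whose biblio carries a DOI or an arXiv id."""
    return sum(
        1 for line in lines if line.strip() and has_identifier(json.loads(line))
    )
```

Passing `sys.stdin` as `lines` would allow using it at the end of the pipe.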

The https://anystyle.io/ CRF implementation seems really useful to parse out
the rest of the unstructured data.

* [ ] parse fields with a containerized anystyle (create an OCI container
  and somehow get it running w/ or w/o docker; maybe podman can be used as a
  library?)

Example:

```
$ anystyle -f json parse xxxx.txt
[
  {
    "citation-number": [
      "3. "
    ],
    "author": [
      {
        "family": "JP",
        "given": "Morgan"
      },
      {
        "family": "CS",
        "given": "Bailey"
      }
    ],
    "title": [
      "Cauda equina syndrome in the dog: radiographical evaluation"
    ],
    "volume": [
      "21"
    ],
    "pages": [
      "45 – 58"
    ],
    "type": "article-journal",
    "container-title": [
      "J Small Anim Practice"
    ],
    "date": [
      "1980"
    ]
  }
]
```
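
The anystyle output wraps most values in lists, so it would need flattening before it can be merged back into a ref. A sketch, assuming the output shape shown above. The target field names other than `contrib_raw_names` and `year` (`title`, `volume`, `pages`, `container_name`) are assumptions, and `to_biblio` is a hypothetical helper.

```python
import re

DASH_RE = re.compile(r"\s*[\u2013\u2014-]\s*")  # en dash, em dash or hyphen
YEAR_RE = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")


def first(record: dict, key: str):
    """anystyle wraps most values in a list; take the first item, if any."""
    values = record.get(key) or []
    return values[0] if values else None


def to_biblio(record: dict) -> dict:
    """Flatten one parsed anystyle record into a biblio-like dict."""
    biblio = {}
    for src, dst in (
        ("title", "title"),
        ("volume", "volume"),
        ("container-title", "container_name"),
    ):
        value = first(record, src)
        if value:
            biblio[dst] = value.strip()
    pages = first(record, "pages")
    if pages:
        # "45 \u2013 58" -> "45-58"
        biblio["pages"] = DASH_RE.sub("-", pages.strip())
    date = first(record, "date")
    year = YEAR_RE.search(date) if date else None
    if year:
        biblio["year"] = int(year.group(0))
    names = []
    for author in record.get("author") or []:
        parts = [author.get("given"), author.get("family")]
        name = " ".join(p for p in parts if p)
        if name:
            names.append(name)
    if names:
        biblio["contrib_raw_names"] = names
    return biblio
```

Note that in the example above `family` and `given` appear to be swapped ("JP" as family name), so author fields from the parser may need a sanity check.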

Can dump the whole unstructured list into a single file (one reference per line).

* 10K lines take 32s
* at that rate, 100M lines would probably take ~100h to parse (about 89h when extrapolating linearly)
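
The extrapolation, as a small calculation. It assumes the per-line cost stays constant. The `workers` parameter is a further assumption (perfectly linear speedup when splitting the input), which has not been measured.

```python
def estimate_hours(total_lines: int,
                   sample_lines: int = 10_000,
                   sample_seconds: float = 32.0,
                   workers: int = 1) -> float:
    """Linear extrapolation from a timed sample to the full input, in hours."""
    seconds_per_line = sample_seconds / sample_lines
    return total_lines * seconds_per_line / workers / 3600


# 100M lines at 3.2 ms per line is a bit under 90 hours in a single process.
print(round(estimate_hours(100_000_000), 1))
```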

----

* from 308 "UnmatchedRefs" we would extract doi/arxiv for 47696153.

Stats:

* 759,516,507 citation links.
* ~723,350,228 + 47,696,153 = 771,046,381 edges

----

* aitio has docker installed

```
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false
```

Maybe build an alpine based image?

Both anystyle and grobid use wapiti under the hood, but they seem to differ
slightly. anystyle seems to be the smaller codebase overall. Grobid has an API
and various modes.

Note-to-self: Run a comparison between wapiti-based citation extractors.

----

```
$ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL=C wc -l
260768384
```

----

## Wikipedia

* /magna/data/wikipedia_citations_2020-07-14

A first run only got 64008 docs; it is improbable that we are missing that many DOIs.

Also, need to generalize some skate code a bit.

----

## Verification stats

* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
* have X clusters of size less than 10

```
$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst |
    jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
```
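
Instead of one pass per threshold, a size histogram could be computed once and queried for any limit. A sketch with hypothetical function names, assuming each line is a JSON document whose `v` field holds the cluster members (as the `jq` filter above implies).

```python
import json
from collections import Counter


def cluster_size_histogram(lines) -> Counter:
    """Map cluster size -> number of clusters, from JSON lines with a 'v' list."""
    sizes = Counter()
    for line in lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        sizes[len(doc.get("v") or [])] += 1
    return sizes


def count_smaller_than(sizes: Counter, limit: int) -> int:
    """Number of clusters with fewer than `limit` members."""
    return sum(n for size, n in sizes.items() if size < limit)
```

With `sizes = cluster_size_histogram(sys.stdin)`, `count_smaller_than(sizes, 10)` would give the X above.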

A 5M sample.

```
$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous
```