# V3

V2 plus:

* [ ] no dups
* [ ] unmatched
* [ ] wikipedia
* [ ] some unstrucutured refs
* [ ] OL
* [ ] weblinks

## Duplicates

```
$ zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | jq -rc 'select(.source_release_ident == .target_release_ident)'
```

Only 0.001% though.

## Unstructured

* about 300M w/o title, etc.
* some docs mention a "doi" in "unstructured"

Possible extractable information:

* pages ranges with regex
* doi, isbn, issn
* author names with some NER?
* journal abbreviation

Numbers:

$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | LC_ALL=C pv -l | LC_ALL=C grep -c -i "doi"
2772622

Sometimes, the key contains an ISBN:

```
"key":"9781108604222#EMT-rl-1_BIBe-r-213"
```

key with doi:

```
"index":63,"key":"10.1002/9781118960608.gbm01177-BIB6970|gbm01177-cit-6970","locator":"7
```

ISBN format:

* 978-9279113639
* 9781566773362
* 978-80-7357-299-0

URLs may be broken:

```
http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf
```

* 2030021 DOI
* 36376 arxiv

Some cases only contain authors and year, e.g.

```
{
  "biblio": {
    "contrib_raw_names": [
      "W H Hartmann",
      "B H Hahn",
      "H Abbey",
      "L E Shulman"
    ],
    "unstructured": "Hartmann, W. H., Hahn, B. H., Abbey, H., and Shulman, L. E., Lancer, 1965, 1, 123.",
    "year": 1965
  },
```

Here, we could run a query, e.g.
https://fatcat.wiki/release/search?q=hahn+shulman+abbey+hartmann, and check for
result set size, year, etc.

Other example:

* https://fatcat.wiki/release/search?q=Goudie+Anderson+Gray+boyle+buchanan+year%3A1965

```
{
  "biblio": {
    "contrib_raw_names": [
      "R B Goudie",
      "J R Anderson",
      "K G Gray",
      "J A Boyle",
      "W W Buchanar"
    ],
    "unstructured": "Goudie, R. B., Anderson, J. R., Gray, K. G., Boyle, J. A., and Buchanar, W. W., ibid., 1965, 1, 322.",
    "year": 1965
  },
```

----

With `skate-from-unstructured` we get some more doi and arxiv identifiers from
unstructured refs (unstructured, key). How many?

```
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | pv -l | \
    skate-from-unstructured | jq -rc 'select(.biblio.doi != null or .biblio.arxiv_id != null)' | wc -l
```

The https://anystyle.io/ CRF implementation seems really useful to parse out
the rest of the unstructured data.

* [ ] parse fields with some containerized anystyle (create an oci container
  and somehow get it running w/ or w/o docker; maybe podman allows to run as
library?)

Example:

```
$ anystyle -f json parse xxxx.txt
[
  {
    "citation-number": [
      "3. "
    ],
    "author": [
      {
        "family": "JP",
        "given": "Morgan"
      },
      {
        "family": "CS",
        "given": "Bailey"
      }
    ],
    "title": [
      "Cauda equina syndrome in the dog: radiographical evaluation"
    ],
    "volume": [
      "21"
    ],
    "pages": [
      "45 – 58"
    ],
    "type": "article-journal",
    "container-title": [
      "J Small Anim Practice"
    ],
    "date": [
      "1980"
    ]
  }
]
```

Can dump the whole unstructured list in to a single file (one per line).

* 10K lines take: 32s
* 100M would take probably ~100h to parse.

----

* from 308 "UnmatchedRefs" we would extract doi/arxiv for 47696153.

Stats:

* 759,516,507 citation links.
* ~723,350,228 + 47,696,153
* 771046381 edges

----

* aitio has docker installed

```
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false
```

Maybe build an alpine based image?

Both anystyle and grobid use wapiti under the hood; but they seem to differ
slightly. anystyle seems to be a smaller codebase overall. Grobid has an api
and various modes.

Note-to-self: Run a comparison between wapiti based citation extractors.

----

```
$ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL=C wc -l
260768384
```

----

# Wikipedia

* /magna/data/wikipedia_citations_2020-07-14

A first run only got 64008 docs; improbable that we are missing so many doi.

Also, need to generalize some skate code a bit.

----

# Verification stats

* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
* have 29290668 clusters of size <= 10

```
$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst |
    jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
```

A 5M sample.

```
$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous
```

----

# Unmatched

* We want the unmatched refs as well, e.g. to display.

In order to do that offline, we would need to sort all matches by source and
the original refs file by source ident.

The iterate over both files and fill in the unmatched targets (unstructured, csl, ...)

Options:

* we have `source ident` and `ref_index` (+1)
* can sort biblioref by source ident
* can sort refs by source ident

That's almost the same, as the matching process, just another function working on the match group.

----

# OpenLibrary

* has isbns, 10, 13
* how many isbn in refs (?)


Sidenote, also in refs:

```
"title": "B l e u m e r, M. S t r a u s s. Divertible Protocols and Atomic Proxy Cryptography",
```

How many titles have "s p a c e s" in title?

----

ISBN normalization.

In refs, we mostly have ISBN in unstrcutured:

```
ISBN 3-906166-35-X.
ISBN 978-0- 470-25003-7.
Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org. (2007).
ISBN 88-13-19785-3
ISBN GB3N-CL4-5HL4.
```

About 600/1M "isbn" in unstructured.

```
$ zstdcat -T0 fatcat_scholar_work_fulltext.refs.json.zst | head -1000000 | jq .biblio.unstructured | grep -c -i isbn
675
```

So maybe 500k isbn in total?

* need to find them, then validate them

----

## Notes on URLList

* about 25M urls
* about 11075871 seem to have a "doi"

----

A subtle bug: a doi in refs ends with tab:

```
10.1002/andp.19975090102\t
```

----

## URL lookup via pig

* failed after a week; map spill

```
2021-05-21 15:04:25,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete
^C2021-05-24 15:22:57,073 [Thread-6] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ia802401.us.archive.org/207.241.228.181:6932
2021-05-24 15:22:58,245 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 64% complete
2021-05-24 15:22:58,778 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 71% complete
2021-05-24 15:23:02,763 [Thread-6] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Job job_pigexec_0 killed

real    8276m35.071s
user    425m6.748s
sys     52m21.012s
```