V3

V2 plus:

[ ] no dups
[ ] unmatched
[ ] wikipedia
[ ] some unstructured refs
[ ] OL
[ ] weblinks

Duplicates

$ zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | jq -rc 'select(.source_release_ident == .target_release_ident)'

Only 0.001% though.

Unstructured

about 300M w/o title, etc.
some docs mention a "doi" in "unstructured"

Possible extractable information:

page ranges with regex
doi, isbn, issn
author names with some NER?
journal abbreviation
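
A minimal sketch of the regex part of this list, in Python. The patterns are assumptions for illustration (not what skate uses) and will produce false positives, e.g. year ranges read as pages. Author names and journal abbreviations are left out, since those need more than a regex.

```python
import re

# Heuristic patterns (assumptions for illustration, not taken from skate).
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")
ISSN_RE = re.compile(r"\b\d{4}-\d{3}[\dXx]\b")
PAGES_RE = re.compile(r"\b(\d{1,5})\s*[-\u2013]\s*(\d{1,5})\b")


def extract_hints(unstructured):
    """Return identifiers and a page range found in a raw reference string."""
    hints = {}
    text = unstructured
    m = DOI_RE.search(text)
    if m:
        # Trailing punctuation is usually sentence debris, not part of the DOI.
        hints["doi"] = m.group(0).rstrip(".,;)")
        text = text.replace(m.group(0), " ")
    m = ISSN_RE.search(text)
    if m:
        hints["issn"] = m.group(0).upper()
        text = text.replace(m.group(0), " ")
    # Identifiers were blanked out above so they cannot be mistaken for pages.
    m = PAGES_RE.search(text)
    if m and int(m.group(1)) < int(m.group(2)):
        hints["pages"] = (int(m.group(1)), int(m.group(2)))
    return hints
```

Only the first hit per field is kept; a real pass would probably want all candidates plus some scoring.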

Numbers:

$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | LC_ALL=C pv -l | LC_ALL=C grep -c -i "doi"
2772622

Sometimes, the key contains an ISBN:

"key":"9781108604222#EMT-rl-1_BIBe-r-213"

key with doi:

"index":63,"key":"10.1002/9781118960608.gbm01177-BIB6970|gbm01177-cit-6970","locator":"7

ISBN format:

978-9279113639
9781566773362
978-80-7357-299-0

URLs may be broken:

http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf

2030021 DOI
36376 arxiv

Some cases only contain authors and year, e.g.

{
  "biblio": {
    "contrib_raw_names": [
      "W H Hartmann",
      "B H Hahn",
      "H Abbey",
      "L E Shulman"
    ],
    "unstructured": "Hartmann, W. H., Hahn, B. H., Abbey, H., and Shulman, L. E., Lancer, 1965, 1, 123.",
    "year": 1965
  },

Here, we could run a query, e.g. https://fatcat.wiki/release/search?q=hahn+shulman+abbey+hartmann, and check for result set size, year, etc.

Other example:

https://fatcat.wiki/release/search?q=Goudie+Anderson+Gray+boyle+buchanan+year%3A1965

{
  "biblio": {
    "contrib_raw_names": [
      "R B Goudie",
      "J R Anderson",
      "K G Gray",
      "J A Boyle",
      "W W Buchanar"
    ],
    "unstructured": "Goudie, R. B., Anderson, J. R., Gray, K. G., Boyle, J. A., and Buchanar, W. W., ibid., 1965, 1, 322.",
    "year": 1965
  },
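
The search URLs above could be generated from contrib_raw_names and year roughly like this. A sketch only: it assumes the surname is the last whitespace-separated token of each raw name, and it only builds the URL (no request is made). The function name is hypothetical.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://fatcat.wiki/release/search"


def search_url(contrib_raw_names, year=None):
    """Build a fatcat release search URL from raw author names and a year."""
    # Assumption: the last token of a raw name such as "B H Hahn" is the surname.
    terms = [name.split()[-1].lower() for name in contrib_raw_names if name.strip()]
    if year is not None:
        terms.append(f"year:{year}")
    # urlencode turns spaces into "+" and ":" into "%3A".
    return SEARCH_BASE + "?" + urlencode({"q": " ".join(terms)})
```

Result set size and year of the hits would still need to be checked after running the query.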

With skate-from-unstructured we get some more doi and arxiv identifiers from unstructured refs (unstructured, key). How many?

$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | pv -l | \
    skate-from-unstructured | jq -rc 'select(.biblio.doi != null or .biblio.arxiv_id != null)' | wc -l

The https://anystyle.io/ CRF implementation seems really useful to parse out the rest of the unstructured data.

[ ] parse fields with some containerized anystyle (create an OCI container and somehow get it running w/ or w/o docker; maybe podman can be used as a library?)

Example:

$ anystyle -f json parse xxxx.txt
[
  {
    "citation-number": [
      "3. "
    ],
    "author": [
      {
        "family": "JP",
        "given": "Morgan"
      },
      {
        "family": "CS",
        "given": "Bailey"
      }
    ],
    "title": [
      "Cauda equina syndrome in the dog: radiographical evaluation"
    ],
    "volume": [
      "21"
    ],
    "pages": [
      "45 – 58"
    ],
    "type": "article-journal",
    "container-title": [
      "J Small Anim Practice"
    ],
    "date": [
      "1980"
    ]
  }
]

Can dump the whole unstructured list into a single file (one reference per line).
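
A sketch of such a dump in Python, reading the JSON lines from any iterable (e.g. sys.stdin fed by zstdcat). It assumes the text lives under biblio.unstructured, as in the samples above, and collapses inner whitespace so each reference really stays on one line.

```python
import json


def unstructured_lines(json_lines):
    """Yield one whitespace-normalized unstructured string per input document."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        text = (doc.get("biblio") or {}).get("unstructured")
        if not text:
            continue
        # Newlines or tabs inside the string would break the one-per-line format.
        yield " ".join(text.split())
```

Usage idea: `for s in unstructured_lines(sys.stdin): print(s)` at the end of a zstdcat pipe.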

10K lines take: 32s
At that rate, 100M lines would take roughly 90h to parse.
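
Back-of-the-envelope check of the estimate above, as a small Python helper. The `workers` parameter is an assumption (perfectly linear scaling over parallel processes), not something that was measured.

```python
def estimated_hours(total_lines, sample_lines=10_000, sample_seconds=32.0, workers=1):
    """Extrapolate parse time from the measured 10K-lines-in-32s sample."""
    lines_per_second = sample_lines / sample_seconds  # 312.5 lines/s
    # Assumption: throughput scales linearly with the number of workers.
    return total_lines / (lines_per_second * workers) / 3600


print(round(estimated_hours(100_000_000), 1))  # 88.9
```

For the 260768384 lines counted in UnmatchedRefs further down, the same rate gives about 232h on a single process.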

from 308 "UnmatchedRefs" we would extract doi/arxiv for 47696153.

Stats:

759,516,507 citation links.
~723,350,228 + 47,696,153 = 771,046,381 edges

aitio has docker installed

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false

Maybe build an alpine based image?

Both anystyle and grobid use wapiti under the hood; but they seem to differ slightly. anystyle seems to be a smaller codebase overall. Grobid has an api and various modes.

Note-to-self: Run a comparison between wapiti based citation extractors.

$ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL=C wc -l
260768384

Wikipedia

/magna/data/wikipedia_citations_2020-07-14

A first run only got 64008 docs; improbable that we are missing so many doi.

Also, need to generalize some skate code a bit.

Verification stats

have 40257623 clusters, zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l
have 29290668 clusters of size < 10

$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst |
    jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l

A 5M sample.

$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous

Unmatched

We want the unmatched refs as well, e.g. to display.

In order to do that offline, we would need to sort all matches by source and the original refs file by source ident.

Then iterate over both files and fill in the unmatched targets (unstructured, csl, ...)

Options:

we have source ident and ref_index (+1)
can sort biblioref by source ident
can sort refs by source ident

That's almost the same as the matching process, just with another function working on the match group.
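
A sketch of that merge step in Python, assuming both inputs are already sorted by source ident in byte order (LC_ALL=C sort). Field names on the refs side ("release_ident", "index") are assumptions for illustration; "source_release_ident" and "ref_index" follow the notes above, and `offset=1` encodes the "(+1)" relation between the two indices.

```python
from itertools import groupby
from operator import itemgetter


def unmatched_refs(matches, refs, offset=1):
    """Yield refs without a match; both iterables must be sorted by source ident.

    matches: dicts with "source_release_ident" and "ref_index"
    refs:    dicts with "release_ident" and "index" (assumed field names)
    """
    match_groups = groupby(matches, key=itemgetter("source_release_ident"))
    current = next(match_groups, None)
    for ident, group in groupby(refs, key=itemgetter("release_ident")):
        # Advance the match side until it catches up with this source ident.
        while current is not None and current[0] < ident:
            current = next(match_groups, None)
        matched = set()
        if current is not None and current[0] == ident:
            matched = {m["ref_index"] for m in current[1]}
        for ref in group:
            if ref["index"] + offset not in matched:
                yield ref
```

Both files are streamed once and only one match group is held in memory at a time, which is the point of sorting first.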

OpenLibrary

has isbns, 10, 13
how many isbn in refs (?)

Sidenote, also in refs:

"title": "B l e u m e r, M. S t r a u s s. Divertible Protocols and Atomic Proxy Cryptography",

How many titles have "s p a c e s" in title?
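
One way to count them: flag titles that contain a run of single letters separated by single spaces. A heuristic sketch in Python; the threshold of four letters in a row is an arbitrary assumption.

```python
import re

# At least four single letters in a row, each separated by exactly one space.
SPACED_RE = re.compile(r"(?:\b[^\W\d_] ){3,}[^\W\d_]\b")


def has_letterspacing(title):
    """Return True if the title contains a letter-spaced word like "S t r a u s s"."""
    return SPACED_RE.search(title) is not None
```

Running this over .biblio.title and counting the True results would answer the question above.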

ISBN normalization.

In refs, we mostly have ISBN in unstructured:

ISBN 3-906166-35-X.
ISBN 978-0- 470-25003-7.
Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org. (2007).
ISBN 88-13-19785-3
ISBN GB3N-CL4-5HL4.

About 600/1M "isbn" in unstructured.

$ zstdcat -T0 fatcat_scholar_work_fulltext.refs.json.zst | head -1000000 | jq .biblio.unstructured | grep -c -i isbn
675

So maybe 500k isbn in total?

need to find them, then validate them
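
A sketch of find-then-validate in Python. The candidate regex is an assumption tuned to the samples above (text after "ISBN", tolerant of hyphens and stray spaces); the check digit rules are the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) ones.

```python
import re

# Candidate: digits/X after "ISBN", allowing hyphens and stray spaces.
ISBN_RE = re.compile(r"ISBN[:\s]*([0-9Xx][0-9Xx\- ]{8,20})", re.IGNORECASE)


def is_valid_isbn10(s):
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    # Weights run from 10 down to 1; "X" stands for 10 in the last position.
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    if not re.fullmatch(r"97[89]\d{10}", s):
        return False
    # Weights alternate 1, 3, 1, 3, ...
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbn(text):
    """Return the first valid ISBN in text, without separators, or None."""
    for m in ISBN_RE.finditer(text):
        compact = re.sub(r"[\- ]", "", m.group(1)).upper()
        if is_valid_isbn13(compact[:13]):
            return compact[:13]
        if is_valid_isbn10(compact[:10]):
            return compact[:10]
    return None
```

With this, "ISBN 978-0- 470-25003-7." normalizes to 9780470250037, while "ISBN GB3N-CL4-5HL4." is rejected.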

Notes on URLList

about 25M urls
about 11075871 seem to have a "doi"

A subtle bug: a doi in refs ends with tab:

10.1002/andp.19975090102\t
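
A defensive cleanup for this case, sketched in Python. It handles both a real tab character (after JSON parsing) and a literal backslash-t (when the JSON was handled as raw text). Lowercasing is an assumed normalization, based on DOIs being case-insensitive.

```python
import re


def clean_doi(raw):
    """Strip whitespace and escaped control characters around a DOI string."""
    doi = raw.strip()
    # Drop escaped tabs/newlines left over from handling JSON as raw text.
    doi = re.sub(r"(\\[tnr])+$", "", doi).strip()
    # Assumption: anything not starting with "10." is not a DOI.
    return doi.lower() if doi.startswith("10.") else None
```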

URL lookup via pig

failed after a week; map spill

2021-05-21 15:04:25,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete
^C2021-05-24 15:22:57,073 [Thread-6] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ia802401.us.archive.org/207.241.228.181:6932
2021-05-24 15:22:58,245 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 64% complete
2021-05-24 15:22:58,778 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 71% complete
2021-05-24 15:23:02,763 [Thread-6] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Job job_pigexec_0 killed

real    8276m35.071s
user    425m6.748s
sys     52m21.012s