# V3

V2 plus:

* [ ] wikipedia
* [ ] some unstructured refs
* [ ] OL
* [ ] weblinks

## Unstructured

* about 300M w/o title, etc.
* some docs mention a "doi" in "unstructured"

Possible extractable information:

* page ranges with regex
* doi, isbn, issn
* author names with some NER?
* journal abbreviation

Numbers:

    $ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | LC_ALL=C pv -l | LC_ALL=C grep -c -i "doi"
    2772622

Sometimes, the key contains an ISBN:

```
"key":"9781108604222#EMT-rl-1_BIBe-r-213"
```

Key with doi:

```
"index":63,"key":"10.1002/9781118960608.gbm01177-BIB6970|gbm01177-cit-6970","locator":"7
```

ISBN format:

* 978-9279113639
* 9781566773362
* 978-80-7357-299-0

URLs may be broken:

```
http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf
```

* 2030021 DOI
* 36376 arxiv

Some cases only contain authors and year, e.g.

```
{
  "biblio": {
    "contrib_raw_names": [
      "W H Hartmann",
      "B H Hahn",
      "H Abbey",
      "L E Shulman"
    ],
    "unstructured": "Hartmann, W. H., Hahn, B. H., Abbey, H., and Shulman, L. E., Lancer, 1965, 1, 123.",
    "year": 1965
  },
```

Here, we could run a query, e.g. https://fatcat.wiki/release/search?q=hahn+shulman+abbey+hartmann, and check for result set size, year, etc.

Other example:

* https://fatcat.wiki/release/search?q=Goudie+Anderson+Gray+boyle+buchanan+year%3A1965

```
{
  "biblio": {
    "contrib_raw_names": [
      "R B Goudie",
      "J R Anderson",
      "K G Gray",
      "J A Boyle",
      "W W Buchanar"
    ],
    "unstructured": "Goudie, R. B., Anderson, J. R., Gray, K. G., Boyle, J. A., and Buchanar, W. W., ibid., 1965, 1, 322.",
    "year": 1965
  },
```

----

With `skate-from-unstructured` we get some more doi and arxiv identifiers from unstructured refs (unstructured, key). How many?

```
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | pv -l | \
    skate-from-unstructured | jq -rc 'select(.biblio.doi != null or .biblio.arxiv_id != null)' | wc -l
```

The https://anystyle.io/ CRF implementation seems really useful to parse out the rest of the unstructured data.

* [ ] parse fields with some containerized anystyle (create an oci container and somehow get it running w/ or w/o docker; maybe podman allows to run as library?)

Example:

```
$ anystyle -f json parse xxxx.txt
[
  {
    "citation-number": [
      "3. "
    ],
    "author": [
      {
        "family": "JP",
        "given": "Morgan"
      },
      {
        "family": "CS",
        "given": "Bailey"
      }
    ],
    "title": [
      "Cauda equina syndrome in the dog: radiographical evaluation"
    ],
    "volume": [
      "21"
    ],
    "pages": [
      "45 – 58"
    ],
    "type": "article-journal",
    "container-title": [
      "J Small Anim Practice"
    ],
    "date": [
      "1980"
    ]
  }
]
```

Can dump the whole unstructured list into a single file (one per line).

* 10K lines take: 32s
* 100M would take probably ~100h to parse.

----

* from 308 "UnmatchedRefs" we would extract doi/arxiv for 47696153.

Stats:

* 759,516,507 citation links.
* ~723,350,228 + 47,696,153
* 771046381 edges

----

* aitio has docker installed

```
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false
```

Maybe build an alpine based image?

Both anystyle and grobid use wapiti under the hood; but they seem to differ slightly. anystyle seems to be a smaller codebase overall. Grobid has an api and various modes.

Note-to-self: Run a comparison between wapiti based citation extractors.

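
Sketch of the identifier extraction from `unstructured` and `key` noted above, for quick experiments outside the pipeline. This is not how `skate-from-unstructured` does it; the regexes, the function name `extract_ids` and the assumption that `key` sits at the top level next to `biblio` are all made up for illustration. DOIs taken from a `key` may still carry a publisher suffix (e.g. `-BIB6970`) that would need extra cleanup.

```python
import re

# Assumed patterns for illustration only; not taken from skate.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"|]+')
ARXIV_RE = re.compile(r'arxiv:\s*(\d{4}\.\d{4,5})', re.IGNORECASE)
ISBN_RE = re.compile(r'\b97[89][-\d]{10,14}\b')


def extract_ids(ref):
    """Return identifiers found in biblio.unstructured and key (assumed layout)."""
    biblio = ref.get("biblio", {})
    # Search both fields at once; either may be missing.
    text = " ".join(filter(None, [biblio.get("unstructured"), ref.get("key")]))
    found = {}
    m = DOI_RE.search(text)
    if m:
        # Trailing sentence punctuation is not part of the DOI.
        found["doi"] = m.group(0).rstrip(".,;)")
    m = ARXIV_RE.search(text)
    if m:
        found["arxiv_id"] = m.group(1)
    for m in ISBN_RE.finditer(text):
        digits = m.group(0).replace("-", "")
        if len(digits) == 13:  # keep ISBN-13 candidates only, no checksum test
            found["isbn"] = digits
            break
    return found


if __name__ == "__main__":
    print(extract_ids({"key": "9781108604222#EMT-rl-1_BIBe-r-213"}))
    print(extract_ids({"biblio": {"unstructured": "See arXiv:1501.00001 for details"}}))
```
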
----

```
$ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL=C wc -l
260768384
```

----

# Wikipedia

* /magna/data/wikipedia_citations_2020-07-14

A first run only got 64008 docs; improbable that we are missing so many doi. Also, need to generalize some skate code a bit.

----

# Verification stats

* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
* have X cluster of size less than 10

```
$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
```

A 5M sample.

```
$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous
```
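
To read the status counts above as shares of the sample, a small helper along these lines might do. The function names `parse_uniq_c` and `shares` are hypothetical; the input is the `uniq -c` output pasted verbatim.

```python
def parse_uniq_c(text):
    """Parse `uniq -c` output lines of the form '<count> <label>'."""
    counts = {}
    for line in text.strip().splitlines():
        count, label = line.split()
        counts[label] = int(count)
    return counts


def shares(counts):
    """Return each label's percentage of the total count."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Counts copied from the 5M sample run above.
SAMPLE = """
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous
"""

if __name__ == "__main__":
    counts = parse_uniq_c(SAMPLE)
    for label, pct in shares(counts).items():
        print(f"{label:16s} {counts[label]:>9d} {pct:5.1f}%")
```

On these numbers, roughly 45% of the verified pairs come out as `StatusDifferent`, about 30% as `StatusStrong`, about 24% as `StatusExact` and under 1% as `StatusAmbiguous`.
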