# V3

V2 plus:

* [ ] no dups
* [ ] unmatched
* [ ] wikipedia
* [ ] some unstructured refs
* [ ] OL
* [ ] weblinks

## Duplicates

```
$ zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | jq -rc 'select(.source_release_ident == .target_release_ident)'
```

Only 0.001% though.

## Unstructured

* about 300M w/o title, etc.
* some docs mention a "doi" in "unstructured"

Possible extractable information:

* page ranges with regex
* doi, isbn, issn
* author names with some NER?
* journal abbreviation

Numbers:

    $ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | LC_ALL=C pv -l | LC_ALL=C grep -c -i "doi"
    2772622

Sometimes, the key contains an ISBN:

```
"key":"9781108604222#EMT-rl-1_BIBe-r-213"
```

key with doi:

```
"index":63,"key":"10.1002/9781118960608.gbm01177-BIB6970|gbm01177-cit-6970","locator":"7
```

ISBN format:

* 978-9279113639
* 9781566773362
* 978-80-7357-299-0

URLs may be broken:

```
http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf
```

* 2030021 DOI
* 36376 arxiv

Some cases only contain authors and year, e.g.

```
{
  "biblio": {
    "contrib_raw_names": [
      "W H Hartmann",
      "B H Hahn",
      "H Abbey",
      "L E Shulman"
    ],
    "unstructured": "Hartmann, W. H., Hahn, B. H., Abbey, H., and Shulman, L. E., Lancer, 1965, 1, 123.",
    "year": 1965
  },
```

Here, we could run a query, e.g.
https://fatcat.wiki/release/search?q=hahn+shulman+abbey+hartmann, and check
for result set size, year, etc.

Other example:

* https://fatcat.wiki/release/search?q=Goudie+Anderson+Gray+boyle+buchanan+year%3A1965

```
{
  "biblio": {
    "contrib_raw_names": [
      "R B Goudie",
      "J R Anderson",
      "K G Gray",
      "J A Boyle",
      "W W Buchanar"
    ],
    "unstructured": "Goudie, R. B., Anderson, J. R., Gray, K. G., Boyle, J. A., and Buchanar, W. W., ibid., 1965, 1, 322.",
    "year": 1965
  },
```

----

With `skate-from-unstructured` we get some more doi and arxiv identifiers from
unstructured refs (unstructured, key). How many?

```
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | pv -l | \
    skate-from-unstructured | jq -rc 'select(.biblio.doi != null or .biblio.arxiv_id != null)' | wc -l
```

The https://anystyle.io/ CRF implementation seems really useful to parse out
the rest of the unstructured data.

* [ ] parse fields with some containerized anystyle (create an OCI container and somehow get it running w/ or w/o docker; maybe podman allows running it as a library?)

Example:

```
$ anystyle -f json parse xxxx.txt
[
  {
    "citation-number": [
      "3. "
    ],
    "author": [
      {
        "family": "JP",
        "given": "Morgan"
      },
      {
        "family": "CS",
        "given": "Bailey"
      }
    ],
    "title": [
      "Cauda equina syndrome in the dog: radiographical evaluation"
    ],
    "volume": [
      "21"
    ],
    "pages": [
      "45 – 58"
    ],
    "type": "article-journal",
    "container-title": [
      "J Small Anim Practice"
    ],
    "date": [
      "1980"
    ]
  }
]
```

Can dump the whole unstructured list into a single file (one ref per line).

* 10K lines take: 32s
* 100M lines would probably take ~100h to parse (at 32s per 10K lines: 320,000s, about 89h)

----

* from 308 "UnmatchedRefs" we would extract doi/arxiv for 47696153.

Stats:

* 759,516,507 citation links.
* ~723,350,228 + 47,696,153
* 771046381 edges

----

* aitio has docker installed

```
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false
```

Maybe build an alpine based image?

Both anystyle and grobid use wapiti under the hood, but they seem to differ
slightly. anystyle seems to be a smaller codebase overall. Grobid has an API
and various modes.

Note-to-self: Run a comparison between wapiti based citation extractors.
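
For reference, regex-based identifier extraction from `unstructured` strings might look roughly like the Python sketch below. This is not the `skate-from-unstructured` implementation; the patterns and the names `find_doi` and `find_arxiv_id` are assumptions for illustration, and real-world references are messier than these patterns cover.

```python
import re

# Assumed patterns (hypothetical, not taken from skate).
# DOI: "10.", a registrant code, a slash, then a suffix without whitespace.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"'<>|]+")
# arXiv: only the new-style "arXiv:YYMM.NNNNN" form is handled here.
ARXIV_RE = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_doi(text):
    """Return the first DOI-like string in text, lowercased, or None."""
    m = DOI_RE.search(text)
    if m is None:
        return None
    # Drop punctuation that usually belongs to the sentence, not the DOI.
    return m.group(0).rstrip(".,;)").lower()


def find_arxiv_id(text):
    """Return the first new-style arXiv identifier in text, or None."""
    m = ARXIV_RE.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(find_doi("J. Doe, Ann. Phys. doi:10.1002/andp.19975090102\t"))
    print(find_arxiv_id("Vaswani et al., arXiv:1706.03762v5"))
    print(find_doi("Hartmann, W. H., Hahn, B. H., Lancer, 1965, 1, 123."))
```

Since the DOI pattern stops at whitespace, trailing tabs or spaces never end up in the extracted value. Publisher-specific `key` values with extra suffixes after the DOI would need separate handling.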

----

```
$ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL=C wc -l
260768384
```

----

# Wikipedia

* /magna/data/wikipedia_citations_2020-07-14

A first run only got 64008 docs; improbable that we are missing so many DOIs.
Also, need to generalize some skate code a bit.

----

# Verification stats

* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
* have 29290668 clusters of size <= 10

```
$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
```

A 5M sample.

```
$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous
```

----

# Unmatched

* We want the unmatched refs as well, e.g. to display.

In order to do that offline, we would need to sort all matches by source ident
and the original refs file by source ident. Then iterate over both files and
fill in the unmatched targets (unstructured, csl, ...).

Options:

* we have `source ident` and `ref_index` (+1)
* can sort biblioref by source ident
* can sort refs by source ident

That's almost the same as the matching process, just another function working
on the match group.

----

# OpenLibrary

* has ISBNs (ISBN-10 and ISBN-13)
* how many isbn in refs (?)

Sidenote, also in refs:

```
"title": "B l e u m e r, M. S t r a u s s. Divertible Protocols and Atomic Proxy Cryptography",
```

How many titles have "s p a c e s" in title?

----

ISBN normalization. In refs, we mostly have ISBN in unstructured:

```
ISBN 3-906166-35-X.
ISBN 978-0- 470-25003-7.
Austria. ISBN 3-900051-07-0, URL 962 http://www.R-project.org.
(2007). ISBN 88-13-19785-3
ISBN GB3N-CL4-5HL4.
```

About 600/1M "isbn" in unstructured.

```
$ zstdcat -T0 fatcat_scholar_work_fulltext.refs.json.zst | head -1000000 | jq .biblio.unstructured | grep -c -i isbn
675
```

So maybe 500k isbn in total?
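
Finding and validating ISBN candidates in `unstructured` could be sketched in Python as below. The candidate regex and all function names are assumptions for illustration; only the ISBN-10 (mod 11) and ISBN-13 (mod 10) check digit rules are standard.

```python
import re

# Assumed candidate pattern: the literal "ISBN", then a digit, then digits,
# hyphens or spaces, ending in a digit or X. A number directly following
# the ISBN (e.g. a year) gets swallowed and makes the candidate fail.
ISBN_CANDIDATE_RE = re.compile(
    r"ISBN[:\s]*([0-9][0-9\- ]{8,20}[0-9Xx])", re.IGNORECASE
)


def normalize(raw):
    """Strip hyphens and whitespace, uppercase a trailing x."""
    return re.sub(r"[\s\-]", "", raw).upper()


def is_valid_isbn10(s):
    """ISBN-10: weights 10..1, X counts as 10, sum divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if not re.fullmatch(r"\d{13}", s):
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return normalized ISBNs found in text that pass a checksum."""
    found = []
    for m in ISBN_CANDIDATE_RE.finditer(text):
        s = normalize(m.group(1))
        if is_valid_isbn10(s) or is_valid_isbn13(s):
            found.append(s)
    return found


if __name__ == "__main__":
    sample = "ISBN 3-906166-35-X. ISBN 978-0- 470-25003-7. ISBN GB3N-CL4-5HL4."
    print(find_isbns(sample))
```

The checksum step is what filters out strings like `GB3N-CL4-5HL4` that merely follow the word "ISBN".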

* need to find them, then validate them

----

## Notes on URLList

* about 25M urls
* about 11075871 seem to have a "doi"

----

A subtle bug: a doi in refs ends with tab:

```
10.1002/andp.19975090102\t
```

----

## URL lookup via pig

* failed after a week; map spill

```
2021-05-21 15:04:25,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete
^C2021-05-24 15:22:57,073 [Thread-6] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ia802401.us.archive.org/207.241.228.181:6932
2021-05-24 15:22:58,245 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 64% complete
2021-05-24 15:22:58,778 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 71% complete
2021-05-24 15:23:02,763 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Job job_pigexec_0 killed

real    8276m35.071s
user    425m6.748s
sys     52m21.012s
```