Diffstat (limited to 'notes/misc/2022-04_missing_oa.md')
-rw-r--r--  notes/misc/2022-04_missing_oa.md  202
1 file changed, 202 insertions, 0 deletions
diff --git a/notes/misc/2022-04_missing_oa.md b/notes/misc/2022-04_missing_oa.md
new file mode 100644
index 00000000..9a5541b9
--- /dev/null
+++ b/notes/misc/2022-04_missing_oa.md
@@ -0,0 +1,202 @@
+
+Short data exploration of what OA content is missing, and how it might be crawled.
+
+Starting with "front page" query:
+
+    is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084
+
+    doi_prefix:10.6084 is figshare
+    doi_prefix:10.5281 is zenodo
+
+    14,658,673   66.56%   preserved and publicly accessible (bright)
+     3,453,052   15.68%   preserved but not publicly accessible (dark)
+     3,911,614   17.77%   no known independent preservation
+    22,023,339   100%     total
+
+Virtually all of the "dark" is also `in_shadows:true`. So the
+`preservation:none` is the high-impact target for crawling.
+
+Limiting to `publisher_type:big5`, almost zero `preservation:none`, and 1.34
+million (41%) dark.
+
+## Publisher Type
+
+Created a kibana graph of the above filters, graphing `publisher_type` ("Publisher Type breakdown of missing OA"):
+
+    <missing>   1769k   54%
+    longtail     852k   26%
+    society      195k    6%
+    unipress     130k    4%
+    scielo       114k    3.5%
+    then: repository, oa, commercial, big5
+
+## Containers
+
+    !container_id:* preservation:none is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084
+
+    1,993,639 missing preservation
+
+These are virtually all Datacite DOIs (not including figshare/zenodo), and
+start in 2008, ramping up. They are almost all missing `publisher_type` (which
+makes sense because they have no container).
+
+With the filters from above, here are some top containers missing content:
+
+    Missing                       1,993,639
+    e27twid5qnbqbboxlkrja2xz2a    12,537
+        "Proceedings of Indian National Science Academy"
+        almost zero preservation. DOAJ website is 404 for article (!), no longer in DOAJ (!)
+        some kind of bad metadata situation? almost all from 2015
+    fmoqnzpewvfrnm2ni4mbvvlney    9,350
+        "Chinese Medical Journal"
+        PMIDs only
+        missing/unpreserved is pre-2015 (significant!)
+    7l5xye7sc5emxfprwmqw2a7yxq    8,999
+        "Tidsskrift for Den norske legeforening" (norwegian medical)
+        bunch of PMIDs only; sporadic preservation coverage
+    ujftxdg3knebxhrqg4qjznz2he    5,903
+        "International Research Journal" (russian)
+        these are by-issue, with DOIs redirecting to pages inside issue (!)
+    kfzef6kfwbhpnfw3cifit7zw7q    5,678
+        "lectures"
+        hosted on openedition
+        HTML ingest would work (!)
+    gr4g5qzzcnembf4om6yjb6qf34    5,020
+        "计算机科学"
+        mostly via dblp. some DOIs, presumably chinese?
+    bl77onlbbbhu5d6ohpjw2ypojy    4,994
+        "EOS" (from American Geophysical Union / AGU)
+        large publication, mostly preserved (dark)
+        mix of wiley.com OA (but hard to crawl?) and web/HTML stuff
+    3afvqhtpnjd5nmiphwxlxzirde    4,877
+        "Medical Science Monitor"
+        large publication, mixed preservation
+        annoying PDF link situation (hard to crawl?)
+    tulajqojzjabfc4iybyv6poi2e    4,786
+        "Dermatology Online Journal"
+        large publication, mixed preservation
+        some just pmid
+        some HTML or ePub-only
+        escholarship.org
+
+A take-away here for me is that containers are pretty heterogeneous and have
+diverse issues.
+
+TODO: ingest things like: https://escholarship.org/uc/item/02v86610
+    from container_tulajqojzjabfc4iybyv6poi2e
+
+### revues.org / openedition
+
+Many of these seem like they would ingest fine via HTML.
+
+    doi_prefix:10.4000
+
+    151,565   34.3%    preserved and publicly accessible (bright)
+      7,211   1.64%    preserved but not publicly accessible (dark)
+    283,139   64.08%   no known independent preservation
+    441,915   100%     total
+
+    article-journal     230,146   63% preserved
+    chapter             200,724    2% preserved
+    book                 10,971   12% preserved
+    paper-conference         74
+
+Chapters and books don't seem as amenable to ingest... and indeed are mostly
+not marked `is_oa:true`.
+
+DONE: bulk html-mode ingest, expecting about 80k requests:
+
+    doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-bulk \
+        --ingest-type html \
+        query "doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true"
+    => Expecting 80032 release objects in search queries
+    => Counter({'ingest_request': 80032, 'elasticsearch_release': 80032, 'estimate': 80032, 'kafka': 80032})
+
+NOTE: have this be the default ingest type for this DOI prefix? not sure, some
+do come through as PDF just fine
+
+## Source of Records
+
+Starting with the 3,844,142 or so `preservation:none`.
+
+    doi                 3.204m
+        datacite            1.995m
+        crossref            1.087m
+        <unknown>           109k
+        jalc                12k
+    doaj_id             553k
+    pmid                192k
+    dblp_id             29k
+    arxiv_id, pmcid     0
+
+I'm surprised how good dblp coverage is? Oh, but those are almost entirely
+missing OA status, that explains it.
+
+    # NOTE: not specifically OA
+    dblp_id:* year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+
+      406,235   22.54%   preserved and publicly accessible (bright)
+       59,009   3.28%    preserved but not publicly accessible (dark)
+    1,337,554   74.2%    no known independent preservation
+    1,802,798   100%     total
+
+Looks like doi and DOAJ are big sources.
+
+    # NOTE: DOAJ implies OA, I checked and numbers are ~same
+    doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+
+      588,364   47.27%   preserved and publicly accessible (bright)
+      103,206   8.3%     preserved but not publicly accessible (dark)
+      553,353   44.45%   no known independent preservation
+    1,244,923   100%     total
+
+DOAJ ingest seems important to optimize!
+
+    !publisher_type:big5 container_id:* doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+    => 548,709 missing preservation
+
+    doaj_id:*
+    => 589,915 missing preservation
+
+Datacite is the biggest category though, even with zenodo/figshare removed.
+
+TODO: largest datacite DOI prefixes
+TODO: check sandcrawler DB to see DOAJ ingest status; maybe these are entirely missing URLs? or just not crawling well?
+TODO: dig in to "longtail" more... some random ones?
+
+## Largest DOI Prefixes
+
+    <missing>   640,104
+    10.48550    1,543,167
+        the new arxiv.org prefix
+    10.4000     68,267
+        revues / openedition (handled above)
+    10.25384    60,063
+        figshare / SAGE
+    10.3917     52,195
+        cairn.info
+    10.25673    41,565
+        some random IR? opendata.uni-halle.de
+        TODO: ingest this type of item, possibly using dataset->file crawler
+    10.3406     33,778
+        persee.fr
+        blocks bots (don't attempt ingest)
+    10.3205     33,540
+        "german medical science"
+        HTML articles, PDF links
+        TODO: fix ingest
+        https://www.egms.de/static/en/journals/gms/2020-18/000284.shtml
+    10.17605    30,365
+        osf.io
+        TODO: fix ingest (?)
+    10.25446    26,614
+        figshare / oxford
+        "File(s) not publicly available"
+        but "CC BY 4.0"? ugh
+
+TODO: HTML crawl cairn.info (10.3917)
+TODO: ignore 10.25384, 10.25446 (figshare)
+TODO: ignore arxiv.org prefix (10.48550) in default dashboard
+TODO: handle arxiv.org DOIs better (merge, count as preserved, etc)
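
The bright / dark / none percentages quoted in the notes can be recomputed from the raw counts. The following is a minimal sketch, not part of the commit: the function name `preservation_breakdown` and the dictionary keys are hypothetical, and the counts are copied from the "front page" and `doi_prefix:10.4000` tables in the notes. It uses ordinary rounding, so a result may differ by 0.01 from a figure in the notes, some of which look as if they were rounded up.

```python
from typing import Dict


def preservation_breakdown(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each category's share of the total as a percentage (2 decimals).

    `counts` maps a category label (hypothetical names such as "bright",
    "dark", "none") to the number of releases in that category.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to summarize")
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# Counts copied from the notes: the "front page" query ...
FRONT_PAGE = {"bright": 14_658_673, "dark": 3_453_052, "none": 3_911_614}
# ... and the doi_prefix:10.4000 (revues.org / openedition) query.
OPENEDITION = {"bright": 151_565, "dark": 7_211, "none": 283_139}

if __name__ == "__main__":
    for label, counts in (("front page", FRONT_PAGE), ("10.4000", OPENEDITION)):
        shares = preservation_breakdown(counts)
        print(label, sum(counts.values()), shares)
```

Keeping the raw counts next to the derived shares makes it easy to re-run the same check when the queries are repeated later and the numbers have shifted.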
