-rw-r--r--  notes/misc/2020-08_metadata.md  79
-rw-r--r--  notes/misc/2020_ingest_ideas.md  14
-rw-r--r--  notes/misc/2022-04_missing_oa.md  202
-rw-r--r--  notes/misc/UNSORTED.txt (renamed from notes/UNSORTED.txt)  0
-rw-r--r--  notes/misc/example_entities.txt (renamed from notes/example_entities.txt)  0
-rw-r--r--  notes/misc/examples/content_scope.txt  45
-rw-r--r--  notes/misc/examples/grobid_500.txt  4
-rw-r--r--  notes/misc/examples/personal_favorites.md  2
-rw-r--r--  notes/misc/examples/random_journals.txt  5
-rw-r--r--  notes/misc/examples/random_works.txt  9
-rw-r--r--  notes/misc/examples/video_works.txt  4
-rw-r--r--  notes/misc/horror_stories.md  10
-rw-r--r--  notes/misc/rust_libraries.txt (renamed from notes/rust_libraries.txt)  0
-rw-r--r--  notes/misc/test_works.txt (renamed from notes/test_works.txt)  0
-rw-r--r--  notes/misc/thesis_uk.md  6
-rw-r--r--  notes/misc/unsorted.txt  19
-rw-r--r--  notes/misc/webface_iteration.md  14
-rw-r--r--  notes/misc/webface_notes.txt (renamed from notes/webface_notes.txt)  0
18 files changed, 413 insertions, 0 deletions
diff --git a/notes/misc/2020-08_metadata.md b/notes/misc/2020-08_metadata.md
new file mode 100644
index 00000000..12cd6fb0
--- /dev/null
+++ b/notes/misc/2020-08_metadata.md
@@ -0,0 +1,79 @@
+
+## Artificial Containers
+
+ biorxiv
+ medrxiv
+ doi_prefix:10.1101
+ publisher:"Cold Spring Harbor Laboratory"
+ -> article-journal? article? should match "paper" filter
+ -> status: draft? submitted?
+ -> there is some flag in crossref metadata...
+
+ arxiv
+ -> article-journal?
+ -> set container_name?
+
+ protocols.io
+ doi_prefix:10.17504/protocols.io.
+ container_name:protocols.io
+
+ 10.25384/sage. -> sage.figshare.com
+ -> at least set container_name
+
+ figshare
+ doi_prefix:10.6084
+ -> at least set container_name
+
+ zenodo
+ -> at least set container_name
+
+Maybe? Later?
+
+ PsycEXTRA
+ container_name:"PsycEXTRA Dataset"
+ doi_prefix:10.1037
+ crossref
+ => 300k+ releases
+ => subtitle is 'number' (like "(577982012-038)")
+ => dataset
+ => publication status unknown
+
+ f1000 reviews
+ container_name:"F1000 - Post-publication peer review of the biomedical literature"
+ title:"Faculty of 1000 evaluation for "[...]
+ doi_prefix:10.3410/
+ crossref
+ => 222k releases
+ => type -> peer-review (?)
+
+ IUPAC Standards Online
+
+ GBIF
+ doi_prefix: 10.15468/dl.
+ => 838k releases
+
+==================
+
+
+later fatcat:
+- pmid+crossref pre-prints
+ https://fatcat.wiki/release/d4lrxugtqbapxgi4jrrlmzjily
+- zenodo: handle "repost from another ISSN" case (drop issn/container_id)
+- doi_prefix:10.18720 no container metadata; should be thesis type?
+- research square (10.21203) metadata (journal articles, pre-print or published?)
+- journals.ub.uni-heidelberg.de metadata is poor? no journal link
+- try_work_lookup() -> part of try update?
+ => zenodo "isidentical"
+ => zenodo "isversionof"
+ => figshare "isversionof"
+ => later, try_work_fuzzy()
+- biorxiv, medrxiv container name (and/or `container_id`?)
+ => and "article" not "post"
+- datacite container:"microPublication Biology" -> micropub type?
+- ES container index: `publisher_type` (?)
+- arxiv: remove release_type="report" logic
+- arxiv: don't include DOI, just merge under work
+- datacite release_type: resourceType=SaComponent -> 'component'
+ https://api.datacite.org/dois/10.1371/journal.pbio.0020429.g004
+- datacite title `{:unav}` (PLOS)
+ https://fatcat.wiki/release/search?q=doi_prefix%3A10.1371+unav
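
The DOI-prefix rules listed under "Artificial Containers" in the file above could be applied with a small lookup, roughly as sketched here. This is only an illustration: the table and the function name `guess_container_name` are hypothetical, and only the protocols.io name is stated explicitly in the notes; the names used for figshare and SAGE/figshare are assumptions. bioRxiv and medRxiv share `10.1101`, so a prefix alone cannot separate them and they are left out.

```python
# Hypothetical lookup table. Prefixes come from the notes; the container
# names for figshare and sage.figshare.com are assumptions.
CONTAINER_NAME_BY_DOI_PREFIX = {
    "10.17504/protocols.io.": "protocols.io",
    "10.25384/sage.": "sage.figshare.com",
    "10.6084": "figshare",
}


def guess_container_name(doi):
    """Return a container name for a DOI, or None if no prefix matches."""
    doi = doi.strip().lower()
    # Try longer prefixes first so the most specific rule wins.
    for prefix in sorted(CONTAINER_NAME_BY_DOI_PREFIX, key=len, reverse=True):
        if doi.startswith(prefix):
            return CONTAINER_NAME_BY_DOI_PREFIX[prefix]
    return None
```

A DOI with an unlisted prefix returns `None`, so a caller can leave `container_name` unset instead of guessing.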
diff --git a/notes/misc/2020_ingest_ideas.md b/notes/misc/2020_ingest_ideas.md
new file mode 100644
index 00000000..fc5ea807
--- /dev/null
+++ b/notes/misc/2020_ingest_ideas.md
@@ -0,0 +1,14 @@
+
+https://philpapers.org/
+=> 2.4m entries
+
+https://philarchive.org/
+=> 50k OA papers
+=> OAI-PMH
+
+https://isidore.science/
+=> humanities in french, english, spanish
+=> APIs
+
+http://ascl.net/
+=> Astrophysics Source Code Library
diff --git a/notes/misc/2022-04_missing_oa.md b/notes/misc/2022-04_missing_oa.md
new file mode 100644
index 00000000..9a5541b9
--- /dev/null
+++ b/notes/misc/2022-04_missing_oa.md
@@ -0,0 +1,202 @@
+
+Short data exploration of what OA content is missing, and how it might be crawled.
+
+Starting with "front page" query:
+
+ is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084
+
+ doi_prefix:10.6084 is figshare
+ doi_prefix:10.5281 is zenodo
+
+ 14,658,673 66.56% preserved and publicly accessible (bright)
+ 3,453,052 15.68% preserved but not publicly accessible (dark)
+ 3,911,614 17.77% no known independent preservation
+ 22,023,339 100% total
+
+Virtually all of the "dark" is also `in_shadows:true`. So the
+`preservation:none` is the high-impact target for crawling.
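
The percentages in the breakdown above can be re-derived from the raw counts. A minimal sketch, assuming the three counts are exactly as listed; the dictionary keys and the helper name `share` are made up for illustration.

```python
# Counts copied from the preservation breakdown above.
counts = {
    "bright": 14_658_673,  # preserved and publicly accessible
    "dark": 3_453_052,     # preserved but not publicly accessible
    "none": 3_911_614,     # no known independent preservation
}
total = sum(counts.values())


def share(count, total):
    """Percentage of total, rounded to two decimal places."""
    return round(100.0 * count / total, 2)


shares = {name: share(count, total) for name, count in counts.items()}

for name, pct in shares.items():
    print(f"{counts[name]:>12,}  {pct:6.2f}%  {name}")
print(f"{total:>12,}  100%  total")
```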
+
+Limiting to `publisher_type:big5`, almost zero `preservation:none`, and 1.34
+million (41%) dark.
+
+## Publisher Type
+
+Created a Kibana graph of the above filters, graphing `publisher_type` ("Publisher Type breakdown of missing OA"):
+
+ <missing> 1769k 54%
+ longtail 852k 26%
+ society 195k 6%
+ unipress 130k 4%
+ scielo 114k 3.5%
+ then: repository, oa, commercial, big5
+
+## Containers
+
+ !container_id:* preservation:none is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084
+
+ 1,993,639 missing preservation
+
+These are virtually all Datacite DOIs (not including figshare/zenodo), and
+start in 2008, ramping up. They are almost all missing `publisher_type` (which
+makes sense because they have no container).
+
+With the filters from above, here are some top containers missing content:
+
+ Missing 1,993,639
+ e27twid5qnbqbboxlkrja2xz2a 12,537
+ "Proceedings of Indian National Science Academy"
+ almost zero preservation. DOAJ website is 404 for article (!), no longer in DOAJ (!)
+ some kind of bad metadata situation? almost all from 2015
+ fmoqnzpewvfrnm2ni4mbvvlney 9,350
+ "Chinese Medical Journal"
+ PMIDs only
+ missing/unpreserved is pre-2015 (significant!)
+ 7l5xye7sc5emxfprwmqw2a7yxq 8,999
+ "Tidsskrift for Den norske legeforening" (norwegian medical)
+ bunch of PMIDs only; sporadic preservation coverage
+ ujftxdg3knebxhrqg4qjznz2he 5,903
+ "International Research Journal" (russian)
+ these are by-issue, with DOIs redirecting to pages inside issue (!)
+ kfzef6kfwbhpnfw3cifit7zw7q 5,678
+ "lectures"
+ hosted on openeditions
+ HTML ingest would work (!)
+ gr4g5qzzcnembf4om6yjb6qf34 5,020
+ "计算机科学"
+ mostly via dblp. some DOIs, presumably chinese?
+ bl77onlbbbhu5d6ohpjw2ypojy 4,994
+ "EOS" (from American Geophysical Union / AGU)
+ large publication, mostly preserved (dark)
+ mix of wiley.com OA (but hard to crawl?) and web/HTML stuff
+ 3afvqhtpnjd5nmiphwxlxzirde 4,877
+ "Medical Science Monitor"
+ large publication, mixed preservation
+ annoying PDF link situation (hard to crawl?)
+ tulajqojzjabfc4iybyv6poi2e 4,786
+ "Dermatology Online Journal"
+ large publication, mixed preservation
+ some just pmid
+ some HTML or ePub-only
+ escholarship.org
+
+A take-away here for me is that containers are pretty heterogeneous and have
+diverse issues.
+
+TODO: ingest things like: https://escholarship.org/uc/item/02v86610
+ from container_tulajqojzjabfc4iybyv6poi2e
+
+### revues.org / openedition
+
+Many of these seem like they would ingest fine via HTML.
+
+ doi_prefix:10.4000
+
+ 151,565 34.3% preserved and publicly accessible (bright)
+ 7,211 1.64% preserved but not publicly accessible (dark)
+ 283,139 64.08% no known independent preservation
+ 441,915 100% total
+
+ article-journal 230,146 63% preserved
+ chapter 200,724 2% preserved
+ book 10,971 12% preserved
+ paper-conference 74
+
+Chapters and books don't seem as amenable to ingest... and indeed are mostly
+not marked `is_oa:true`.
+
+DONE: bulk html-mode ingest, expecting about 80k requests:
+
+ doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true
+
+ ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-bulk \
+ --ingest-type html \
+ query "doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true"
+ => Expecting 80032 release objects in search queries
+ => Counter({'ingest_request': 80032, 'elasticsearch_release': 80032, 'estimate': 80032, 'kafka': 80032})
+
+NOTE: have this be the default ingest type for this DOI prefix? not sure, some
+do come through as PDF just fine
+
+## Source of Records
+
+Starting with the 3,844,142 or so `preservation:none`.
+
+ doi 3.204m
+ datacite 1.995m
+ crossref 1.087m
+ <unknown> 109k
+ jalc 12k
+ doaj_id 553k
+ pmid 192k
+ dblp_id 29k
+ arxiv_id, pmcid 0
+
+I'm surprised how good dblp coverage is? Oh, but those are almost entirely
+missing OA status, that explains it.
+
+ # NOTE: not specifically OA
+ dblp_id:* year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+
+ 406,235 22.54% preserved and publicly accessible (bright)
+ 59,009 3.28% preserved but not publicly accessible (dark)
+ 1,337,554 74.2% no known independent preservation
+ 1,802,798 100% total
+
+Looks like doi and DOAJ are big sources.
+
+ # NOTE: DOAJ implies OA, I checked and numbers are ~same
+ doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+
+ 588,364 47.27% preserved and publicly accessible (bright)
+ 103,206 8.3% preserved but not publicly accessible (dark)
+ 553,353 44.45% no known independent preservation
+ 1,244,923 100% total
+
+DOAJ ingest seems important to optimize!
+
+ !publisher_type:big5 container_id:* doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+ => 548,709 missing preservation
+
+ doaj_id:*
+ => 589,915 missing preservation
+
+Datacite is the biggest category, though, even with zenodo/figshare removed.
+
+TODO: largest datacite DOI prefixes
+TODO: check sandcrawler DB to see DOAJ ingest status; maybe these are entirely missing URLs? or just not crawling well?
+TODO: dig in to "longtail" more... some random ones?
+
+## Largest DOI Prefixes
+
+ <missing> 640,104
+ 10.48550 1,543,167
+ the new arxiv.org prefix
+ 10.4000 68,267
+ revues / openedition (handled above)
+ 10.25384 60,063
+ figshare / SAGE
+ 10.3917 52,195
+ cairn.info
+ 10.25673 41,565
+ some random IR? opendata.uni-halle.de
+ TODO: ingest this type of item, possibly using dataset->file crawler
+ 10.3406 33,778
+ persee.fr
+ blocks bots (don't attempt ingest)
+ 10.3205 33,540
+ "german medical science"
+ HTML articles, PDF links
+ TODO: fix ingest
+ https://www.egms.de/static/en/journals/gms/2020-18/000284.shtml
+ 10.17605 30,365
+ osf.io
+ TODO: fix ingest (?)
+ 10.25446 26,614
+ figshare / oxford
+ "File(s) not publicly available"
+ but "CC BY 4.0"? ugh
+
+TODO: HTML crawl cairn.info (10.3917)
+TODO: ignore 10.25384, 10.25446 (figshare)
+TODO: ignore arxiv.org prefix (10.48550) in default dashboard
+TODO: handle arxiv.org DOIs better (merge, count as preserved, etc)
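
The "ignore" TODOs at the end of the file above amount to adding more `!doi_prefix:` clauses to the "front page" query. A rough sketch of composing that query string: the names `BASE_FILTERS`, `EXCLUDED_DOI_PREFIXES`, and `build_query` are hypothetical, and the extended exclusion list is one reading of the TODOs, not a decided dashboard default.

```python
# Filters copied from the "front page" query in the notes.
BASE_FILTERS = [
    "is_oa:true",
    "year:>1995",
    "year:<=2021",
    "(type:article-journal OR type:article OR type:paper-conference)",
]

# The first two prefixes are already excluded in the notes (zenodo, figshare).
# The last three are assumptions based on the TODO items.
EXCLUDED_DOI_PREFIXES = ["10.5281", "10.6084", "10.25384", "10.25446", "10.48550"]


def build_query(filters, excluded_prefixes):
    """Join filter clauses and negated DOI-prefix clauses with spaces."""
    clauses = list(filters) + [f"!doi_prefix:{p}" for p in excluded_prefixes]
    return " ".join(clauses)


print(build_query(BASE_FILTERS, EXCLUDED_DOI_PREFIXES))
```

Passing only `["10.5281", "10.6084"]` reproduces the original query from the notes.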
diff --git a/notes/UNSORTED.txt b/notes/misc/UNSORTED.txt
index 850b54d0..850b54d0 100644
--- a/notes/UNSORTED.txt
+++ b/notes/misc/UNSORTED.txt
diff --git a/notes/example_entities.txt b/notes/misc/example_entities.txt
index e4016d8a..e4016d8a 100644
--- a/notes/example_entities.txt
+++ b/notes/misc/example_entities.txt
diff --git a/notes/misc/examples/content_scope.txt b/notes/misc/examples/content_scope.txt
new file mode 100644
index 00000000..321dd056
--- /dev/null
+++ b/notes/misc/examples/content_scope.txt
@@ -0,0 +1,45 @@
+
+sha1:fe27d2d036d478fb692be95045b72773e0dc27ac
+https://fatcat.wiki/file/4tcvwhzunrgvri4x3uruug62jq
+
+ cover page only... obtained via an ILL request?
+
+ "metadata": {
+ "author": "Emmanuel Lemoine",
+ "creator": "Okina",
+ "producer": "mPDF 6.0",
+ "title": "Chloro complexes of cobalt(II) in aprotic solvents: stability and structural modifications due to solvent effect"
+ },
+ "pdf_created": "2017-01-26T10:43:21+00:00",
+ "pdf_version": "1.4",
+ "permanent_id": "2d231660c0e26f92aad7cb2f62b5e03a",
+
+ SELECT *
+ FROM pdf_meta
+ WHERE
+ status = 'success'
+ AND page_count < 3
+ AND (metadata->>'creator')::text = 'Okina'
+ LIMIT 5;
+
+ SELECT COUNT(*)
+ FROM pdf_meta
+ WHERE
+ status = 'success'
+ AND page_count < 3
+ AND (metadata->>'creator')::text = 'Okina'
+ ;
+ # 4235
+
+ TODO: 'COPY TO'...
+
+ SELECT pdf_meta.sha1hex
+ FROM pdf_meta
+ LEFT JOIN fatcat_file ON pdf_meta.sha1hex = fatcat_file.sha1hex
+ WHERE
+ status = 'success'
+ AND page_count < 3
+ AND (metadata->>'creator')::text = 'Okina'
+ AND (metadata->>'producer')::text LIKE 'mPDF%'
+ AND fatcat_file.ident IS NOT NULL
+ ;
diff --git a/notes/misc/examples/grobid_500.txt b/notes/misc/examples/grobid_500.txt
new file mode 100644
index 00000000..5e64c781
--- /dev/null
+++ b/notes/misc/examples/grobid_500.txt
@@ -0,0 +1,4 @@
+
+seems like a legit/fine PDF file:
+https://fatcat.wiki/file/nrydu6nutvedximcb4lpdsrp6u
+
diff --git a/notes/misc/examples/personal_favorites.md b/notes/misc/examples/personal_favorites.md
new file mode 100644
index 00000000..2ecee2d8
--- /dev/null
+++ b/notes/misc/examples/personal_favorites.md
@@ -0,0 +1,2 @@
+
+International Journal of Crashworthiness
diff --git a/notes/misc/examples/random_journals.txt b/notes/misc/examples/random_journals.txt
new file mode 100644
index 00000000..f5cb0e69
--- /dev/null
+++ b/notes/misc/examples/random_journals.txt
@@ -0,0 +1,5 @@
+
+"Rejecta Mathematica"
+only published articles which failed peer review.
+no longer online, but may be in wayback
+https://en.wikipedia.org/wiki/Rejecta_Mathematica
diff --git a/notes/misc/examples/random_works.txt b/notes/misc/examples/random_works.txt
new file mode 100644
index 00000000..3f5bb7e3
--- /dev/null
+++ b/notes/misc/examples/random_works.txt
@@ -0,0 +1,9 @@
+
+"The limitations of using languages for description", Marvin Minsky
+http://web.mit.edu/dxh/www/1970_Marvin_Lecture_Transcript_Italy_Limitations_Language.pdf
+
+"A Supercut of Supercuts: Aesthetics, Histories, Databases"
+https://vimeo.com/440746435
+https://www.openscreensjournal.com/article/id/6946/
+
+Dummy article in springer (paywalled!): https://doi.org/10.1007/s10096-005-0027-5
diff --git a/notes/misc/examples/video_works.txt b/notes/misc/examples/video_works.txt
new file mode 100644
index 00000000..6c0a450f
--- /dev/null
+++ b/notes/misc/examples/video_works.txt
@@ -0,0 +1,4 @@
+
+https://doi.org/10.24350/cirm.v.19933803
+ "Imaging with nonlinear and fractionally damped waves"
+ https://library.cirm-math.fr/Record.htm?record=19280247124910084299&confirm=on
diff --git a/notes/misc/horror_stories.md b/notes/misc/horror_stories.md
new file mode 100644
index 00000000..eaac48e7
--- /dev/null
+++ b/notes/misc/horror_stories.md
@@ -0,0 +1,10 @@
+
+Two different DOIs for the same work, from different publishers:
+
+ Intravenous Administration of Human γ-Globulin
+ S. Barandun, P. Kistler, F. Jeunet, H. Isliker
+ 1962, Vox Sanguinis
+
+ https://fatcat.wiki/release/search?q=%22Intravenous+administration+of+human+%CE%B3-globulin%22&generic=1
+ 10.1111/j.1423-0410.1962.tb03240.x
+ 10.1159/000464763
diff --git a/notes/rust_libraries.txt b/notes/misc/rust_libraries.txt
index d5c8c18a..d5c8c18a 100644
--- a/notes/rust_libraries.txt
+++ b/notes/misc/rust_libraries.txt
diff --git a/notes/test_works.txt b/notes/misc/test_works.txt
index 59b01701..59b01701 100644
--- a/notes/test_works.txt
+++ b/notes/misc/test_works.txt
diff --git a/notes/misc/thesis_uk.md b/notes/misc/thesis_uk.md
new file mode 100644
index 00000000..cbcca6d5
--- /dev/null
+++ b/notes/misc/thesis_uk.md
@@ -0,0 +1,6 @@
+
+metadata for a large number of doctoral theses, from EThOS
+https://bl.iro.bl.uk/concern/datasets/c815b271-09be-4123-8156-405094429198
+
+will get via OAI-PMH, presumably. but, requires login for actual download?
+sigh.
diff --git a/notes/misc/unsorted.txt b/notes/misc/unsorted.txt
new file mode 100644
index 00000000..17ff839c
--- /dev/null
+++ b/notes/misc/unsorted.txt
@@ -0,0 +1,19 @@
+
+fatcat misc:
+- opencitations: https://arxiv.org/abs/1906.11964
+- https://pub.uni-bielefeld.de/record/2934907
+- re-read: scratch/issn/web_archiving.md
+- should expansion of 'wip' entities be allowed?
+- could now just not show 'wip' entities (unless part of editgroup)
+- release_ref | 19904400 | Missing Index? | 4141039616 | 81833687 | 61929287
+- privacy/security issue with libmacaroon logging failed caveat verification
+- blank box on editgroup pages when not logged in
+- don't have "Editable catalog of bibliographic and fulltext file metadata" be the thing in snippets?
+- web: '|dictsort' in a bunch of places (for stability)
+- example HTML paper: https://andrewgyork.github.io/rescan_line_sted/
+- pubmed importer should include section in ALLCAPS: for multi-part abstracts
+- https://github.com/rholder/retrying
+- feature: push-button "update metadata from crossref"
+- demo ORCID: 0000-0002-1825-0097
+- link: https://www.jstor.org/dfr/about/technical-specifications
+- after indexing, optimise the Elasticsearch index by merging into a single segment: curl -XPOST 'http://localhost:9200/scholar/_forcemerge?max_num_segments=1'
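
The `curl` command in the last note above could also be issued from Python using only the standard library. A minimal sketch, assuming the same host and the `scholar` index from the note; the helper name `build_forcemerge_request` is hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_forcemerge_request(base_url, index, max_num_segments=1):
    """Build (but do not send) a POST request to the _forcemerge endpoint."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge?{query}"
    return urllib.request.Request(url, method="POST")


req = build_forcemerge_request("http://localhost:9200", "scholar")
print(req.get_method(), req.full_url)
# To actually run the merge against a live cluster:
#   urllib.request.urlopen(req)
```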
diff --git a/notes/misc/webface_iteration.md b/notes/misc/webface_iteration.md
new file mode 100644
index 00000000..a7f11d15
--- /dev/null
+++ b/notes/misc/webface_iteration.md
@@ -0,0 +1,14 @@
+
+## Design Examples
+
+metamath
+
+- example: <https://wapm.io/package/liftm/metamath>
+- somewhat similar to existing fatcat release layout
+- tabs are better? tabs scroll left/right on mobile
+- CSS/etc is heavy, though design is simple
+
+lib.rs
+
+sourcehut
+
diff --git a/notes/webface_notes.txt b/notes/misc/webface_notes.txt
index 37a56c5c..37a56c5c 100644
--- a/notes/webface_notes.txt
+++ b/notes/misc/webface_notes.txt