From dca72aa11d24cbe8272c86d221a400c9859fb7e3 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 4 Jan 2023 20:00:26 -0800
Subject: notes: commit a whole bunch of random notes and files

---
 notes/UNSORTED.txt                        |  40 ------
 notes/example_entities.txt                |  58 ---------
 notes/misc/2020-08_metadata.md            |  79 ++++++++++++
 notes/misc/2020_ingest_ideas.md           |  14 +++
 notes/misc/2022-04_missing_oa.md          | 202 ++++++++++++++++++++++++++++++
 notes/misc/UNSORTED.txt                   |  40 ++++++
 notes/misc/example_entities.txt           |  58 +++++++++
 notes/misc/examples/content_scope.txt     |  45 +++++++
 notes/misc/examples/grobid_500.txt        |   4 +
 notes/misc/examples/personal_favorites.md |   2 +
 notes/misc/examples/random_journals.txt   |   5 +
 notes/misc/examples/random_works.txt      |   9 ++
 notes/misc/examples/video_works.txt       |   4 +
 notes/misc/horror_stories.md              |  10 ++
 notes/misc/rust_libraries.txt             |  41 ++++++
 notes/misc/test_works.txt                 |  77 ++++++++++++
 notes/misc/thesis_uk.md                   |   6 +
 notes/misc/unsorted.txt                   |  19 +++
 notes/misc/webface_iteration.md           |  14 +++
 notes/misc/webface_notes.txt              |  62 +++++++++
 notes/rust_libraries.txt                  |  41 ------
 notes/test_works.txt                      |  77 ------------
 notes/webface_notes.txt                   |  62 ---------
 23 files changed, 691 insertions(+), 278 deletions(-)
 delete mode 100644 notes/UNSORTED.txt
 delete mode 100644 notes/example_entities.txt
 create mode 100644 notes/misc/2020-08_metadata.md
 create mode 100644 notes/misc/2020_ingest_ideas.md
 create mode 100644 notes/misc/2022-04_missing_oa.md
 create mode 100644 notes/misc/UNSORTED.txt
 create mode 100644 notes/misc/example_entities.txt
 create mode 100644 notes/misc/examples/content_scope.txt
 create mode 100644 notes/misc/examples/grobid_500.txt
 create mode 100644 notes/misc/examples/personal_favorites.md
 create mode 100644 notes/misc/examples/random_journals.txt
 create mode 100644 notes/misc/examples/random_works.txt
 create mode 100644 notes/misc/examples/video_works.txt
 create mode 100644 notes/misc/horror_stories.md
 create mode 100644 notes/misc/rust_libraries.txt
 create mode 100644 notes/misc/test_works.txt
 create mode 100644 notes/misc/thesis_uk.md
 create mode 100644 notes/misc/unsorted.txt
 create mode 100644 notes/misc/webface_iteration.md
 create mode 100644 notes/misc/webface_notes.txt
 delete mode 100644 notes/rust_libraries.txt
 delete mode 100644 notes/test_works.txt
 delete mode 100644 notes/webface_notes.txt

diff --git a/notes/UNSORTED.txt b/notes/UNSORTED.txt
deleted file mode 100644
index 850b54d0..00000000
--- a/notes/UNSORTED.txt
+++ /dev/null
@@ -1,40 +0,0 @@
-
-Not allowed to PUT edits to the same entity in the same editgroup. If you want
-to update an edit, need to delete the old one first.
-
-The state depends only on the current entity state, not any redirect. This
-means that if the target of a redirect is deleted, the redirecting entity is
-still "redirect", not "deleted".
-
-Redirects-to-redirects are not allowed; this is enforced when the editgroup is
-accepted, to prevent race conditions.
-
-Redirects to "work-in-progress" (WIP) rows are disallowed at update time (and
-not re-checked at accept time).
-
-"ident table" parameters are ignored for entity updates. This is so clients can
-simply re-use object instantiations.
-
-The "state" parameter of an entity body is used as a flag when deciding whether
-to do non-normal updates (eg, redirect or undelete, as opposed to inserting a
-new revision).
-
-In the API, if you, eg, expand=files on a redirected release, you will get
-files that point to the *target* release entity. If you use the /files endpoint
-(instead of expand), you will get the files pointing to the redirected entity
-(which probably need updating!). Also, if you expand=files on the target
-entity, you *won't* get the files pointing to the redirected release. A
-high-level merge process might make these changes at the same time? Or at least
-tag at edit review time. A sweeper task can look for and auto-correct such
-redirects after some delay period.
- -=> it would not be too hard to update get_release_files to check for such - redirects; could be handled by request flag? - -`prev_rev` is naively set to the most-recent previous state. If the current -state was deleted or a redirect, it is set to null. - -This parameter is not checked/enforced at edit accept time (but could be, and -maybe introduce `prev_redirect`, for race detection). Or, could have ident -point to most-recent edit, and have edits point to prev, for firmer control. - diff --git a/notes/example_entities.txt b/notes/example_entities.txt deleted file mode 100644 index e4016d8a..00000000 --- a/notes/example_entities.txt +++ /dev/null @@ -1,58 +0,0 @@ - -errata/update: - Fourth Test of General Relativity: Preliminary Results - 10.1103/physrevlett.20.1265 - 10.1103/physrevlett.21.266.3 - - same title; later is errata to the first. - very minor: The term "baud length" was consistently misprinted as "band length." - -DOIs for individual images - https://commons.wikimedia.org/wiki/Category:Media_from_Williams_et_al._2010_-_10.1371/journal.pone.0010676 - -long-tail journal not in fatcat; web-native, tricky to crawl - https://angryoldmanmagazine.com/ - -dataset - "ISSN-Matching of Gold OA Journals (ISSN-GOLD-OA) 2.0" - https://pub.uni-bielefeld.de/data/2913654 - 2 files - has DOI: 10.4119/unibi/2913654 - -release group; single PDF is valid copy of two DOIs: - https://fatcat.wiki/file/wr64e37yvfcidgbowtslx7omne - 10.5167/uzh-146424 - 10.1016/j.physletb.2017.12.006 - ALSO: has CC-BY license_slug - -bad MAG match: - - https://fatcat.wiki/release/b65rjfixxbh4zjd3zxcjdz2b6e - https://academic.microsoft.com/paper/2535407850 - MAG has wrong metadata? 
have not corrected in fatcat - - -## Long-Tail Content - -humanities journal; content in SIM and Proquest, no Keepers, no DOIs: - - Clio: A Journal of Literature, History, and the Philosophy of History - https://fatcat.wiki/container/bsn7fpeyx5ep7eyjgxxd5oygsa - -### Examples from Twitter - -Thread from 2021: - -- Granta Magazine -- Punk Planet (in IA?) -- Black Clock (https://en.wikipedia.org/wiki/Black_Clock) -- Le Grand Jeu -- ILK Journal (in wayback: http://web.archive.org/web/20160331182524/http://ilkjournal.com/journal/issue-fourteen/roberto-montes/) - - -### Vanished Content - -"Abril" -https://fatcat.wiki/container/stdnbbwbpzflzhp2syctupqtc4 - in DOAJ - broken DOIs, but new website does exist? diff --git a/notes/misc/2020-08_metadata.md b/notes/misc/2020-08_metadata.md new file mode 100644 index 00000000..12cd6fb0 --- /dev/null +++ b/notes/misc/2020-08_metadata.md @@ -0,0 +1,79 @@ + +## Artificial Containers + + biorxiv + medrxiv + doi_prefix:10.1101 + publisher:"Cold Spring Harbor Laboratory" + -> article-journal? article? should match "paper" filter + -> status: draft? submitted? + -> there is some flag in crossref metadata... + + arxiv + -> article-journal? + -> set container_name? + + protocols.io + doi_prefix:10.17504/protocols.io. + container_name:protocols.io + + 10.25384/sage. -> sage.figshare.com + -> at least set container_name + + figshare + doi_prefix:10.6084 + -> at least set container_name + + zenodo + -> at least set container_name + +Maybe? Later? + + PsycEXTRA + container_name:"PsycEXTRA Dataset" + doi_prefix:10.1037 + crossref + => 300k+ releases + => subtitle is 'number' (like "(577982012-038)") + => dataset + => publication status unknown + + f1000 reviews + container_name:"F1000 - Post-publication peer review of the biomedical literature" + title:"Faculty of 1000 evaluation for "[...] + doi_prefix:10.3410/ + crossref + => 222k releases + => type -> peer-review (?) + + IUPAC Standards Online + + GBIF + doi_prefix: 10.15468/dl. 
+ => 838k releases + +================== + + +later fatcat: +- pmid+crossref pre-prints + https://fatcat.wiki/release/d4lrxugtqbapxgi4jrrlmzjily +- zenodo: handle "repost from another ISSN" case (drop issn/container_id) +- doi_prefix:10.18720 no container metadata; should be thesis type? +- research square (10.21203) metadata (journal articles, pre-print or published?) +- journals.ub.uni-heidelberg.de metadata is poor? no journal link +- try_work_lookup() -> part of try update? + => zenodo "isidentical" + => zenodo "isversionof" + => figshare "isversionof" + => later, try_work_fuzzy() +- biorxiv, medrxiv container name (and/or `container_id`?) + => and "article" not "post" +- datacite container:"microPublication Biology" -> micropub type? +- ES container index: `publisher_type` (?) +- arxiv: remove release_type="report" logic +- arxiv: don't include DOI, just merge under work +- datacite release_type: resourceType=SaComponent -> 'component' + https://api.datacite.org/dois/10.1371/journal.pbio.0020429.g004 +- datacite title `{:unav}` (PLOS) + https://fatcat.wiki/release/search?q=doi_prefix%3A10.1371+unav diff --git a/notes/misc/2020_ingest_ideas.md b/notes/misc/2020_ingest_ideas.md new file mode 100644 index 00000000..fc5ea807 --- /dev/null +++ b/notes/misc/2020_ingest_ideas.md @@ -0,0 +1,14 @@ + +https://philpapers.org/ +=> 2.4m entries + +https://philarchive.org/ +=> 50k OA papers +=> OAI-PMH + +https://isidore.science/ +=> humanities in french, english, spanish +=> APIs + +http://ascl.net/ +=> Astrophysics Source Code Library diff --git a/notes/misc/2022-04_missing_oa.md b/notes/misc/2022-04_missing_oa.md new file mode 100644 index 00000000..9a5541b9 --- /dev/null +++ b/notes/misc/2022-04_missing_oa.md @@ -0,0 +1,202 @@ + +Short data exploration of what OA content is missing, and how it might be crawled. 
+ +Starting with "front page" query: + + is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084 + + doi_prefix:10.6084 is figshare + doi_prefix:10.5281 is zenodo + + 14,658,673 66.56% preserved and publicly accessible (bright) + 3,453,052 15.68% preserved but not publicly accessible (dark) + 3,911,614 17.77% no known independent preservation + 22,023,339 100% total + +Virtually all of the "dark" is also `in_shadows:true`. So the +`preservation:none` is the high-impact target for crawling. + +Limiting to `publisher_type:big5`, almost zero `preservation:none`, and 1.34 +million (41%) dark. + +## Publisher Type + +Created a kibana graph of the above filters, graphing `publisher_type` ("Publisher Type breakdown of missing OA)": + + 1769k 54% + longtail 852k 26% + society 195k 6% + unipress 130k 4% + scielo 114k 3.5% + then: repository, oa, commercial, big5 + +## Containers + + !container_id:* preservation:none is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084 + + 1,993,639 missing preservation + +These are virtually all Datacite DOIs (not including figshare/zenodo), and +start in 2008, ramping up. They are almost all missing `publisher_type` (which +makes sense because they have no container). + +With the filters from above, here are some top containers missing content: + + Missing 1,993,639 + e27twid5qnbqbboxlkrja2xz2a 12,537 + "Proceedings of Indian National Science Academy" + almost zero preservation. DOAJ website is 404 for article (!), no longer in DOAJ (!) + some kind of bad metadata situation? almost all from 2015 + fmoqnzpewvfrnm2ni4mbvvlney 9,350 + "Chinese Medical Journal" + PMIDs only + missing/unpreserved is pre-2015 (significant!) 
+ 7l5xye7sc5emxfprwmqw2a7yxq 8,999 + "Tidsskrift for Den norske legeforening" (norwegian medical) + bunch of PMIDs only; sporadic preservation coverage + ujftxdg3knebxhrqg4qjznz2he 5,903 + "International Research Journal" (russian) + these are by-issue, with DOIs redirecting to pages inside issue (!) + kfzef6kfwbhpnfw3cifit7zw7q 5,678 + "lectures" + hosted on openeditions + HTML ingest would work (!) + gr4g5qzzcnembf4om6yjb6qf34 5,020 + "计算机科学" + mostly via dblp. some DOIs, presumably chinese? + bl77onlbbbhu5d6ohpjw2ypojy 4,994 + "EOS" (from American Geophysical Union / AGU) + large publication, mostly preserved (dark) + mix of wiley.com OA (but hard to crawl?) and web/HTML stuff + 3afvqhtpnjd5nmiphwxlxzirde 4,877 + "Medical Science Monitor" + large publication, mixed preservation + annoying PDF link situation (hard to crawl?) + tulajqojzjabfc4iybyv6poi2e 4,786 + "Dermatology Online Journal" + large publication, mixed preservation + some just pmid + some HTML or ePub-only + escholarship.org + +A take-away here for me is that containers are pretty heterogenous and have +diverse issues. + +TODO: ingest things like: https://escholarship.org/uc/item/02v86610 + from container_tulajqojzjabfc4iybyv6poi2e + +### revues.org / openedition + +Many of these seem like they would ingest fine via HTML. + + doi_prefix:10.4000 + + 151,565 34.3% preserved and publicly accessible (bright) + 7,211 1.64% preserved but not publicly accessible (dark) + 283,139 64.08% no known independent preservation + 441,915 100% total + + article-journal 230,146 63% preserved + chapter 200,724 2% preserved + book 10,971 12% preserved + paper-conference 74 + +Chapters and books don't seem as amenable to ingest... and indeed are mostly +not marked `is_oa:true`. 
+ +DONE: bulk html-mode ingest, expecting about 80k requests: + + doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true + + ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-bulk \ + --ingest-type html \ + query "doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true" + => Expecting 80032 release objects in search queries + => Counter({'ingest_request': 80032, 'elasticsearch_release': 80032, 'estimate': 80032, 'kafka': 80032}) + +NOTE: have this be the default ingest type for this DOI prefix? not sure, some +do come through as PDF just fine + +## Source of Records + +Starting with the 3,844,142 or so `preservation:none`. + + doi 3.204m + datacite 1.995m + crossref 1.087m + 109k + jalc 12k + doaj_id 553k + pmid 192k + dblp_id 29k + arxiv_id, pmcid 0 + +I'm surprised how good dblp coverage is? Oh, but those are almost entirely +missing OA status, that explains it. + + # NOTE: not specifically OA + dblp_id:* year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) + + 406,235 22.54% preserved and publicly accessible (bright) + 59,009 3.28% preserved but not publicly accessible (dark) + 1,337,554 74.2% no known independent preservation + 1,802,798 100% total + +Looks like doi and DOAJ are big sources. + + # NOTE: DOAJ implies OA, I checked and numbers are ~same + doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) + + 588,364 47.27% preserved and publicly accessible (bright) + 103,206 8.3% preserved but not publicly accessible (dark) + 553,353 44.45% no known independent preservation + 1,244,923 100% total + +DOAJ ingest seems important to optimize! 
+ + !publisher_type:big5 container_id:* doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) + => 548,709 missing preservation + + doaj_id:* + => 589,915 missing preservation + +Datacite the biggest category though, even with zenodo/figshare removed. + +TODO: largest datacite DOI prefixes +TODO: check sandcrawler DB to see DOAJ ingest status; maybe these are entirely missing URLs? or just not crawling well? +TODO: dig in to "longtail" more... some random ones? + +## Largest DOI Prefixes + + 640,104 + 10.48550 1,543,167 + the new arxiv.org prefix + 10.4000 68,267 + revues / openedition (handled above) + 10.25384 60,063 + figshare / SAGE + 10.3917 52,195 + cairn.info + 10.25673 41,565 + some random IR? opendata.uni-halle.de + TODO: ingest this type of item, possibly using dataset->file crawler + 10.3406 33,778 + persee.fr + blocks bots (don't attempt ingest) + 10.3205 33,540 + "german medical science" + HTML articles, PDF links + TODO: fix ingest + https://www.egms.de/static/en/journals/gms/2020-18/000284.shtml + 10.17605 30,365 + osf.io + TODO: fix ingest (?) + 10.25446 26,614 + figshare / oxford + "File(s) not publicly available" + but "CC BY 4.0"? ugh + +TODO: HTML crawl cairn.info (10.3917) +TODO: ignore 10.25384, 10.25446 (figshare) +TODO: ignore arixv.org prefix (10.48550) in default dashboard +TODO: handle arxiv.org DOIs better (merge, count as preserved, etc) diff --git a/notes/misc/UNSORTED.txt b/notes/misc/UNSORTED.txt new file mode 100644 index 00000000..850b54d0 --- /dev/null +++ b/notes/misc/UNSORTED.txt @@ -0,0 +1,40 @@ + +Not allowed to PUT edits to the same entity in the same editgroup. If you want +to update an edit, need to delete the old one first. + +The state depends only on the current entity state, not any redirect. This +means that if the target of a redirect is deleted, the redirecting entity is +still "redirect", not "deleted". 
+ +Redirects-to-redirects are not allowed; this is enforced when the editgroup is +accepted, to prevent race conditions. + +Redirects to "work-in-progress" (WIP) rows are disallowed at update time (and +not re-checked at accept time). + +"ident table" parameters are ignored for entity updates. This is so clients can +simply re-use object instantiations. + +The "state" parameter of an entity body is used as a flag when deciding whether +to do non-normal updates (eg, redirect or undelete, as opposed to inserting a +new revision). + +In the API, if you, eg, expand=files on a redirected release, you will get +files that point to the *target* release entity. If you use the /files endpoint +(instead of expand), you will get the files pointing to the redirected entity +(which probably need updating!). Also, if you expand=files on the target +entity, you *won't* get the files pointing to the redirected release. A +high-level merge process might make these changes at the same time? Or at least +tag at edit review time. A sweeper task can look for and auto-correct such +redirects after some delay period. + +=> it would not be too hard to update get_release_files to check for such + redirects; could be handled by request flag? + +`prev_rev` is naively set to the most-recent previous state. If the current +state was deleted or a redirect, it is set to null. + +This parameter is not checked/enforced at edit accept time (but could be, and +maybe introduce `prev_redirect`, for race detection). Or, could have ident +point to most-recent edit, and have edits point to prev, for firmer control. + diff --git a/notes/misc/example_entities.txt b/notes/misc/example_entities.txt new file mode 100644 index 00000000..e4016d8a --- /dev/null +++ b/notes/misc/example_entities.txt @@ -0,0 +1,58 @@ + +errata/update: + Fourth Test of General Relativity: Preliminary Results + 10.1103/physrevlett.20.1265 + 10.1103/physrevlett.21.266.3 + + same title; later is errata to the first. 
+ very minor: The term "baud length" was consistently misprinted as "band length." + +DOIs for individual images + https://commons.wikimedia.org/wiki/Category:Media_from_Williams_et_al._2010_-_10.1371/journal.pone.0010676 + +long-tail journal not in fatcat; web-native, tricky to crawl + https://angryoldmanmagazine.com/ + +dataset + "ISSN-Matching of Gold OA Journals (ISSN-GOLD-OA) 2.0" + https://pub.uni-bielefeld.de/data/2913654 + 2 files + has DOI: 10.4119/unibi/2913654 + +release group; single PDF is valid copy of two DOIs: + https://fatcat.wiki/file/wr64e37yvfcidgbowtslx7omne + 10.5167/uzh-146424 + 10.1016/j.physletb.2017.12.006 + ALSO: has CC-BY license_slug + +bad MAG match: + + https://fatcat.wiki/release/b65rjfixxbh4zjd3zxcjdz2b6e + https://academic.microsoft.com/paper/2535407850 + MAG has wrong metadata? have not corrected in fatcat + + +## Long-Tail Content + +humanities journal; content in SIM and Proquest, no Keepers, no DOIs: + + Clio: A Journal of Literature, History, and the Philosophy of History + https://fatcat.wiki/container/bsn7fpeyx5ep7eyjgxxd5oygsa + +### Examples from Twitter + +Thread from 2021: + +- Granta Magazine +- Punk Planet (in IA?) +- Black Clock (https://en.wikipedia.org/wiki/Black_Clock) +- Le Grand Jeu +- ILK Journal (in wayback: http://web.archive.org/web/20160331182524/http://ilkjournal.com/journal/issue-fourteen/roberto-montes/) + + +### Vanished Content + +"Abril" +https://fatcat.wiki/container/stdnbbwbpzflzhp2syctupqtc4 + in DOAJ + broken DOIs, but new website does exist? diff --git a/notes/misc/examples/content_scope.txt b/notes/misc/examples/content_scope.txt new file mode 100644 index 00000000..321dd056 --- /dev/null +++ b/notes/misc/examples/content_scope.txt @@ -0,0 +1,45 @@ + +sha1:fe27d2d036d478fb692be95045b72773e0dc27ac +https://fatcat.wiki/file/4tcvwhzunrgvri4x3uruug62jq + + cover page... an ILL request? via ILL request. 
+ + "metadata": { + "author": "Emmanuel Lemoine", + "creator": "Okina", + "producer": "mPDF 6.0", + "title": "Chloro complexes of cobalt(II) in aprotic solvents: stability and structural modifications due to solvent effect" + }, + "pdf_created": "2017-01-26T10:43:21+00:00", + "pdf_version": "1.4", + "permanent_id": "2d231660c0e26f92aad7cb2f62b5e03a", + + SELECT * + FROM pdf_meta + WHERE + status = 'success' + AND page_count < 3 + AND (metadata->>'creator')::text = 'Okina' + LIMIT 5; + + SELECT COUNT(*) + FROM pdf_meta + WHERE + status = 'success' + AND page_count < 3 + AND (metadata->>'creator')::text = 'Okina' + ; + # 4235 + + TODO: 'COPY TO'... + + SELECT pdf_meta.sha1hex + FROM pdf_meta + LEFT JOIN fatcat_file ON pdf_meta.sha1hex = fatcat_file.sha1hex + WHERE + status = 'success' + AND page_count < 3 + AND (metadata->>'creator')::text = 'Okina' + AND (metadata->>'publisher')::text LIKE 'mPDF%' + AND fatcat_file.ident IS NOT NULL + ; diff --git a/notes/misc/examples/grobid_500.txt b/notes/misc/examples/grobid_500.txt new file mode 100644 index 00000000..5e64c781 --- /dev/null +++ b/notes/misc/examples/grobid_500.txt @@ -0,0 +1,4 @@ + +seems like a legit/fine PDF file: +https://fatcat.wiki/file/nrydu6nutvedximcb4lpdsrp6u + diff --git a/notes/misc/examples/personal_favorites.md b/notes/misc/examples/personal_favorites.md new file mode 100644 index 00000000..2ecee2d8 --- /dev/null +++ b/notes/misc/examples/personal_favorites.md @@ -0,0 +1,2 @@ + +International Journal of Crashworthiness diff --git a/notes/misc/examples/random_journals.txt b/notes/misc/examples/random_journals.txt new file mode 100644 index 00000000..f5cb0e69 --- /dev/null +++ b/notes/misc/examples/random_journals.txt @@ -0,0 +1,5 @@ + +"Rejecta Mathematica" +only published articles which failed peer review. 
+no longer online, but may be in wayback +https://en.wikipedia.org/wiki/Rejecta_Mathematica diff --git a/notes/misc/examples/random_works.txt b/notes/misc/examples/random_works.txt new file mode 100644 index 00000000..3f5bb7e3 --- /dev/null +++ b/notes/misc/examples/random_works.txt @@ -0,0 +1,9 @@ + +"The limitations of using languages for description", Marvin Minsky +http://web.mit.edu/dxh/www/1970_Marvin_Lecture_Transcript_Italy_Limitations_Language.pdf + +"A Supercut of Supercuts: Aesthetics, Histories, Databases" +https://vimeo.com/440746435 +https://www.openscreensjournal.com/article/id/6946/ + +Dummy article in springer (paywalled!): https://doi.org/10.1007/s10096-005-0027-5 diff --git a/notes/misc/examples/video_works.txt b/notes/misc/examples/video_works.txt new file mode 100644 index 00000000..6c0a450f --- /dev/null +++ b/notes/misc/examples/video_works.txt @@ -0,0 +1,4 @@ + +https://doi.org/10.24350/cirm.v.19933803 + "Imaging with nonlinear and fractionally damped waves" + https://library.cirm-math.fr/Record.htm?record=19280247124910084299&confirm=on diff --git a/notes/misc/horror_stories.md b/notes/misc/horror_stories.md new file mode 100644 index 00000000..eaac48e7 --- /dev/null +++ b/notes/misc/horror_stories.md @@ -0,0 +1,10 @@ + +Two different DOIs for the same work, from different publishers: + + Intravenous Administration of Human γ-Globulin + S. Barandun, P. Kistler, F. Jeunet, H. Isliker + 1962, Vox Sanguinis + + https://fatcat.wiki/release/search?q=%22Intravenous+administration+of+human+%CE%B3-globulin%22&generic=1 + 10.1111/j.1423-0410.1962.tb03240.x + 10.1159/000464763 diff --git a/notes/misc/rust_libraries.txt b/notes/misc/rust_libraries.txt new file mode 100644 index 00000000..d5c8c18a --- /dev/null +++ b/notes/misc/rust_libraries.txt @@ -0,0 +1,41 @@ + +libs: +- iron_slog +- testing: keep it simple: iron-test + => if that is annoying, shiny? mockers if needed. +- sentry +- start with dotenv+clap, then config-rs? 
+- cadence (emits statsd) +- frank_jwt and JWT for (simple?) auth + +metrics: +- best would be something with a configurable back-end, like 'log' for logging, + but supporing tags/labels. the prometheus model probably makes most sense by + default (really nice to be able to grab metrics with 'curl'/browser for + individual instances), but statsd seems to be what we run in production. not + spewing out lots of UDP by default seems like a good idea. +- dipstick: has all the good features, and popular, but code quality has smells + ("a32dlkjhw"-style commit messages), and API doesn't seem very clean. Also + prometheus stuff not actually implemented +- cadence: seems stable, somewhat popular, clean API. statsd-only for now, but + has custom backends that could be hooked on to. *super* few dependencies, + nice. +- tic: many deps; doesn't seem stable or under development +- rust-prometheus: developed by pingcap (large company). has push and pull + features. medum-sized deps; has feature flags + +A nice feature of a statsd solution is that collectd is usually running +locally (on linux dev, or in production), and metrics can be sent there by +default, like journald for logging. + +Seems like a decision between cadence (statsd) and rust-prometheus. 
+ +similar: +- https://github.com/DavidBM/templic-backend +- https://github.com/alexanderbanks/rust-api +- https://mgattozzi.com/diesel-powered-rocket +- https://www.reddit.com/r/rust/comments/8j1xbs/new_to_rust_and_gitlab_ci/ +- https://crate-ci.github.io/ + +"cool tools": +- cargo-watch diff --git a/notes/misc/test_works.txt b/notes/misc/test_works.txt new file mode 100644 index 00000000..59b01701 --- /dev/null +++ b/notes/misc/test_works.txt @@ -0,0 +1,77 @@ + +http://mathsci.wikia.com/wiki/The_Haruhi_Problem + +## Found because Famous + +Many co-authors (group): + + "Precision measurement of the top-quark mass in lepton+jets final states" + https://arxiv.org/abs/1405.1756 + +"Fake" creator: Bourbaki + +"Fake" works: John Bohanon sting operations, previous scandals + +## Found in Testing Imports + +Two releases, same work (actually same release?): + + Freiheit für Nutzer, nicht für Software + 10.14361/transcript.9783839420362.366 + 10.14361/9783839428351-056 + + May also have link via crossref metadata? 
+ +Fun ellen examples: + + Just-in-time databases and the World-Wide Web + 10.1145/288627.288638 + + Two different versions of PDF found, same URL + +Actual ORCID match: + + 10.1002/cfg.158 + 0000-0002-4447-5978 + +Fulltext via CORE publisher-connector: + + 10.1186/s12889-016-2706-9 + +Fake/example DOI: 10.5555/12345678 +ORCID: 0000-0002-1825-0097 +ISSN (invalid?): 0264-3561 + +We have fulltext via long-tail; only Google also has a copy: + ON DECOMPOSITIONS OF THE IDENTITY OPERATOR INTO A LINEAR COMBINATION OF ORTHOGONAL PROJECTIONS + http://mfat.imath.kiev.ua/article/?id=543 + 2010, open access + Institute of Mathematics NAS of Ukraine + "arXiv overlay journal" + sha1=0d39d932aad191fe8ed07572d96260ee4fad26aa + +Very large authorship/reference lists: + +- 10.1038/nature.2015.17567 (not in crossref metadata) +- 10.1038/nature14474 +- 10.1534/g3.114.015966 + +DOIs same except for an extra slash: + + 10.1037/0003-066x.39.1.40 + 10.1037//0003-066x.39.1.40 + +## Missing + +"ACE: A Novel Software Platform to Ensure the Integrity [...]" + +"Periods of Twenty-five Variable Stars in the Small Magellanic Cloud" by +Leavitt, Henrietta +=> shows as a chapter, not the original paper +=> in google scholar as "Periods of 25 Variable Stars in the Small Magellanic Cloud.", as well as several other harvard.edu results + +"Browser history re:visited" +=> no DOI; conference proceeding +=> in google scholar +=> random un-published version at https://www.spinda.net/; "The copy of the + paper hosted here has been updated to reflect [...]" diff --git a/notes/misc/thesis_uk.md b/notes/misc/thesis_uk.md new file mode 100644 index 00000000..cbcca6d5 --- /dev/null +++ b/notes/misc/thesis_uk.md @@ -0,0 +1,6 @@ + +large number of doctoral thesis metadata, from EThOS +https://bl.iro.bl.uk/concern/datasets/c815b271-09be-4123-8156-405094429198 + +will get via OAI-PMH, presumably. but, requires login for actual download? +sigh. 
diff --git a/notes/misc/unsorted.txt b/notes/misc/unsorted.txt
new file mode 100644
index 00000000..17ff839c
--- /dev/null
+++ b/notes/misc/unsorted.txt
@@ -0,0 +1,19 @@
+
+fatcat misc:
+- opencitations: https://arxiv.org/abs/1906.11964
+- https://pub.uni-bielefeld.de/record/2934907
+- re-read: scratch/issn/web_archiving.md
+- should expansion of 'wip' entities be allowed?
+- could now just not show 'wip' entities (unless part of editgroup)
+- release_ref | 19904400 | Missing Index? | 4141039616 | 81833687 | 61929287
+- privacy/security issue with libmacaroon logging failed caveat verification
+- blank box on editgroup pages when not logged in
+- don't have "Editable catalog of bibliographic and fulltext file metadata" be the thing in snippets?
+- web: '|dictsort' in a bunch of places (for stability)
+- example HTML paper: https://andrewgyork.github.io/rescan_line_sted/
+- pubmed importer should include section in ALLCAPS: for multi-part abstracts
+- https://github.com/rholder/retrying
+- feature: push-button "update metadata from crossref"
+- demo ORCID: 0000-0002-1825-0097
+- link: https://www.jstor.org/dfr/about/technical-specifications
+- after indexing, optimise the Elasticsearch index by merging into a single segment: curl -XPOST 'http://localhost:9200/scholar/_forcemerge?max_num_segments=1'
diff --git a/notes/misc/webface_iteration.md b/notes/misc/webface_iteration.md
new file mode 100644
index 00000000..a7f11d15
--- /dev/null
+++ b/notes/misc/webface_iteration.md
@@ -0,0 +1,14 @@
+
+## Design Examples
+
+metamath
+
+- example:
+- somewhat similar to existing fatcat release layout
+- tabs are better?
tabs scroll left/right on mobile +- CSS/etc is heavy, though design is simple + +lib.rs + +sourcehut + diff --git a/notes/misc/webface_notes.txt b/notes/misc/webface_notes.txt new file mode 100644 index 00000000..37a56c5c --- /dev/null +++ b/notes/misc/webface_notes.txt @@ -0,0 +1,62 @@ + +# CSS/JS Libraries + +tachyons is nice for simple css-only stuff, but let's use "Semantic UI" because +it has a bunch of javascript form stuff. + + + + + +# "Add Something" Workflow + +## Add a Work + +Title +Primary Type +Primary Creators/Authors +Description (not an abstract) +Primary/Original Language +Subject/Categorization/Tags +Is a Stub (unpublished/unreleased) + +## Release Information + +Contributors +Date +Container / Part-Of +Publisher +Identifiers +Language +Type / Media +Issue / Volume / Pages / Chapter + +## Anything Else? + +Known file / copy / url +Citations (outbound) + +# Queries / Searches / Views + +Views: work, release, creator, container, publisher + +Lookup by identifier + +# Other Workflows/Editors + +Single-creator-oriented helper to find works and disambiguate authorship + +Bulk author disambiguation helper (find other unresolved authors with same +alias text and select; drag works between columns) + +Bulk query-then-edit UI: search results in a table, edit like a spreadsheet, up +to... dozens? Query and then apply delta (eg, set topic)? Eg, author edits +basic metadata for all their citations all at once. + +Release editor + +Merge containers (and all related releases) +Merge entities (works, releases, etc) +Move release between works +Split entities (works, authors, etc), including linked stuff + diff --git a/notes/rust_libraries.txt b/notes/rust_libraries.txt deleted file mode 100644 index d5c8c18a..00000000 --- a/notes/rust_libraries.txt +++ /dev/null @@ -1,41 +0,0 @@ - -libs: -- iron_slog -- testing: keep it simple: iron-test - => if that is annoying, shiny? mockers if needed. -- sentry -- start with dotenv+clap, then config-rs? 
-- cadence (emits statsd)
-- frank_jwt and JWT for (simple?) auth
-
-metrics:
-- best would be something with a configurable back-end, like 'log' for logging,
-  but supporing tags/labels. the prometheus model probably makes most sense by
-  default (really nice to be able to grab metrics with 'curl'/browser for
-  individual instances), but statsd seems to be what we run in production. not
-  spewing out lots of UDP by default seems like a good idea.
-- dipstick: has all the good features, and popular, but code quality has smells
-  ("a32dlkjhw"-style commit messages), and API doesn't seem very clean. Also
-  prometheus stuff not actually implemented
-- cadence: seems stable, somewhat popular, clean API. statsd-only for now, but
-  has custom backends that could be hooked on to. *super* few dependencies,
-  nice.
-- tic: many deps; doesn't seem stable or under development
-- rust-prometheus: developed by pingcap (large company). has push and pull
-  features. medum-sized deps; has feature flags
-
-A nice feature of a statsd solution is that collectd is usually running
-locally (on linux dev, or in production), and metrics can be sent there by
-default, like journald for logging.
-
-Seems like a decision between cadence (statsd) and rust-prometheus.
-
-similar:
-- https://github.com/DavidBM/templic-backend
-- https://github.com/alexanderbanks/rust-api
-- https://mgattozzi.com/diesel-powered-rocket
-- https://www.reddit.com/r/rust/comments/8j1xbs/new_to_rust_and_gitlab_ci/
-- https://crate-ci.github.io/
-
-"cool tools":
-- cargo-watch
diff --git a/notes/test_works.txt b/notes/test_works.txt
deleted file mode 100644
index 59b01701..00000000
--- a/notes/test_works.txt
+++ /dev/null
@@ -1,77 +0,0 @@
-
-http://mathsci.wikia.com/wiki/The_Haruhi_Problem
-
-## Found because Famous
-
-Many co-authors (group):
-
-    "Precision measurement of the top-quark mass in lepton+jets final states"
-    https://arxiv.org/abs/1405.1756
-
-"Fake" creator: Bourbaki
-
-"Fake" works: John Bohanon sting operations, previous scandals
-
-## Found in Testing Imports
-
-Two releases, same work (actually same release?):
-
-    Freiheit für Nutzer, nicht für Software
-    10.14361/transcript.9783839420362.366
-    10.14361/9783839428351-056
-
-    May also have link via crossref metadata?
-
-Fun ellen examples:
-
-    Just-in-time databases and the World-Wide Web
-    10.1145/288627.288638
-
-    Two different versions of PDF found, same URL
-
-Actual ORCID match:
-
-    10.1002/cfg.158
-    0000-0002-4447-5978
-
-Fulltext via CORE publisher-connector:
-
-    10.1186/s12889-016-2706-9
-
-Fake/example DOI: 10.5555/12345678
-ORCID: 0000-0002-1825-0097
-ISSN (invalid?): 0264-3561
-
-We have fulltext via long-tail; only Google also has a copy:
-    ON DECOMPOSITIONS OF THE IDENTITY OPERATOR INTO A LINEAR COMBINATION OF ORTHOGONAL PROJECTIONS
-    http://mfat.imath.kiev.ua/article/?id=543
-    2010, open access
-    Institute of Mathematics NAS of Ukraine
-    "arXiv overlay journal"
-    sha1=0d39d932aad191fe8ed07572d96260ee4fad26aa
-
-Very large authorship/reference lists:
-
-- 10.1038/nature.2015.17567 (not in crossref metadata)
-- 10.1038/nature14474
-- 10.1534/g3.114.015966
-
-DOIs same except for an extra slash:
-
-    10.1037/0003-066x.39.1.40
-    10.1037//0003-066x.39.1.40
-
-## Missing
-
-"ACE: A Novel Software Platform to Ensure the Integrity [...]"
-
-"Periods of Twenty-five Variable Stars in the Small Magellanic Cloud" by
-Leavitt, Henrietta
-=> shows as a chapter, not the original paper
-=> in google scholar as "Periods of 25 Variable Stars in the Small Magellanic Cloud.", as well as several other harvard.edu results
-
-"Browser history re:visited"
-=> no DOI; conference proceeding
-=> in google scholar
-=> random un-published version at https://www.spinda.net/; "The copy of the
-   paper hosted here has been updated to reflect [...]"
diff --git a/notes/webface_notes.txt b/notes/webface_notes.txt
deleted file mode 100644
index 37a56c5c..00000000
--- a/notes/webface_notes.txt
+++ /dev/null
@@ -1,62 +0,0 @@
-
-# CSS/JS Libraries
-
-tachyons is nice for simple css-only stuff, but let's use "Semantic UI" because
-it has a bunch of javascript form stuff.
-
-
-
-
-
-# "Add Something" Workflow
-
-## Add a Work
-
-Title
-Primary Type
-Primary Creators/Authors
-Description (not an abstract)
-Primary/Original Language
-Subject/Categorization/Tags
-Is a Stub (unpublished/unreleased)
-
-## Release Information
-
-Contributors
-Date
-Container / Part-Of
-Publisher
-Identifiers
-Language
-Type / Media
-Issue / Volume / Pages / Chapter
-
-## Anything Else?
-
-Known file / copy / url
-Citations (outbound)
-
-# Queries / Searches / Views
-
-Views: work, release, creator, container, publisher
-
-Lookup by identifier
-
-# Other Workflows/Editors
-
-Single-creator-oriented helper to find works and disambiguate authorship
-
-Bulk author disambiguation helper (find other unresolved authors with same
-alias text and select; drag works between columns)
-
-Bulk query-then-edit UI: search results in a table, edit like a spreadsheet, up
-to... dozens? Query and then apply delta (eg, set topic)? Eg, author edits
-basic metadata for all their citations all at once.
-
-Release editor
-
-Merge containers (and all related releases)
-Merge entities (works, releases, etc)
-Move release between works
-Split entities (works, authors, etc), including linked stuff
-
--
cgit v1.2.3