notes: commit a whole bunch of random notes and files

author: Bryan Newbold <bnewbold@robocracy.org> 2023-01-04 20:00:26 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2023-01-04 20:18:25 -0800
commit: dca72aa11d24cbe8272c86d221a400c9859fb7e3 (patch)
tree: 3e58017efb2d6ce72f9bed07269d7b6c17aa3068 /notes/misc
parent: 276ac2aa24166660bc6ffe7601cee44b5d848dae (diff)
download: fatcat-dca72aa11d24cbe8272c86d221a400c9859fb7e3.tar.gz
fatcat-dca72aa11d24cbe8272c86d221a400c9859fb7e3.zip
18 files changed, 691 insertions, 0 deletions
diff --git a/notes/misc/2020-08_metadata.md b/notes/misc/2020-08_metadata.md
new file mode 100644
index 00000000..12cd6fb0
--- /dev/null
+++ b/notes/misc/2020-08_metadata.md
@@ -0,0 +1,79 @@
+
+## Artificial Containers
+
+    biorxiv
+    medrxiv
+        doi_prefix:10.1101
+        publisher:"Cold Spring Harbor Laboratory"
+    -> article-journal? article? should match "paper" filter
+    -> status: draft? submitted?
+    -> there is some flag in crossref metadata...
+
+    arxiv
+    -> article-journal?
+    -> set container_name?
+
+    protocols.io
+        doi_prefix:10.17504/protocols.io.
+        container_name:protocols.io
+
+    10.25384/sage. -> sage.figshare.com
+    -> at least set container_name
+
+    figshare
+        doi_prefix:10.6084
+    -> at least set container_name
+
+    zenodo
+    -> at least set container_name
+
+Maybe? Later?
+
+    PsycEXTRA
+        container_name:"PsycEXTRA Dataset"
+        doi_prefix:10.1037
+        crossref
+    => 300k+ releases
+    => subtitle is 'number' (like "(577982012-038)")
+    => dataset
+    => publication status unknown
+
+    f1000 reviews
+        container_name:"F1000 - Post-publication peer review of the biomedical literature"
+        title:"Faculty of 1000 evaluation for "[...]
+        doi_prefix:10.3410/
+        crossref
+    => 222k releases
+    => type -> peer-review (?)
+
+    IUPAC Standards Online
+
+    GBIF
+        doi_prefix: 10.15468/dl.
+    => 838k releases
+
+==================
+
+
+later fatcat:
+- pmid+crossref pre-prints
+    https://fatcat.wiki/release/d4lrxugtqbapxgi4jrrlmzjily
+- zenodo: handle "repost from another ISSN" case (drop issn/container_id)
+- doi_prefix:10.18720 no container metadata; should be thesis type?
+- research square (10.21203) metadata (journal articles, pre-print or published?)
+- journals.ub.uni-heidelberg.de metadata is poor? no journal link
+- try_work_lookup() -> part of try update?
+    => zenodo "isidentical"
+    => zenodo "isversionof"
+    => figshare "isversionof"
+    => later, try_work_fuzzy()
+- biorxiv, medrxiv container name (and/or `container_id`?)
+    => and "article" not "post"
+- datacite container:"microPublication Biology" -> micropub type?
+- ES container index: `publisher_type` (?)
+- arxiv: remove release_type="report" logic
+- arxiv: don't include DOI, just merge under work
+- datacite release_type: resourceType=SaComponent -> 'component'
+    https://api.datacite.org/dois/10.1371/journal.pbio.0020429.g004
+- datacite title `{:unav}` (PLOS)
+    https://fatcat.wiki/release/search?q=doi_prefix%3A10.1371+unav
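The "at least set container_name" items in notes/misc/2020-08_metadata.md could be sketched as a DOI-prefix lookup. This is only an illustration, not the importer code: the function name, the table, and the DOIs in the usage note are hypothetical, and every name string except `protocols.io` is an assumption (the notes give the prefixes but do not settle the names). The zenodo prefix is taken from the missing-OA notes.

```python
from typing import Optional

# (DOI prefix, container name). Prefixes come from the notes; the name
# strings other than "protocols.io" are assumptions for illustration.
PREFIX_CONTAINER_NAMES = [
    ("10.17504/protocols.io.", "protocols.io"),
    ("10.25384/sage.", "sage.figshare.com"),
    ("10.6084/", "figshare"),
    ("10.5281/", "zenodo"),
]


def guess_container_name(doi: str) -> Optional[str]:
    """Return a fallback container name for a DOI, or None if no prefix matches."""
    doi = doi.strip().lower()
    for prefix, name in PREFIX_CONTAINER_NAMES:
        if doi.startswith(prefix):
            return name
    return None
```

For example, `guess_container_name("10.17504/protocols.io.abc123")` (a made-up DOI) would match the first entry. bioRxiv and medRxiv are left out on purpose: the notes list them together under one prefix (10.1101), so a prefix lookup alone cannot tell them apart.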
diff --git a/notes/misc/2020_ingest_ideas.md b/notes/misc/2020_ingest_ideas.md
new file mode 100644
index 00000000..fc5ea807
--- /dev/null
+++ b/notes/misc/2020_ingest_ideas.md
@@ -0,0 +1,14 @@
+
+https://philpapers.org/ 
+=> 2.4m entries
+
+https://philarchive.org/
+=> 50k OA papers
+=> OAI-PMH
+
+https://isidore.science/
+=> humanities in french, english, spanish
+=> APIs
+
+http://ascl.net/
+=> Astrophysics Source Code Library
diff --git a/notes/misc/2022-04_missing_oa.md b/notes/misc/2022-04_missing_oa.md
new file mode 100644
index 00000000..9a5541b9
--- /dev/null
+++ b/notes/misc/2022-04_missing_oa.md
@@ -0,0 +1,202 @@
+
+Short data exploration of what OA content is missing, and how it might be crawled.
+
+Starting with "front page" query:
+
+    is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084
+
+    doi_prefix:10.6084 is figshare
+    doi_prefix:10.5281 is zenodo
+
+    14,658,673	66.56%	preserved and publicly accessible (bright)
+    3,453,052	15.68%	preserved but not publicly accessible (dark)
+    3,911,614	17.77%	no known independent preservation
+    22,023,339	100%	total
+
+Virtually all of the "dark" is also `in_shadows:true`. So the
+`preservation:none` is the high-impact target for crawling.
+
+Limiting to `publisher_type:big5`, almost zero `preservation:none`, and 1.34
+million (41%) dark.
+
+## Publisher Type
+
+Created a kibana graph of the above filters, graphing `publisher_type` ("Publisher Type breakdown of missing OA"):
+
+    <missing>   1769k   54%
+    longtail     852k   26%
+    society      195k    6%
+    unipress     130k    4%
+    scielo       114k    3.5%
+    then: repository, oa, commercial, big5
+
+## Containers
+
+    !container_id:* preservation:none is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084
+
+    1,993,639 missing preservation
+
+These are virtually all Datacite DOIs (not including figshare/zenodo), and
+start in 2008, ramping up. They are almost all missing `publisher_type` (which
+makes sense because they have no container).
+
+With the filters from above, here are some top containers missing content:
+
+    Missing	                    1,993,639
+    e27twid5qnbqbboxlkrja2xz2a	12,537
+        "Proceedings of Indian National Science Academy"
+        almost zero preservation. DOAJ website is 404 for article (!), no longer in DOAJ (!)
+        some kind of bad metadata situation? almost all from 2015
+    fmoqnzpewvfrnm2ni4mbvvlney	9,350
+        "Chinese Medical Journal"
+        PMIDs only
+        missing/unpreserved is pre-2015 (significant!)
+    7l5xye7sc5emxfprwmqw2a7yxq	8,999
+        "Tidsskrift for Den norske legeforening" (norwegian medical)
+        bunch of PMIDs only; sporadic preservation coverage
+    ujftxdg3knebxhrqg4qjznz2he	5,903
+        "International Research Journal" (russian)
+        these are by-issue, with DOIs redirecting to pages inside issue (!)
+    kfzef6kfwbhpnfw3cifit7zw7q	5,678
+        "lectures"
+        hosted on openeditions
+        HTML ingest would work (!)
+    gr4g5qzzcnembf4om6yjb6qf34	5,020
+        "计算机科学"
+        mostly via dblp. some DOIs, presumably chinese?
+    bl77onlbbbhu5d6ohpjw2ypojy	4,994
+        "EOS" (from American Geophysical Union / AGU)
+        large publication, mostly preserved (dark)
+        mix of wiley.com OA (but hard to crawl?) and web/HTML stuff
+    3afvqhtpnjd5nmiphwxlxzirde	4,877
+        "Medical Science Monitor"
+        large publication, mixed preservation
+        annoying PDF link situation (hard to crawl?)
+    tulajqojzjabfc4iybyv6poi2e	4,786
+        "Dermatology Online Journal"
+        large publication, mixed preservation
+        some just pmid
+        some HTML or ePub-only
+        escholarship.org
+
+A take-away here for me is that containers are pretty heterogeneous and have
+diverse issues.
+
+TODO: ingest things like: https://escholarship.org/uc/item/02v86610
+    from container_tulajqojzjabfc4iybyv6poi2e
+
+### revues.org / openedition
+
+Many of these seem like they would ingest fine via HTML.
+
+    doi_prefix:10.4000
+
+    151,565	34.3%	preserved and publicly accessible (bright)
+      7,211	1.64%	preserved but not publicly accessible (dark)
+    283,139	64.08%	no known independent preservation
+    441,915	100%	total
+
+    article-journal	230,146	    63% preserved
+    chapter	        200,724	     2% preserved
+    book	        10,971	    12% preserved
+    paper-conference	74
+
+Chapters and books don't seem as amenable to ingest... and indeed are mostly
+not marked `is_oa:true`.
+
+DONE: bulk html-mode ingest, expecting about 80k requests:
+
+    doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true
+
+    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-bulk \
+        --ingest-type html \
+        query "doi_prefix:10.4000 in_ia:false type:article-journal is_oa:true"
+    => Expecting 80032 release objects in search queries
+    => Counter({'ingest_request': 80032, 'elasticsearch_release': 80032, 'estimate': 80032, 'kafka': 80032})
+
+NOTE: have this be the default ingest type for this DOI prefix? not sure, some
+do come through as PDF just fine
+
+## Source of Records
+
+Starting with the 3,844,142 or so `preservation:none`.
+
+    doi                 3.204m
+        datacite            1.995m
+        crossref            1.087m
+        <unknown>           109k
+        jalc                12k
+    doaj_id             553k
+    pmid                192k
+    dblp_id             29k
+    arxiv_id, pmcid     0
+
+I'm surprised how good dblp coverage is? Oh, but those are almost entirely
+missing OA status, that explains it.
+
+    # NOTE: not specifically OA
+    dblp_id:* year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+
+    406,235	    22.54%	preserved and publicly accessible (bright)
+    59,009	    3.28%	preserved but not publicly accessible (dark)
+    1,337,554	74.2%	no known independent preservation
+    1,802,798	100%	total
+
+Looks like doi and DOAJ are big sources.
+
+    # NOTE: DOAJ implies OA, I checked and numbers are ~same
+    doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+
+    588,364	    47.27%	preserved and publicly accessible (bright)
+    103,206	    8.3%	preserved but not publicly accessible (dark)
+    553,353	    44.45%	no known independent preservation
+    1,244,923	100%	total
+
+DOAJ ingest seems important to optimize!
+
+    !publisher_type:big5 container_id:* doaj_id:* is_oa:true year:>1995 year:<=2021 (type:article-journal OR type:article OR type:paper-conference)
+    => 548,709 missing preservation
+
+    doaj_id:*
+    => 589,915 missing preservation
+
+Datacite is the biggest category though, even with zenodo/figshare removed.
+
+TODO: largest datacite DOI prefixes
+TODO: check sandcrawler DB to see DOAJ ingest status; maybe these are entirely missing URLs? or just not crawling well?
+TODO: dig in to "longtail" more... some random ones?
+
+## Largest DOI Prefixes
+
+    <missing>	640,104
+    10.48550 	1,543,167
+        the new arxiv.org prefix
+    10.4000	68,267
+        revues / openedition (handled above)
+    10.25384	60,063
+        figshare / SAGE
+    10.3917	52,195
+        cairn.info
+    10.25673	41,565
+        some random IR? opendata.uni-halle.de
+        TODO: ingest this type of item, possibly using dataset->file crawler
+    10.3406	33,778
+        persee.fr
+        blocks bots (don't attempt ingest)
+    10.3205	33,540
+        "german medical science"
+        HTML articles, PDF links
+        TODO: fix ingest
+        https://www.egms.de/static/en/journals/gms/2020-18/000284.shtml
+    10.17605	30,365
+        osf.io
+        TODO: fix ingest (?)
+    10.25446	26,614
+        figshare / oxford
+        "File(s) not publicly available"
+        but "CC BY 4.0"? ugh
+
+TODO: HTML crawl cairn.info (10.3917)
+TODO: ignore 10.25384, 10.25446 (figshare)
+TODO: ignore arxiv.org prefix (10.48550) in default dashboard
+TODO: handle arxiv.org DOIs better (merge, count as preserved, etc)
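The preservation breakdowns in notes/misc/2022-04_missing_oa.md are shares of a total count. A minimal sketch of that arithmetic follows, using the three counts from the "front page" query; the short dictionary keys and the function name are made up for illustration, and rounding may differ slightly from what the search UI shows.

```python
# Counts copied from the "front page" query breakdown in the notes.
counts = {
    "bright": 14_658_673,  # preserved and publicly accessible
    "dark": 3_453_052,     # preserved but not publicly accessible
    "none": 3_911_614,     # no known independent preservation
}


def preservation_shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


shares = preservation_shares(counts)
for label, pct in shares.items():
    # Print count, percentage, and label, similar to the tables in the notes.
    print(f"{counts[label]:>12,}\t{pct:.2f}%\t{label}")
```

The same function works for the per-prefix and per-source tables if their counts are substituted.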
diff --git a/notes/misc/UNSORTED.txt b/notes/misc/UNSORTED.txt
new file mode 100644
index 00000000..850b54d0
--- /dev/null
+++ b/notes/misc/UNSORTED.txt
@@ -0,0 +1,40 @@
+
+Not allowed to PUT edits to the same entity in the same editgroup. If you want
+to update an edit, need to delete the old one first.
+
+The state depends only on the current entity state, not any redirect. This
+means that if the target of a redirect is deleted, the redirecting entity is
+still "redirect", not "deleted".
+
+Redirects-to-redirects are not allowed; this is enforced when the editgroup is
+accepted, to prevent race conditions.
+
+Redirects to "work-in-progress" (WIP) rows are disallowed at update time (and
+not re-checked at accept time).
+
+"ident table" parameters are ignored for entity updates. This is so clients can
+simply re-use object instantiations.
+
+The "state" parameter of an entity body is used as a flag when deciding whether
+to do non-normal updates (eg, redirect or undelete, as opposed to inserting a
+new revision).
+
+In the API, if you, eg, expand=files on a redirected release, you will get
+files that point to the *target* release entity. If you use the /files endpoint
+(instead of expand), you will get the files pointing to the redirected entity
+(which probably need updating!). Also, if you expand=files on the target
+entity, you *won't* get the files pointing to the redirected release. A
+high-level merge process might make these changes at the same time? Or at least
+tag at edit review time. A sweeper task can look for and auto-correct such
+redirects after some delay period.
+
+=> it would not be too hard to update get_release_files to check for such
+   redirects; could be handled by request flag?
+
+`prev_rev` is naively set to the most-recent previous state. If the current
+state was deleted or a redirect, it is set to null.
+
+This parameter is not checked/enforced at edit accept time (but could be, and
+maybe introduce `prev_redirect`, for race detection). Or, could have ident
+point to most-recent edit, and have edits point to prev, for firmer control.
+
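The redirect and `prev_rev` rules in notes/misc/UNSORTED.txt can be sketched as a small in-memory model. This is not the actual API server code: the `Ident` class, the function names, and the "active" state name are assumptions for illustration; only the rules themselves come from the notes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Ident:
    """Hypothetical stand-in for one entity ident row."""
    state: str                         # "wip", "active" (assumed name), "redirect", "deleted"
    rev_id: Optional[str] = None
    redirect_id: Optional[str] = None


def check_redirect_at_update(idents: Dict[str, Ident], target_id: str) -> None:
    """Update-time check: redirects to work-in-progress rows are disallowed."""
    if idents[target_id].state == "wip":
        raise ValueError("cannot redirect to a WIP entity")


def check_redirect_at_accept(idents: Dict[str, Ident], target_id: str) -> None:
    """Accept-time check: redirects-to-redirects are disallowed."""
    if idents[target_id].state == "redirect":
        raise ValueError("cannot redirect to a redirect")


def prev_rev_for_edit(current: Ident) -> Optional[str]:
    """prev_rev is the current revision, or None if the entity is deleted or a redirect."""
    if current.state in ("deleted", "redirect"):
        return None
    return current.rev_id


def entity_state(idents: Dict[str, Ident], ident_id: str) -> str:
    """State depends only on the entity's own row, never on its redirect target."""
    return idents[ident_id].state
```

Splitting the two checks mirrors the notes: the WIP check runs once at update time, while the redirect-to-redirect check runs at editgroup accept time to avoid race conditions.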
diff --git a/notes/misc/example_entities.txt b/notes/misc/example_entities.txt
new file mode 100644
index 00000000..e4016d8a
--- /dev/null
+++ b/notes/misc/example_entities.txt
@@ -0,0 +1,58 @@
+
+errata/update:
+    Fourth Test of General Relativity: Preliminary Results
+    10.1103/physrevlett.20.1265
+    10.1103/physrevlett.21.266.3 
+
+    same title; later is errata to the first.
+    very minor: The term "baud length" was consistently misprinted as "band length."
+
+DOIs for individual images
+    https://commons.wikimedia.org/wiki/Category:Media_from_Williams_et_al._2010_-_10.1371/journal.pone.0010676
+
+long-tail journal not in fatcat; web-native, tricky to crawl
+    https://angryoldmanmagazine.com/
+
+dataset
+    "ISSN-Matching of Gold OA Journals (ISSN-GOLD-OA) 2.0"
+    https://pub.uni-bielefeld.de/data/2913654
+    2 files
+    has DOI: 10.4119/unibi/2913654
+
+release group; single PDF is valid copy of two DOIs:
+    https://fatcat.wiki/file/wr64e37yvfcidgbowtslx7omne
+    10.5167/uzh-146424
+    10.1016/j.physletb.2017.12.006
+    ALSO: has CC-BY license_slug
+
+bad MAG match:
+
+    https://fatcat.wiki/release/b65rjfixxbh4zjd3zxcjdz2b6e
+    https://academic.microsoft.com/paper/2535407850
+    MAG has wrong metadata? have not corrected in fatcat
+
+
+## Long-Tail Content
+
+humanities journal; content in SIM and Proquest, no Keepers, no DOIs:
+
+    Clio: A Journal of Literature, History, and the Philosophy of History
+    https://fatcat.wiki/container/bsn7fpeyx5ep7eyjgxxd5oygsa
+
+### Examples from Twitter
+
+Thread from 2021: <https://twitter.com/internetarchive/status/1361329860254130181>
+
+- Granta Magazine
+- Punk Planet (in IA?)
+- Black Clock (https://en.wikipedia.org/wiki/Black_Clock)
+- Le Grand Jeu
+- ILK Journal (in wayback: http://web.archive.org/web/20160331182524/http://ilkjournal.com/journal/issue-fourteen/roberto-montes/)
+
+
+### Vanished Content
+
+"Abril"
+https://fatcat.wiki/container/stdnbbwbpzflzhp2syctupqtc4
+    in DOAJ
+    broken DOIs, but new website does exist?
diff --git a/notes/misc/examples/content_scope.txt b/notes/misc/examples/content_scope.txt
new file mode 100644
index 00000000..321dd056
--- /dev/null
+++ b/notes/misc/examples/content_scope.txt
@@ -0,0 +1,45 @@
+
+sha1:fe27d2d036d478fb692be95045b72773e0dc27ac
+https://fatcat.wiki/file/4tcvwhzunrgvri4x3uruug62jq
+
+    cover page... an ILL request? via ILL request.
+
+    "metadata": {
+        "author": "Emmanuel Lemoine",
+        "creator": "Okina",
+        "producer": "mPDF 6.0",
+        "title": "Chloro complexes of cobalt(II) in aprotic solvents: stability and structural modifications due to solvent effect"
+    },
+    "pdf_created": "2017-01-26T10:43:21+00:00",
+    "pdf_version": "1.4",
+    "permanent_id": "2d231660c0e26f92aad7cb2f62b5e03a",
+
+    SELECT *
+    FROM pdf_meta
+    WHERE
+        status = 'success'
+        AND page_count < 3
+        AND (metadata->>'creator')::text = 'Okina'
+    LIMIT 5;
+
+    SELECT COUNT(*)
+    FROM pdf_meta
+    WHERE
+        status = 'success'
+        AND page_count < 3
+        AND (metadata->>'creator')::text = 'Okina'
+    ;
+    # 4235
+
+    TODO: 'COPY TO'...
+
+    SELECT pdf_meta.sha1hex
+    FROM pdf_meta
+    LEFT JOIN fatcat_file ON pdf_meta.sha1hex = fatcat_file.sha1hex
+    WHERE
+        status = 'success'
+        AND page_count < 3
+        AND (metadata->>'creator')::text = 'Okina'
+        AND (metadata->>'producer')::text LIKE 'mPDF%'
+        AND fatcat_file.ident IS NOT NULL
+    ;
diff --git a/notes/misc/examples/grobid_500.txt b/notes/misc/examples/grobid_500.txt
new file mode 100644
index 00000000..5e64c781
--- /dev/null
+++ b/notes/misc/examples/grobid_500.txt
@@ -0,0 +1,4 @@
+
+seems like a legit/fine PDF file:
+https://fatcat.wiki/file/nrydu6nutvedximcb4lpdsrp6u
+
diff --git a/notes/misc/examples/personal_favorites.md b/notes/misc/examples/personal_favorites.md
new file mode 100644
index 00000000..2ecee2d8
--- /dev/null
+++ b/notes/misc/examples/personal_favorites.md
@@ -0,0 +1,2 @@
+
+International Journal of Crashworthiness
diff --git a/notes/misc/examples/random_journals.txt b/notes/misc/examples/random_journals.txt
new file mode 100644
index 00000000..f5cb0e69
--- /dev/null
+++ b/notes/misc/examples/random_journals.txt
@@ -0,0 +1,5 @@
+
+"Rejecta Mathematica"
+only published articles which failed peer review.
+no longer online, but may be in wayback
+https://en.wikipedia.org/wiki/Rejecta_Mathematica
diff --git a/notes/misc/examples/random_works.txt b/notes/misc/examples/random_works.txt
new file mode 100644
index 00000000..3f5bb7e3
--- /dev/null
+++ b/notes/misc/examples/random_works.txt
@@ -0,0 +1,9 @@
+
+"The limitations of using languages for description", Marvin Minsky
+http://web.mit.edu/dxh/www/1970_Marvin_Lecture_Transcript_Italy_Limitations_Language.pdf
+
+"A Supercut of Supercuts: Aesthetics, Histories, Databases"
+https://vimeo.com/440746435
+https://www.openscreensjournal.com/article/id/6946/
+
+Dummy article in springer (paywalled!): https://doi.org/10.1007/s10096-005-0027-5
diff --git a/notes/misc/examples/video_works.txt b/notes/misc/examples/video_works.txt
new file mode 100644
index 00000000..6c0a450f
--- /dev/null
+++ b/notes/misc/examples/video_works.txt
@@ -0,0 +1,4 @@
+
+https://doi.org/10.24350/cirm.v.19933803
+    "Imaging with nonlinear and fractionally damped waves"
+    https://library.cirm-math.fr/Record.htm?record=19280247124910084299&confirm=on
diff --git a/notes/misc/horror_stories.md b/notes/misc/horror_stories.md
new file mode 100644
index 00000000..eaac48e7
--- /dev/null
+++ b/notes/misc/horror_stories.md
@@ -0,0 +1,10 @@
+
+Two different DOIs for the same work, from different publishers:
+
+    Intravenous Administration of Human γ-Globulin
+    S. Barandun, P. Kistler, F. Jeunet, H. Isliker
+    1962, Vox Sanguinis
+
+    https://fatcat.wiki/release/search?q=%22Intravenous+administration+of+human+%CE%B3-globulin%22&generic=1
+    10.1111/j.1423-0410.1962.tb03240.x
+    10.1159/000464763
diff --git a/notes/misc/rust_libraries.txt b/notes/misc/rust_libraries.txt
new file mode 100644
index 00000000..d5c8c18a
--- /dev/null
+++ b/notes/misc/rust_libraries.txt
@@ -0,0 +1,41 @@
+
+libs:
+- iron_slog
+- testing: keep it simple: iron-test
+    => if that is annoying, shiny? mockers if needed.
+- sentry
+- start with dotenv+clap, then config-rs?
+- cadence (emits statsd)
+- frank_jwt and JWT for (simple?) auth
+
+metrics:
+- best would be something with a configurable back-end, like 'log' for logging,
+  but supporting tags/labels. the prometheus model probably makes most sense by
+  default (really nice to be able to grab metrics with 'curl'/browser for
+  individual instances), but statsd seems to be what we run in production. not
+  spewing out lots of UDP by default seems like a good idea.
+- dipstick: has all the good features, and popular, but code quality has smells
+  ("a32dlkjhw"-style commit messages), and API doesn't seem very clean. Also
+  prometheus stuff not actually implemented
+- cadence: seems stable, somewhat popular, clean API. statsd-only for now, but
+  has custom backends that could be hooked on to. *super* few dependencies,
+  nice.
+- tic: many deps; doesn't seem stable or under development
+- rust-prometheus: developed by pingcap (large company). has push and pull
+  features. medium-sized deps; has feature flags
+
+A nice feature of a statsd solution is that collectd is usually running
+locally (on linux dev, or in production), and metrics can be sent there by
+default, like journald for logging.
+
+Seems like a decision between cadence (statsd) and rust-prometheus.
+
+similar:
+- https://github.com/DavidBM/templic-backend
+- https://github.com/alexanderbanks/rust-api
+- https://mgattozzi.com/diesel-powered-rocket
+- https://www.reddit.com/r/rust/comments/8j1xbs/new_to_rust_and_gitlab_ci/
+- https://crate-ci.github.io/
+
+"cool tools":
+- cargo-watch
diff --git a/notes/misc/test_works.txt b/notes/misc/test_works.txt
new file mode 100644
index 00000000..59b01701
--- /dev/null
+++ b/notes/misc/test_works.txt
@@ -0,0 +1,77 @@
+
+http://mathsci.wikia.com/wiki/The_Haruhi_Problem
+
+## Found because Famous
+
+Many co-authors (group):
+
+    "Precision measurement of the top-quark mass in lepton+jets final states"
+    https://arxiv.org/abs/1405.1756
+
+"Fake" creator: Bourbaki
+
+"Fake" works: John Bohannon sting operations, previous scandals
+
+## Found in Testing Imports
+
+Two releases, same work (actually same release?):
+
+    Freiheit für Nutzer, nicht für Software
+    10.14361/transcript.9783839420362.366 
+    10.14361/9783839428351-056 
+
+    May also have link via crossref metadata?
+
+Fun ellen examples:
+
+    Just-in-time databases and the World-Wide Web
+    10.1145/288627.288638 
+
+    Two different versions of PDF found, same URL
+
+Actual ORCID match:
+
+    10.1002/cfg.158
+    0000-0002-4447-5978
+
+Fulltext via CORE publisher-connector:
+
+    10.1186/s12889-016-2706-9 
+
+Fake/example DOI: 10.5555/12345678
+ORCID: 0000-0002-1825-0097
+ISSN (invalid?): 0264-3561
+
+We have fulltext via long-tail; only Google also has a copy:
+    ON DECOMPOSITIONS OF THE IDENTITY OPERATOR INTO A LINEAR COMBINATION OF ORTHOGONAL PROJECTIONS
+    http://mfat.imath.kiev.ua/article/?id=543
+    2010, open access
+    Institute of Mathematics NAS of Ukraine
+    "arXiv overlay journal"
+    sha1=0d39d932aad191fe8ed07572d96260ee4fad26aa
+
+Very large authorship/reference lists:
+
+- 10.1038/nature.2015.17567 (not in crossref metadata)
+- 10.1038/nature14474
+- 10.1534/g3.114.015966
+
+DOIs same except for an extra slash:
+
+    10.1037/0003-066x.39.1.40
+    10.1037//0003-066x.39.1.40
+
+## Missing
+
+"ACE: A Novel Software Platform to Ensure the Integrity [...]"
+
+"Periods of Twenty-five Variable Stars in the Small Magellanic Cloud" by
+Leavitt, Henrietta
+=> shows as a chapter, not the original paper
+=> in google scholar as "Periods of 25 Variable Stars in the Small Magellanic Cloud.", as well as several other harvard.edu results
+
+"Browser history re:visited"
+=> no DOI; conference proceeding
+=> in google scholar
+=> random un-published version at https://www.spinda.net/; "The copy of the
+   paper hosted here has been updated to reflect [...]"
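The "DOIs same except for an extra slash" pair in notes/misc/test_works.txt suggests a comparison key for spotting such near-duplicates. A minimal sketch follows, assuming the goal is only to detect candidate pairs, not to rewrite stored DOIs; `doi_match_key` is a hypothetical helper and not part of fatcat.

```python
def doi_match_key(doi: str) -> str:
    """Lowercase a DOI and drop extra slashes right after the registrant prefix.

    Intended only for grouping look-alike DOIs; whether two such DOIs
    should be merged is a separate decision.
    """
    doi = doi.strip().lower()
    # Split once at the first slash: "10.1037" + "/" + rest of the DOI.
    prefix, sep, suffix = doi.partition("/")
    # Remove any additional leading slashes from the suffix.
    return prefix + sep + suffix.lstrip("/")
```

Both DOIs from the notes, `10.1037/0003-066x.39.1.40` and `10.1037//0003-066x.39.1.40`, map to the same key, so they would land in the same group.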
diff --git a/notes/misc/thesis_uk.md b/notes/misc/thesis_uk.md
new file mode 100644
index 00000000..cbcca6d5
--- /dev/null
+++ b/notes/misc/thesis_uk.md
@@ -0,0 +1,6 @@
+
+large amount of doctoral thesis metadata, from EThOS
+https://bl.iro.bl.uk/concern/datasets/c815b271-09be-4123-8156-405094429198
+
+will get via OAI-PMH, presumably. but, requires login for actual download?
+sigh.
diff --git a/notes/misc/unsorted.txt b/notes/misc/unsorted.txt
new file mode 100644
index 00000000..17ff839c
--- /dev/null
+++ b/notes/misc/unsorted.txt
@@ -0,0 +1,19 @@
+
+fatcat misc:
+- opencitations: https://arxiv.org/abs/1906.11964
+- https://pub.uni-bielefeld.de/record/2934907
+- re-read: scratch/issn/web_archiving.md
+- should expansion of 'wip' entities be allowed?
+- could now just not show 'wip' entities (unless part of editgroup)
+-  release_ref | 19904400 | Missing Index? |  4141039616 | 81833687 |  61929287
+- privacy/security issue with libmacaroon logging failed caveat verification
+- blank box on editgroup pages when not logged in
+- don't have "Editable catalog of bibliographic and fulltext file metadata" be the thing in snippets?
+- web: '|dictsort' in a bunch of places (for stability)
+- example HTML paper: https://andrewgyork.github.io/rescan_line_sted/
+- pubmed importer should include section in ALLCAPS: for multi-part abstracts
+- https://github.com/rholder/retrying
+- feature: push-button "update metadata from crossref"
+- demo ORCID: 0000-0002-1825-0097
+- link: https://www.jstor.org/dfr/about/technical-specifications
+- after indexing, optimise the Elasticsearch index by merging into a single segment: curl -XPOST 'http://localhost:9200/scholar/_forcemerge?max_num_segments=1'
diff --git a/notes/misc/webface_iteration.md b/notes/misc/webface_iteration.md
new file mode 100644
index 00000000..a7f11d15
--- /dev/null
+++ b/notes/misc/webface_iteration.md
@@ -0,0 +1,14 @@
+
+## Design Examples
+
+metamath
+
+- example: <https://wapm.io/package/liftm/metamath>
+- somewhat similar to existing fatcat release layout
+- tabs are better? tabs scroll left/right on mobile
+- CSS/etc is heavy, though design is simple
+
+lib.rs
+
+sourcehut
+
diff --git a/notes/misc/webface_notes.txt b/notes/misc/webface_notes.txt
new file mode 100644
index 00000000..37a56c5c
--- /dev/null
+++ b/notes/misc/webface_notes.txt
@@ -0,0 +1,62 @@
+
+# CSS/JS Libraries
+
+tachyons is nice for simple css-only stuff, but let's use "Semantic UI" because
+it has a bunch of javascript form stuff.
+
+    <link rel="stylesheet" href="https://cdn.jsdelivr.net/semantic-ui/2.2.13/semantic.min.css">
+    <script src="https://cdn.jsdelivr.net/semantic-ui/2.2.13/semantic.min.js"></script>
+
+
+# "Add Something" Workflow
+
+## Add a Work
+
+Title
+Primary Type
+Primary Creators/Authors
+Description (not an abstract)
+Primary/Original Language
+Subject/Categorization/Tags
+Is a Stub (unpublished/unreleased)
+
+## Release Information
+
+Contributors
+Date
+Container / Part-Of
+Publisher
+Identifiers
+Language
+Type / Media
+Issue / Volume / Pages / Chapter
+
+## Anything Else?
+
+Known file / copy / url
+Citations (outbound)
+
+# Queries / Searches / Views
+
+Views: work, release, creator, container, publisher
+
+Lookup by identifier
+
+# Other Workflows/Editors
+
+Single-creator-oriented helper to find works and disambiguate authorship
+
+Bulk author disambiguation helper (find other unresolved authors with same
+alias text and select; drag works between columns)
+
+Bulk query-then-edit UI: search results in a table, edit like a spreadsheet, up
+to... dozens? Query and then apply delta (eg, set topic)? Eg, author edits
+basic metadata for all their citations all at once.
+
+Release editor
+
+Merge containers (and all related releases)
+Merge entities (works, releases, etc)
+Move release between works
+Split entities (works, authors, etc), including linked stuff
+