From 5defd444135bc4adb0748b0d2b8c9b88708bdc1a Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Tue, 23 Mar 2021 21:42:32 -0700 Subject: proposals: add 2021 UI updates, and rename all to have a date in filename --- .../2020-05-11_microfilm_indexing_pipeline.md | 30 ++++++ proposals/2020-05-11_overview.md | 38 ++++++++ proposals/2020-05-11_web_interface.md | 69 +++++++++++++ proposals/2020-05-16_fatcat_indexing_pipeline.md | 54 +++++++++++ proposals/2020-06-04_work_schema.md | 108 +++++++++++++++++++++ proposals/2020-10-20_kafka_update_pipeline.md | 63 ++++++++++++ proposals/2021-01-18_crude_query_parse.md | 18 ++++ proposals/2021-02-15_ui_updates.md | 53 ++++++++++ proposals/2021_crude_query_parse.md | 18 ---- proposals/fatcat_indexing_pipeline.md | 54 ----------- proposals/kafka_update_pipeline.md | 63 ------------ proposals/microfilm_indexing_pipeline.md | 30 ------ proposals/overview.md | 38 -------- proposals/web_interface.md | 69 ------------- proposals/work_schema.md | 108 --------------------- 15 files changed, 433 insertions(+), 380 deletions(-) create mode 100644 proposals/2020-05-11_microfilm_indexing_pipeline.md create mode 100644 proposals/2020-05-11_overview.md create mode 100644 proposals/2020-05-11_web_interface.md create mode 100644 proposals/2020-05-16_fatcat_indexing_pipeline.md create mode 100644 proposals/2020-06-04_work_schema.md create mode 100644 proposals/2020-10-20_kafka_update_pipeline.md create mode 100644 proposals/2021-01-18_crude_query_parse.md create mode 100644 proposals/2021-02-15_ui_updates.md delete mode 100644 proposals/2021_crude_query_parse.md delete mode 100644 proposals/fatcat_indexing_pipeline.md delete mode 100644 proposals/kafka_update_pipeline.md delete mode 100644 proposals/microfilm_indexing_pipeline.md delete mode 100644 proposals/overview.md delete mode 100644 proposals/web_interface.md delete mode 100644 proposals/work_schema.md diff --git a/proposals/2020-05-11_microfilm_indexing_pipeline.md b/proposals/2020-05-11_microfilm_indexing_pipeline.md new file mode 100644 index 0000000..657aae2 --- /dev/null +++ b/proposals/2020-05-11_microfilm_indexing_pipeline.md @@ -0,0 +1,30 @@ + +## High-Level + +- operate on an entire item +- check against issue DB and/or fatcat search + => if there is fatcat work-level metadata for this issue, skip +- fetch collection-level (journal) metadata +- iterate through djvu text file: + => convert to simple text + => filter out non-research pages using quick heuristics + => try looking up "real" page number from OCR work (in item metadata) +- generate "heavy" intermediate schema (per valid page): + => fatcat container metadata + => ia collection (journal) metadata + => item metadata + => page fulltext and any metadata + +- transform "heavy" intermediates to ES schema + +## Implementation + +Existing tools and libraries: + +- internetarchive python tool to fetch files and item metadata +- fatcat API client for container metadata lookup + +New tools or libraries needed: + +- issue DB or use fatcat search index to count releases by volume/issue +- djvu XML parser diff --git a/proposals/2020-05-11_overview.md b/proposals/2020-05-11_overview.md new file mode 100644 index 0000000..fa8148c --- /dev/null +++ b/proposals/2020-05-11_overview.md @@ -0,0 +1,38 @@ + + +Can be multiple releases for each work: + +- required: most canonical published version ("version of record", what would be cited) + => or, most updated? 
+- optional: mostly openly accessible version
+- optional: updated version
+    => errata, corrected version, or retraction
+- optional: fulltext indexed version
+    => might not be the most updated, or not accessible
+
+
+## Initial Plan
+
+Index all fatcat works in catalog.
+
+Always link to a born-digital copy if one is accessible.
+
+Always link to a SIM microfilm copy if one is available.
+
+Use best available fulltext for search. If structured, like TEI-XML, index the
+body text separately from abstracts and references.
+
+
+## Other Ideas
+
+Do fulltext indexing at the granularity of pages, or some other segments of
+text within articles (paragraphs, chapters, sections).
+
+Fatcat already has all of Crossref, PubMed, arXiv, and several other
+authoritative metadata sources. But today we are missing a good chunk of
+content, particularly from institutional repositories and CS conferences (which
+don't use identifiers). We also don't have good affiliation or citation count
+coverage, and abstract coverage is mixed/poor.
+
+Could use Microsoft Academic Graph (MAG) metadata corpus (or similar) to
+bootstrap with better metadata coverage.
diff --git a/proposals/2020-05-11_web_interface.md b/proposals/2020-05-11_web_interface.md
new file mode 100644
index 0000000..416e6fc
--- /dev/null
+++ b/proposals/2020-05-11_web_interface.md
@@ -0,0 +1,69 @@
+
+Single domain (TBD, but eg scholar.archive.org) will host a web
+search interface. May also expose APIs on this host, or might use a separate
+host for that.
+
+Content would not be hosted on this domain; all fulltext copies would be linked
+to elsewhere.
+
+Style (eg, colors, font?) would be similar to fatcat.wiki, but may or
+may not have regular top bar (archive.org has this). There would
+be no "write" or "modify" features on this site at all: users would not need to
+log in. Metadata updates and features would all redirect to archive.org or
+fatcat.wiki.
+
+
+## Design and Features
+
+Will try to hew closely to PubMed in style, layout, and features.
+
+Only a single search interface (no separate "advanced" page). Custom query
+parser.
+
+Filtering and sorting via controls under the search box. A button opens a box
+with more settings. If these are persisted at all, only via cookies or local
+storage.
+
+## URL Structure
+
+All pages can be prefixed with a two-character language specifier. Default
+(with no prefix) is English.
+
+`/`: homepage, single-sentence, large search box, quick stats and info
+
+`/about`: about
+
+`/help`: FAQ?
+
+`/help/search`: advanced query tips
+
+`/search`: query and results page
+
+
+## More Ideas
+
+Things we *could* do, but maybe *shouldn't*:
+
+- journal-level metadata and summary. Could just link to fatcat.
+
+
+## APIs
+
+Might also expose public APIs on that domain:
+
+- search
+- citation matching
+- save-paper-now
+
+
+## Implementation
+
+For first iteration, going to use:
+
+- python3.7
+- elasticsearch-dsl from Python and page-load-per-query (not single-page-app)
+- fastapi (web framework)
+- jinja2 (HTML templating)
+- babel (i18n)
+- semantic-ui (CSS)
+- minimal or no javascript
diff --git a/proposals/2020-05-16_fatcat_indexing_pipeline.md b/proposals/2020-05-16_fatcat_indexing_pipeline.md
new file mode 100644
index 0000000..deafb65
--- /dev/null
+++ b/proposals/2020-05-16_fatcat_indexing_pipeline.md
@@ -0,0 +1,54 @@
+
+## High-Level
+
+Work-oriented: base input is arrays of expanded releases, all from the same
+work.
+
+Re-index pipeline would look at the fatcat changelog or existing release feed,
+and use the `work_id` to fetch all other releases.
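+
+As a rough sketch, the re-index fetch step could look like the following
+(assuming `fatcat-openapi-client`; the `get_work_releases` method and expand
+parameters here are assumptions, not confirmed names):
+
+```python
+from fatcat_openapi_client import DefaultApi
+
+def fetch_expanded_work(api: DefaultApi, work_id: str) -> dict:
+    """Gather all releases under one work, expanded with file and
+    container metadata, as input to the work-level indexing pipeline."""
+    stubs = api.get_work_releases(work_id)  # assumed endpoint
+    releases = [
+        api.get_release(stub.ident, expand="files,container")
+        for stub in stubs
+    ]
+    return {"work_id": work_id, "releases": releases}
+```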
+
+Batch indexing pipeline would use a new variant of `fatcat-export` which emits
+expanded releases (one per line), grouped (or sorted) by work id.
+
+Then, the pipeline looks like:
+
+- choose canonical release
+- choose best access
+- choose best fulltext file
+    => iterate releases and files
+    => soft prefer canonical release, file access, release_date, etc
+    => check via postgrest query that fulltext is available
+    => fetch raw fulltext
+- check if we expect a SIM copy to exist
+    => eg, using an issue db?
+    => if so, fetch petabox metadata and try to confirm, so we can create a URL
+    => if we don't have another fulltext source (?):
+    => fetch djvu file and extract the pages in question (or just 1 if unsure?)
+- output "heavy" object
+
+Next step is:
+
+- summarize biblio metadata
+- select one abstract per language
+- sanitize abstracts and fulltext content for indexing
+- compute counts, epistemological quality, etc
+
+The output of that goes to Kafka for indexing into ES.
+
+This indexing process is probably going to be both CPU and network intensive.
+In Python we will want multiprocessing, and maybe also async?
+
+## Implementation
+
+Existing tools/libraries:
+
+- fatcat-openapi-client
+- postgrest client
+- S3/minio/seaweed client
+- ftfy
+- language detection
+
+New tools needed (eventually):
+
+- strip latex
+- strip JATS or HTML
diff --git a/proposals/2020-06-04_work_schema.md b/proposals/2020-06-04_work_schema.md
new file mode 100644
index 0000000..97d60ac
--- /dev/null
+++ b/proposals/2020-06-04_work_schema.md
@@ -0,0 +1,108 @@
+
+## Top-Level
+
+- type: `_doc` (aka, no type, `include_type_name=false`)
+- key: keyword (same as `_id`)
+- `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits)
+- `doc_type`: keyword (work or page)
+- `doc_index_ts`: timestamp when document was indexed
+- `work_ident`: fatcat work ident (optional)
+
+- `biblio`: obj
+- `fulltext`: obj
+- `ia_sim`: obj
+- `abstracts`: nested
+    body
+    lang
+- `releases`: nested (TBD)
+- `access`
+- `tags`: array of keywords
+
+TODO:
+- summary fields to index "everything" into?
+
+## Biblio
+
+Mostly matches existing `fatcat_release` schema.
+
+- `release_id`
+- `release_revision`
+- `title`
+- `subtitle`
+- `original_title`
+- `release_date`
+- `release_year`
+- `withdrawn_status`
+- `language`
+- `country_code`
+- `volume` (etc)
+- `volume_int` (etc)
+- `first_page`
+- `first_page_int`
+- `pages`
+- `doi` (etc)
+- `number` (etc)
+
+NEW:
+- `preservation_status`
+
+[etc]
+
+- `license_slug`
+- `publisher` (etc)
+- `container_name` (etc)
+- `container_id`
+- `container_issnl`
+- `container_wikidata_qid`
+- `issns` (array)
+- `contrib_names`
+- `affiliations`
+- `creator_ids`
+
+TODO: should all external identifiers go under `releases` instead of `biblio`? Or should some be duplicated?
+
+## Fulltext
+
+- `status`: web, sim, shadow
+- `body`
+- `lang`
+- `file_mimetype`
+- `file_sha1`
+- `file_id`
+- `thumbnail_url`
+
+## Abstracts
+
+Nested object with:
+
+- body
+- lang
+
+For prototyping, perhaps just make it an object with `body` as an array.
+
+Only index one abstract per language.
+
+## SIM (Microfilm)
+
+Enough details to construct a link or do a lookup or whatever. Note that we
+might be doing CDL status lookups on SERP pages.
+
+- `issue_item`: str
+- `pub_collection`: str
+- `sim_pubid`: str
+- `first_page`: str
+
+
+Also pass through archive.org metadata here (collection-level and item-level).
+
+## Access
+
+Start with obj, but maybe later nested?
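+
+For illustration, a minimal `elasticsearch-dsl` sketch of this object (class
+and field names here are assumptions; the fields mirror the list below, and
+`Object` vs `Nested` is the open question above):
+
+```python
+from elasticsearch_dsl import Document, InnerDoc, Keyword, Object
+
+class Access(InnerDoc):
+    # status: direct, cdl, repository, publisher, loginwall, paywall, etc
+    status = Keyword()
+    mimetype = Keyword()
+    access_url = Keyword()
+    file_url = Keyword()
+    file_id = Keyword()
+    release_id = Keyword()
+
+class ScholarFulltextDoc(Document):
+    # start with a plain object; switch to Nested(Access) if we need
+    # per-copy queries across multiple access options
+    access = Object(Access)
+```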
+
+- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
+- `mimetype`
+- `access_url`
+- `file_url`
+- `file_id`
+- `release_id`
+
diff --git a/proposals/2020-10-20_kafka_update_pipeline.md b/proposals/2020-10-20_kafka_update_pipeline.md
new file mode 100644
index 0000000..597a1b0
--- /dev/null
+++ b/proposals/2020-10-20_kafka_update_pipeline.md
@@ -0,0 +1,63 @@
+
+Want to receive a continual stream of updates from both fatcat and SIM
+scanning; index the updated content; and push into elasticsearch.
+
+
+## Filtering and Affordances
+
+The `updated` and `fetched` timestamps are not immediately necessary or
+implemented, but they can be used to filter updates. For example, after
+re-loading from a bulk entity dump, we could "roll back" the update pipeline
+to only fatcat (work) updates after the changelog index that the bulk dump is
+stamped with.
+
+At least in theory, the `fetched` timestamp could be used to prevent re-updates
+of existing documents in the ES index.
+
+The `doc_index_ts` timestamp in the ES index could be used in a future
+fetch-and-reindex worker to select documents for re-indexing, or to delete
+old/stale documents (eg, after SIM issue re-indexing if there were spurious
+"page" type documents remaining).
+
+## Message Types
+
+Scholar Update Request JSON
+- `key`: str
+- `type`: str
+    - `fatcat_work`
+    - `sim_issue`
+- `updated`: datetime, UTC, of the event resulting in this request
+- `work_ident`: str (works)
+- `fatcat_changelog`: int (works)
+- `sim_item`: str (items)
+
+"Heavy Intermediate" JSON (existing schema)
+- key
+- `fetched`: Optional[datetime], UTC, when this doc was collected
+
+Scholar Fulltext ES JSON (existing schema)
+
+
+## Kafka Topics
+
+fatcat-ENV.work-ident-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.sim-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.update-docs
+    12x, short retention (2 months?)
+    key: doc ident
+
+## Workers
+
+scholar-fetch-docs-worker
+    consumes fatcat and/or sim update requests, individually
+    constructs heavy intermediate
+    publishes to update-docs topic
+
+scholar-index-docs-worker
+    consumes updated "heavy intermediate" documents, in batches
+    transforms to elasticsearch schema
+    updates elasticsearch
diff --git a/proposals/2021-01-18_crude_query_parse.md b/proposals/2021-01-18_crude_query_parse.md
new file mode 100644
index 0000000..2a7663b
--- /dev/null
+++ b/proposals/2021-01-18_crude_query_parse.md
@@ -0,0 +1,18 @@
+
+
+Thinking of simple ways to reduce query parse errors and handle more queries as
+expected. In particular:
+
+- handle slashes in query tokens (eg, "N/A" without quotes)
+- handle semicolons in queries, when they are not intended as filters
+- if a query "looks like" a raw citation string, detect that and do citation
+  parsing into a structured format, then do a query or fuzzy lookup from there
+
+
+## Questions/Thoughts
+
+Should we detect title lookups in addition to full citation lookups? Probably
+too complicated.
+
+Do we have a static list of colon-prefixes, or load from the schema mapping
+file itself?
diff --git a/proposals/2021-02-15_ui_updates.md b/proposals/2021-02-15_ui_updates.md
new file mode 100644
index 0000000..72e4743
--- /dev/null
+++ b/proposals/2021-02-15_ui_updates.md
@@ -0,0 +1,53 @@
+
+status: partially-implemented
+
+This documents a series of changes made in early 2021, before launch.
+
+## Default URLs and Access (done)
+
+Replace the current access link under the thumbnail with a box that can expand
+to show more access options: domain, rel, filetype, release (version), maybe
+wayback date.
+
+Labels over the thumbnail should show type (PDF, HTML), and maybe release stage
+(if different from primary release).
+
+"Blue Links" for each hit should change, eg:
+
+- if arxiv, arxiv.org
+- elif PMID or PMCID, PubMed
+- elif DOI, publisher (or whatever; follow the DOI)
+- elif microfilm, go to access
+- else fatcat landing page
+
+What about: JSTOR, DOAJ
+
+
+## Version Display (done)
+
+Instead of showing a grid, could keep style similar to what already exists: the
+single line of year/venue/status, then a line of identifiers in green (done)
+
+
+## Query Behaviors
+
+- "fail less": re-write more queries, potentially after ES has already returned a failure (done)
+- change the default of only showing fulltext hits?
+
+
+## Tooltips/Extras (done)
+
+- show date on mouse-over of the year field
+- link the container name to the fatcat container page
+
+
+## Clickable Queries
+
+Allow search filters by clicking on: author, year, container
+
+Filters should simply be added to the current query string. Not sure how to
+implement.
+
+
+## Responsive Design (done)
+
+There is a window width (tablet?) where we keep a fixed column width with
+margins, which results in small thumbnails.
diff --git a/proposals/2021_crude_query_parse.md b/proposals/2021_crude_query_parse.md
deleted file mode 100644
index 2a7663b..0000000
--- a/proposals/2021_crude_query_parse.md
+++ /dev/null
@@ -1,18 +0,0 @@
-
-
-Thinking of simple ways to reduce query parse errors and handle more queries as
-expected. In particular:
-
-- handle slashes in query tokens (eg, "N/A" without quotes)
-- handle semi-colons in queries, when they are not intended as filters
-- if query "looks like" a raw citation string, detect that and do citation
-  parsing in to a structured format, then do a query or fuzzy lookup from there
-
-
-## Questions/Thoughts
-
-Should we detect title lookups in addition to full citation lookups? Probably
-too complicated.
-
-Do we have a static list of colon-prefixes, or load from the schema mapping
-file itself?
diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
deleted file mode 100644
index deafb65..0000000
--- a/proposals/fatcat_indexing_pipeline.md
+++ /dev/null
@@ -1,54 +0,0 @@
-
-## High-Level
-
-Work-oriented: base input is arrays of expanded releases, all from the same
-work.
-
-Re-index pipeline would look at fatcat changelog or existing release feed, and
-use the `work_id` to fetch all other releases.
-
-Batch indexing pipeline would use a new variant of `fatcat-export` which is
-expanded releases (one-per-line), grouped (or sorted) by work id.
-
-Then, pipeline looks like:
-
-- choose canonical release
-- choose best access
-- choose best fulltext file
-    => iterate releases and files
-    => soft prefer canonical release, file access, release_date, etc
-    => check via postgrest query that fulltext is available
-    => fetch raw fulltext
-- check if we expect a SIM copy to exist
-    => eg, using an issue db?
-    => if so, fetch petabox metadata and try to confirm, so we can create a URL
-    => if we don't have another fulltext source (?):
-    => fetch djvu file and extract the pages in question (or just 1 if unsure?)
-- output "heavy" object - -Next step is: - -- summarize biblio metadata -- select one abstract per language -- sanitize abstracts and fulltext content for indexing -- compute counts, epistimological quality, etc - -The output of that goes to Kafka for indexing into ES. - -This indexing process is probably going to be both CPU and network intensive. -In python will want multiprocessing and maybe also async? - -## Implementation - -Existing tools/libraries: - -- fatcat-openapi-client -- postgrest client -- S3/minio/seaweed client -- ftfy -- language detection - -New needed (eventually): - -- strip latex -- strip JATS or HTML diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md deleted file mode 100644 index 597a1b0..0000000 --- a/proposals/kafka_update_pipeline.md +++ /dev/null @@ -1,63 +0,0 @@ - -Want to receive a continual stream of updates from both fatcat and SIM -scanning; index the updated content; and push into elasticsearch. - - -## Filtering and Affordances - -The `updated` and `fetched` timestamps are not immediately necessary or -implemented, but they can be used to filter updates. For example, after -re-loading from a build entity dump, could "roll back" update pipeline to only -fatcat (work) updates after the changelog index that the bulk dump is stamped -with. - -At least in theory, the `fetched` timestamp could be used to prevent re-updates -of existing documents in the ES index. - -The `doc_index_ts` timestamp in the ES index could be used in a future -fetch-and-reindex worker to select documents for re-indexing, or to delete -old/stale documents (eg, after SIM issue re-indexing if there were spurious -"page" type documents remaining). - -## Message Types - -Scholar Update Request JSON -- `key`: str -- `type`: str - - `fatcat_work` - - `sim_issue` -- `updated`: datetime, UTC, of event resulting in this request -- `work_ident`: str (works) -- `fatcat_changelog`: int (works) -- `sim_item`: str (items) - -"Heavy Intermediate" JSON (existing schema) -- key -- `fetched`: Optional[datetime], UTC, when this doc was collected - -Scholar Fulltext ES JSON (existing schema) - - -## Kafka Topics - -fatcat-ENV.work-ident-updates - 6x, long retention, key compaction - key: doc ident -scholar-ENV.sim-updates - 6x, long retention, key compaction - key: doc ident -scholar-ENV.update-docs - 12x, short retention (2 months?) 
- key: doc ident - -## Workers - -scholar-fetch-docs-worker - consumes fatcat and/or sim update requests, individually - constructs heavy intermediate - publishes to update-docs topic - -scholar-index-docs-worker - consumes updated "heavy intermediate" documents, in batches - transforms to elasticsearch schema - updates elasticsearch diff --git a/proposals/microfilm_indexing_pipeline.md b/proposals/microfilm_indexing_pipeline.md deleted file mode 100644 index 657aae2..0000000 --- a/proposals/microfilm_indexing_pipeline.md +++ /dev/null @@ -1,30 +0,0 @@ - -## High-Level - -- operate on an entire item -- check against issue DB and/or fatcat search - => if there is fatcat work-level metadata for this issue, skip -- fetch collection-level (journal) metadata -- iterate through djvu text file: - => convert to simple text - => filter out non-research pages using quick heuristics - => try looking up "real" page number from OCR work (in item metadata) -- generate "heavy" intermediate schema (per valid page): - => fatcat container metadata - => ia collection (journal) metadata - => item metadata - => page fulltext and any metadata - -- transform "heavy" intermediates to ES schema - -## Implementation - -Existing tools and libraries: - -- internetarchive python tool to fetch files and item metadata -- fatcat API client for container metadata lookup - -New tools or libraries needed: - -- issue DB or use fatcat search index to count releases by volume/issue -- djvu XML parser diff --git a/proposals/overview.md b/proposals/overview.md deleted file mode 100644 index fa8148c..0000000 --- a/proposals/overview.md +++ /dev/null @@ -1,38 +0,0 @@ - - -Can be multiple releases for each work: - -- required: most canonical published version ("version of record", what would be cited) - => or, most updated? -- optional: mostly openly accessible version -- optional: updated version - => errata, corrected version, or retraction -- optional: fulltext indexed version - => might be not the most updated, or no accessible - - -## Initial Plan - -Index all fatcat works in catalog. - -Always link to a born-digital copy if one is accessible. - -Always link to a SIM microfilm copy if one is available. - -Use best available fulltext for search. If structured, like TEI-XML, index the -body text separate from abstracts and references. - - -## Other Ideas - -Do fulltext indexing at the granularity of pages, or some other segments of -text within articles (paragraphs, chapters, sections). - -Fatcat already has all of Crossref, Pubmed, Arxiv, and several other -authoritative metadata sources. But today we are missing a good chunk of -content, particularly from institutional repositories and CS conferences (which -don't use identifiers). Also don't have good affiliation or citation count -coverage, and mixed/poor abstract coverage. - -Could use Microsoft Academic Graph (MAG) metadata corpus (or similar) to -bootstrap with better metadata coverage. diff --git a/proposals/web_interface.md b/proposals/web_interface.md deleted file mode 100644 index 416e6fc..0000000 --- a/proposals/web_interface.md +++ /dev/null @@ -1,69 +0,0 @@ - -Single domain (TBD, but eg ) will host a web -search interface. May also expose APIs on this host, or might use a separate -host for that. - -Content would not be hosted on this domain; all fulltext copies would be linked -to elsewhere. - -Style (eg, colors, font?) would be similar to , but may or -may not have regular top bar ( has this). 
There would -be no "write" or "modify" features on this site at all: users would not need to -log in. Metadata updates and features would all redirect to archive.org or -fatcat.wiki. - - -## Design and Features - -Will try to hew most closely to Pubmed in style, layout, and features. - -Only a single search interface (no separate "advanced" page). Custom query -parser. - -Filtering and sort via controls under search box. A button opens a box with -more settings. If these are persisted at all, only via cookies or local -storage. - -## URL Structure - -All pages can be prefixed with a two-character language specifier. Default -(with no prefix) is english. - -`/`: homepage, single-sentance, large search box, quick stats and info - -`/about`: about - -`/help`: FAQ? - -`/help/search`: advanced query tips - -`/search`: query and results page - - -## More Ideas - -Things we *could* do, but maybe *shouldn't*: - -- journal-level metadata and summary. Could just link to fatcat. - - -## APIs - -Might also expose as public APIs on that domain: - -- search -- citation matching -- save-paper-now - - -## Implementation - -For first iteration, going to use: - -- python3.7 -- elasticsearch-dsl from python and page-load-per-query (not single-page-app) -- fastapi (web framework) -- jinja2 (HTML templating) -- babel (i18n) -- semantic-ui (CSS) -- minimal or no javascript diff --git a/proposals/work_schema.md b/proposals/work_schema.md deleted file mode 100644 index 97d60ac..0000000 --- a/proposals/work_schema.md +++ /dev/null @@ -1,108 +0,0 @@ - -## Top-Level - -- type: `_doc` (aka, no type, `include_type_name=false`) -- key: keyword (same as `_id`) -- `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits) -- `doc_type`: keyword (work or page) -- `doc_index_ts`: timestamp when document indexed -- `work_ident`: fatcat work ident (optional) - -- `biblio`: obj -- `fulltext`: obj -- `ia_sim`: obj -- `abstracts`: nested - body - lang -- `releases`: nested (TBD) -- `access` -- `tags`: array of keywords - -TODO: -- summary fields to index "everything" into? - -## Biblio - -Mostly matches existing `fatcat_release` schema. - -- `release_id` -- `release_revision` -- `title` -- `subtitle` -- `original_title` -- `release_date` -- `release_year` -- `withdrawn_status` -- `language` -- `country_code` -- `volume` (etc) -- `volume_int` (etc) -- `first_page` -- `first_page_int` -- `pages` -- `doi` etc -- `number` (etc) - -NEW: -- `preservation_status` - -[etc] - -- `license_slug` -- `publisher` (etc) -- `container_name` (etc) -- `container_id` -- `container_issnl` -- `container_wikidata_qid` -- `issns` (array) -- `contrib_names` -- `affiliations` -- `creator_ids` - -TODO: should all external identifiers go under `releases` instead of `biblio`? Or some duplicated? - -## Fulltext - -- `status`: web, sim, shadow -- `body` -- `lang` -- `file_mimetype` -- `file_sha1` -- `file_id` -- `thumbnail_url` - -## Abstracts - -Nested object with: - -- body -- lang - -For prototyping, perhaps just make it an object with `body` as an array. - -Only index one abstract per language. - -## SIM (Microfilm) - -Enough details to construct a link or do a lookup or whatever. Note that might -be doing CDL status lookups on SERP pages. - -- `issue_item`: str -- `pub_collection`: str -- `sim_pubid`: str -- `first_page`: str - - -Also pass-through archive.org metadata here (collection-level and item-level) - -## Access - -Start with obj, but maybe later nested? 
-
-- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
-- `mimetype`
-- `access_url`
-- `file_url`
-- `file_id`
-- `release_id`
-