Diffstat (limited to 'proposals/2020-06-04_work_schema.md')
-rw-r--r-- proposals/2020-06-04_work_schema.md 108
1 files changed, 108 insertions, 0 deletions
diff --git a/proposals/2020-06-04_work_schema.md b/proposals/2020-06-04_work_schema.md
new file mode 100644
index 0000000..97d60ac
--- /dev/null
+++ b/proposals/2020-06-04_work_schema.md
@@ -0,0 +1,108 @@
+
+## Top-Level
+
+- type: `_doc` (aka, no type, `include_type_name=false`)
+- key: keyword (same as `_id`)
+- `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits)
+- `doc_type`: keyword (work or page)
+- `doc_index_ts`: timestamp when document indexed
+- `work_ident`: fatcat work ident (optional)
+
+- `biblio`: obj
+- `fulltext`: obj
+- `ia_sim`: obj
+- `abstracts`: nested
+    - `body`
+    - `lang`
+- `releases`: nested (TBD)
+- `access`: obj
+- `tags`: array of keywords
+
+TODO:
+- summary fields to index "everything" into?
+
+## Biblio
+
+Mostly matches existing `fatcat_release` schema.
+
+- `release_id`
+- `release_revision`
+- `title`
+- `subtitle`
+- `original_title`
+- `release_date`
+- `release_year`
+- `withdrawn_status`
+- `language`
+- `country_code`
+- `volume` (etc)
+- `volume_int` (etc)
+- `first_page`
+- `first_page_int`
+- `pages`
+- `doi` (etc)
+- `number` (etc)
+
+NEW:
+- `preservation_status`
+
+[etc]
+
+- `license_slug`
+- `publisher` (etc)
+- `container_name` (etc)
+- `container_id`
+- `container_issnl`
+- `container_wikidata_qid`
+- `issns` (array)
+- `contrib_names`
+- `affiliations`
+- `creator_ids`
+
+TODO: should all external identifiers go under `releases` instead of `biblio`? Or some duplicated?
+
+## Fulltext
+
+- `status`: web, sim, shadow
+- `body`
+- `lang`
+- `file_mimetype`
+- `file_sha1`
+- `file_id`
+- `thumbnail_url`
+
+## Abstracts
+
+Nested object with:
+
+- body
+- lang
+
+For prototyping, perhaps just make it an object with `body` as an array.
+
+Only index one abstract per language.
+
+## SIM (Microfilm)
+
+Enough detail to construct a link or do a lookup. Note that we might be
+doing CDL status lookups on SERP pages.
+
+- `issue_item`: str
+- `pub_collection`: str
+- `sim_pubid`: str
+- `first_page`: str
+
+
+Also pass through archive.org metadata here (collection-level and item-level).
+
+## Access
+
+Start with obj, but maybe later nested?
+
+- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
+- `mimetype`
+- `access_url`
+- `file_url`
+- `file_id`
+- `release_id`
+
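
The outline above could be turned into a typeless Elasticsearch mapping roughly as follows. This is a minimal sketch, not the project's actual mapping: the proposal only states `keyword`, `obj`, `nested`, "timestamp", and `str` for some fields, so every other field type here (`text`, `date`, `integer`) is an assumption, most `biblio` fields are left out for brevity, and the names `WORK_MAPPING` and `one_abstract_per_lang` are hypothetical. The helper illustrates the "only index one abstract per language" rule by keeping the first abstract seen for each language; which abstract should win is not specified in the proposal.

```python
import json


def keyword():
    """Return a fresh keyword field definition."""
    return {"type": "keyword"}


def text():
    """Return a fresh full-text field definition (type is an assumption)."""
    return {"type": "text"}


# Typeless mapping: "properties" sits directly under "mappings",
# which is the shape used when include_type_name=false.
WORK_MAPPING = {
    "mappings": {
        "properties": {
            "key": keyword(),
            "collapse_key": keyword(),
            "doc_type": keyword(),
            "doc_index_ts": {"type": "date"},  # assumed type for "timestamp"
            "work_ident": keyword(),
            "tags": keyword(),  # arrays need no special mapping type
            "biblio": {
                "type": "object",
                "properties": {
                    # Only a few of the listed fields; types are assumptions.
                    "release_id": keyword(),
                    "title": text(),
                    "release_year": {"type": "integer"},
                    "preservation_status": keyword(),
                    "container_id": keyword(),
                },
            },
            "fulltext": {
                "type": "object",
                "properties": {
                    "status": keyword(),
                    "body": text(),
                    "lang": keyword(),
                    "file_mimetype": keyword(),
                    "file_sha1": keyword(),
                    "file_id": keyword(),
                    "thumbnail_url": keyword(),
                },
            },
            "ia_sim": {
                "type": "object",
                "properties": {
                    "issue_item": keyword(),
                    "pub_collection": keyword(),
                    "sim_pubid": keyword(),
                    "first_page": keyword(),
                },
            },
            "abstracts": {
                "type": "nested",
                "properties": {"body": text(), "lang": keyword()},
            },
            "releases": {"type": "nested"},  # contents still TBD in the proposal
            "access": {
                "type": "object",
                "properties": {
                    "status": keyword(),
                    "mimetype": keyword(),
                    "access_url": keyword(),
                    "file_url": keyword(),
                    "file_id": keyword(),
                    "release_id": keyword(),
                },
            },
        }
    }
}


def one_abstract_per_lang(abstracts):
    """Keep the first abstract seen for each language, preserving order."""
    by_lang = {}
    for abstract in abstracts:
        by_lang.setdefault(abstract.get("lang"), abstract)
    return list(by_lang.values())


if __name__ == "__main__":
    # Print the request body that would be sent when creating the index.
    print(json.dumps(WORK_MAPPING, indent=2))
```

Keeping `access` as a plain object matches the "start with obj" note; switching it to `nested` later would only change its `type` value in this sketch, though queries against it would then need to be nested queries.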