## Top-Level - type: `_doc` (aka, no type, `include_type_name=false`) - key: keyword (same as `_id`) - `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits) - `doc_type`: keyword (work or page) - `doc_index_ts`: timestamp when document indexed - `work_ident`: fatcat work ident (optional) - `biblio`: obj - `fulltext`: obj - `ia_sim`: obj - `abstracts`: nested body lang - `releases`: nested (TBD) - `access` - `tags`: array of keywords TODO: - summary fields to index "everything" into? ## Biblio Mostly matches existing `fatcat_release` schema. - `release_id` - `release_revision` - `title` - `subtitle` - `original_title` - `release_date` - `release_year` - `withdrawn_status` - `language` - `country_code` - `volume` (etc) - `volume_int` (etc) - `first_page` - `first_page_int` - `pages` - `doi` etc - `number` (etc) NEW: - `preservation_status` [etc] - `license_slug` - `publisher` (etc) - `container_name` (etc) - `container_id` - `container_issnl` - `container_wikidata_qid` - `issns` (array) - `contrib_names` - `affiliations` - `creator_ids` TODO: should all external identifiers go under `releases` instead of `biblio`? Or some duplicated? ## Fulltext - `status`: web, sim, shadow - `body` - `lang` - `file_mimetype` - `file_sha1` - `file_id` - `thumbnail_url` ## Abstracts Nested object with: - body - lang For prototyping, perhaps just make it an object with `body` as an array. Only index one abstract per language. ## SIM (Microfilm) Enough details to construct a link or do a lookup or whatever. Note that might be doing CDL status lookups on SERP pages. - `issue_item`: str - `pub_collection`: str - `sim_pubid`: str - `first_page`: str Also pass-through archive.org metadata here (collection-level and item-level) ## Access Start with obj, but maybe later nested? - `status`: direct, cdl, repository, publisher, loginwall, paywall, etc - `mimetype` - `access_url` - `file_url` - `file_id` - `release_id`