Diffstat (limited to 'proposals/2020-06-04_work_schema.md')
-rw-r--r-- | proposals/2020-06-04_work_schema.md | 108 |
1 files changed, 108 insertions, 0 deletions
diff --git a/proposals/2020-06-04_work_schema.md b/proposals/2020-06-04_work_schema.md
new file mode 100644
index 0000000..97d60ac
--- /dev/null
+++ b/proposals/2020-06-04_work_schema.md
@@ -0,0 +1,108 @@
+
+## Top-Level
+
+- type: `_doc` (aka, no type, `include_type_name=false`)
+- key: keyword (same as `_id`)
+- `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits)
+- `doc_type`: keyword (work or page)
+- `doc_index_ts`: timestamp when document indexed
+- `work_ident`: fatcat work ident (optional)
+
+- `biblio`: obj
+- `fulltext`: obj
+- `ia_sim`: obj
+- `abstracts`: nested
+    body
+    lang
+- `releases`: nested (TBD)
+- `access`
+- `tags`: array of keywords
+
+TODO:
+- summary fields to index "everything" into?
+
+## Biblio
+
+Mostly matches existing `fatcat_release` schema.
+
+- `release_id`
+- `release_revision`
+- `title`
+- `subtitle`
+- `original_title`
+- `release_date`
+- `release_year`
+- `withdrawn_status`
+- `language`
+- `country_code`
+- `volume` (etc)
+- `volume_int` (etc)
+- `first_page`
+- `first_page_int`
+- `pages`
+- `doi` etc
+- `number` (etc)
+
+NEW:
+- `preservation_status`
+
+[etc]
+
+- `license_slug`
+- `publisher` (etc)
+- `container_name` (etc)
+- `container_id`
+- `container_issnl`
+- `container_wikidata_qid`
+- `issns` (array)
+- `contrib_names`
+- `affiliations`
+- `creator_ids`
+
+TODO: should all external identifiers go under `releases` instead of `biblio`? Or some duplicated?
+
+## Fulltext
+
+- `status`: web, sim, shadow
+- `body`
+- `lang`
+- `file_mimetype`
+- `file_sha1`
+- `file_id`
+- `thumbnail_url`
+
+## Abstracts
+
+Nested object with:
+
+- body
+- lang
+
+For prototyping, perhaps just make it an object with `body` as an array.
+
+Only index one abstract per language.
+
+## SIM (Microfilm)
+
+Enough details to construct a link or do a lookup or whatever. Note that might
+be doing CDL status lookups on SERP pages.
+
+- `issue_item`: str
+- `pub_collection`: str
+- `sim_pubid`: str
+- `first_page`: str
+
+
+Also pass-through archive.org metadata here (collection-level and item-level)
+
+## Access
+
+Start with obj, but maybe later nested?
+
+- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
+- `mimetype`
+- `access_url`
+- `file_url`
+- `file_id`
+- `release_id`
+
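For orientation, the layout proposed above could be written as a typeless Elasticsearch mapping (the `include_type_name=false` style) roughly as follows. This is a minimal Python sketch and not the project's actual mapping. Only a subset of fields is shown. The names `WORK_MAPPING` and `collapsed_search_body` are hypothetical, and every field type the proposal does not state is an assumption: `date` for `doc_index_ts`, `text` for titles and bodies, `integer` for the `*_int` and year fields. The `match` query on `biblio.title` is likewise only an illustration.

```python
# Hypothetical sketch of the proposed work schema as an Elasticsearch mapping.
# Field types not given in the proposal are assumptions, marked below.

KEYWORD = {"type": "keyword"}
TEXT = {"type": "text"}        # assumption: analyzed text for searchable strings
INTEGER = {"type": "integer"}  # assumption for *_int and year fields

WORK_MAPPING = {
    "mappings": {
        # Typeless mapping: "properties" sits directly under "mappings".
        "properties": {
            "key": KEYWORD,                    # same value as _id
            "collapse_key": KEYWORD,           # work ident or SIM issue item
            "doc_type": KEYWORD,               # "work" or "page"
            "doc_index_ts": {"type": "date"},  # assumption: timestamp stored as date
            "work_ident": KEYWORD,             # optional
            "tags": KEYWORD,                   # any field may hold an array of values
            "biblio": {                        # plain object (subset of fields)
                "properties": {
                    "release_id": KEYWORD,
                    "title": TEXT,
                    "release_year": INTEGER,
                    "volume": KEYWORD,
                    "volume_int": INTEGER,
                    "first_page_int": INTEGER,
                    "preservation_status": KEYWORD,
                    "issns": KEYWORD,
                }
            },
            "fulltext": {
                "properties": {
                    "status": KEYWORD,         # web, sim, shadow
                    "body": TEXT,
                    "lang": KEYWORD,
                    "file_sha1": KEYWORD,
                }
            },
            "ia_sim": {
                "properties": {
                    "issue_item": KEYWORD,
                    "pub_collection": KEYWORD,
                    "sim_pubid": KEYWORD,
                    "first_page": KEYWORD,
                }
            },
            "abstracts": {                     # nested, one entry per language
                "type": "nested",
                "properties": {"body": TEXT, "lang": KEYWORD},
            },
            "access": {                        # object for now, maybe nested later
                "properties": {
                    "status": KEYWORD,
                    "access_url": KEYWORD,
                    "file_id": KEYWORD,
                    "release_id": KEYWORD,
                }
            },
        }
    }
}


def collapsed_search_body(query: str) -> dict:
    """Build a search body that groups hits by collapse_key.

    Uses Elasticsearch field collapsing; the match query on
    biblio.title is only a placeholder for the real query.
    """
    return {
        "query": {"match": {"biblio.title": query}},
        "collapse": {"field": "collapse_key"},
    }
```

Keeping `collapse_key` as a `keyword` matters here because field collapsing needs a keyword or numeric field with doc values. The dict returned by `collapsed_search_body("microfilm")` could be passed as the body of a search request.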