author | Bryan Newbold <bnewbold@archive.org> | 2020-05-11 19:12:13 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-05-11 19:12:13 -0700 |
commit | f5a883642dd114ac2c29c72348bed05616189aa2 (patch) | |
tree | a6952af6c83529f563c34197fb269f55615e01f7 /proposals/work_schema.md | |
parent | b5a8d71d6ca1f54c4ba0e558d021e347ec634319 (diff) | |
start sketching proposals
Diffstat (limited to 'proposals/work_schema.md')
-rw-r--r-- | proposals/work_schema.md | 96 |
1 files changed, 96 insertions, 0 deletions
diff --git a/proposals/work_schema.md b/proposals/work_schema.md
new file mode 100644
index 0000000..1e0f272
--- /dev/null
+++ b/proposals/work_schema.md
@@ -0,0 +1,96 @@
+
+## Top-Level
+
+- type: _doc
+- key: keyword
+- key_type: keyword (work or page)
+- `work_id`
+- biblio: obj
+- fulltext: obj
+- sim: obj
+- abstracts: nested
+    body
+    lang
+- releases: nested (TBD)
+- access
+- tags: array of keywords
+
+TODO:
+- summary fields to index "everything" into?
+
+## Biblio
+
+Mostly matches existing `fatcat_release` schema.
+
+- `release_id`
+- `release_revision`
+- `title`
+- `subtitle`
+- `original_title`
+- `release_date`
+- `release_year`
+- `withdrawn_status`
+- `language`
+- `country_code`
+- `volume` (etc)
+- `volume_int` (etc)
+- `first_page`
+- `first_page_int`
+- `pages`
+- `doi` etc
+- `number` (etc)
+
+NEW:
+- `preservation_status`
+
+[etc]
+
+- `license_slug`
+- `publisher` (etc)
+- `container_name` (etc)
+- `container_id`
+- `container_issnl`
+- `container_issn` (array)
+- `contrib_names`
+- `affiliations`
+- `creator_ids`
+
+## Fulltext
+
+- `status`: web, sim, shadow
+- `body`
+- `lang`
+- `file_mimetype`
+- `file_sha1`
+- `file_id`
+- `thumbnail_url`
+
+## Abstracts
+
+Nested object with:
+
+- body
+- lang
+
+For prototyping, perhaps just make it an object with `body` as an array.
+
+Only index one abstract per language.
+
+## SIM (Microfilm)
+
+Enough details to construct a link or do a lookup or whatever. Note that might
+be doing CDL status lookups on SERP pages.
+
+Also pass-through archive.org metadata here (collection-level and item-level)
+
+## Access
+
+Start with obj, but maybe later nested?
+
+- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
+- `mimetype`
+- `access_url`
+- `file_url`
+- `file_id`
+- `release_id`
+
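The Abstracts section of the proposed file says to index only one abstract per language, and suggests a prototype shape of a plain object with `body` as an array. A minimal Python sketch of that rule might look like the following. The function names, the "first abstract wins" tie-break, the skipping of empty bodies, and the parallel `lang` array are all assumptions made for illustration; the proposal does not specify any of them.

```python
from typing import Dict, List, Optional

Abstract = Dict[str, Optional[str]]


def one_abstract_per_lang(abstracts: List[Abstract]) -> List[Abstract]:
    """Keep at most one abstract per language code.

    Assumption: the first abstract seen for a language wins. A missing
    `lang` (None) is treated as its own "unknown" bucket.
    """
    seen = set()
    kept: List[Abstract] = []
    for abstract in abstracts:
        body = abstract.get("body")
        if not body:
            continue  # assumption: empty abstracts are not worth indexing
        lang = abstract.get("lang")
        if lang in seen:
            continue  # already have an abstract in this language
        seen.add(lang)
        kept.append({"body": body, "lang": lang})
    return kept


def prototype_abstracts_obj(abstracts: List[Abstract]) -> Dict[str, list]:
    """Flatten to the prototype shape: an object with `body` as an array.

    The parallel `lang` array is a hypothetical addition so the language
    of each body is not lost.
    """
    kept = one_abstract_per_lang(abstracts)
    return {
        "body": [a["body"] for a in kept],
        "lang": [a["lang"] for a in kept],
    }


raw = [
    {"body": "Hello", "lang": "en"},
    {"body": "Bonjour", "lang": "fr"},
    {"body": "Hi again", "lang": "en"},  # dropped: second English abstract
    {"body": "", "lang": "de"},          # dropped: empty body
]
deduped = one_abstract_per_lang(raw)
flattened = prototype_abstracts_obj(raw)
```

Deduplicating before indexing keeps the choice between a nested mapping and the flat prototype object open, since both can be built from the same filtered list.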