From 5defd444135bc4adb0748b0d2b8c9b88708bdc1a Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Tue, 23 Mar 2021 21:42:32 -0700 Subject: proposals: add 2021 UI updates, and rename all to have a date in filename --- .../2020-05-11_microfilm_indexing_pipeline.md | 30 ++++++ proposals/2020-05-11_overview.md | 38 ++++++++ proposals/2020-05-11_web_interface.md | 69 +++++++++++++ proposals/2020-05-16_fatcat_indexing_pipeline.md | 54 +++++++++++ proposals/2020-06-04_work_schema.md | 108 +++++++++++++++++++++ proposals/2020-10-20_kafka_update_pipeline.md | 63 ++++++++++++ proposals/2021-01-18_crude_query_parse.md | 18 ++++ proposals/2021-02-15_ui_updates.md | 53 ++++++++++ proposals/2021_crude_query_parse.md | 18 ---- proposals/fatcat_indexing_pipeline.md | 54 ----------- proposals/kafka_update_pipeline.md | 63 ------------ proposals/microfilm_indexing_pipeline.md | 30 ------ proposals/overview.md | 38 -------- proposals/web_interface.md | 69 ------------- proposals/work_schema.md | 108 --------------------- 15 files changed, 433 insertions(+), 380 deletions(-) create mode 100644 proposals/2020-05-11_microfilm_indexing_pipeline.md create mode 100644 proposals/2020-05-11_overview.md create mode 100644 proposals/2020-05-11_web_interface.md create mode 100644 proposals/2020-05-16_fatcat_indexing_pipeline.md create mode 100644 proposals/2020-06-04_work_schema.md create mode 100644 proposals/2020-10-20_kafka_update_pipeline.md create mode 100644 proposals/2021-01-18_crude_query_parse.md create mode 100644 proposals/2021-02-15_ui_updates.md delete mode 100644 proposals/2021_crude_query_parse.md delete mode 100644 proposals/fatcat_indexing_pipeline.md delete mode 100644 proposals/kafka_update_pipeline.md delete mode 100644 proposals/microfilm_indexing_pipeline.md delete mode 100644 proposals/overview.md delete mode 100644 proposals/web_interface.md delete mode 100644 proposals/work_schema.md diff --git a/proposals/2020-05-11_microfilm_indexing_pipeline.md b/proposals/2020-05-11_microfilm_indexing_pipeline.md new file mode 100644 index 0000000..657aae2 --- /dev/null +++ b/proposals/2020-05-11_microfilm_indexing_pipeline.md @@ -0,0 +1,30 @@ + +## High-Level + +- operate on an entire item +- check against issue DB and/or fatcat search + => if there is fatcat work-level metadata for this issue, skip +- fetch collection-level (journal) metadata +- iterate through djvu text file: + => convert to simple text + => filter out non-research pages using quick heuristics + => try looking up "real" page number from OCR work (in item metadata) +- generate "heavy" intermediate schema (per valid page): + => fatcat container metadata + => ia collection (journal) metadata + => item metadata + => page fulltext and any metadata + +- transform "heavy" intermediates to ES schema + +## Implementation + +Existing tools and libraries: + +- internetarchive python tool to fetch files and item metadata +- fatcat API client for container metadata lookup + +New tools or libraries needed: + +- issue DB or use fatcat search index to count releases by volume/issue +- djvu XML parser diff --git a/proposals/2020-05-11_overview.md b/proposals/2020-05-11_overview.md new file mode 100644 index 0000000..fa8148c --- /dev/null +++ b/proposals/2020-05-11_overview.md @@ -0,0 +1,38 @@ + + +Can be multiple releases for each work: + +- required: most canonical published version ("version of record", what would be cited) + => or, most updated? 
+- optional: mostly openly accessible version
+- optional: updated version
+    => errata, corrected version, or retraction
+- optional: fulltext indexed version
+    => might not be the most updated, or not accessible
+
+
+## Initial Plan
+
+Index all fatcat works in catalog.
+
+Always link to a born-digital copy if one is accessible.
+
+Always link to a SIM microfilm copy if one is available.
+
+Use best available fulltext for search. If structured, like TEI-XML, index the
+body text separately from abstracts and references.
+
+
+## Other Ideas
+
+Do fulltext indexing at the granularity of pages, or some other segments of
+text within articles (paragraphs, chapters, sections).
+
+Fatcat already has all of Crossref, PubMed, arXiv, and several other
+authoritative metadata sources. But today we are missing a good chunk of
+content, particularly from institutional repositories and CS conferences (which
+don't use identifiers). We also don't have good affiliation or citation count
+coverage, and abstract coverage is mixed/poor.
+
+Could use Microsoft Academic Graph (MAG) metadata corpus (or similar) to
+bootstrap with better metadata coverage.
diff --git a/proposals/2020-05-11_web_interface.md b/proposals/2020-05-11_web_interface.md
new file mode 100644
index 0000000..416e6fc
--- /dev/null
+++ b/proposals/2020-05-11_web_interface.md
@@ -0,0 +1,69 @@
+
+Single domain (TBD, but eg scholar.archive.org) will host a web
+search interface. May also expose APIs on this host, or might use a separate
+host for that.
+
+Content would not be hosted on this domain; all fulltext copies would be linked
+to elsewhere.
+
+Style (eg, colors, font?) would be similar to fatcat.wiki, but may or
+may not have regular top bar (archive.org has this). There would
+be no "write" or "modify" features on this site at all: users would not need to
+log in. Metadata updates and features would all redirect to archive.org or
+fatcat.wiki.
+
+
+## Design and Features
+
+Will try to hew closely to PubMed in style, layout, and features.
+
+Only a single search interface (no separate "advanced" page). Custom query
+parser.
+
+Filtering and sorting via controls under the search box. A button opens a box
+with more settings. If these are persisted at all, only via cookies or local
+storage.
+
+## URL Structure
+
+All pages can be prefixed with a two-character language specifier. Default
+(with no prefix) is English.
+
+`/`: homepage, single-sentence, large search box, quick stats and info
+
+`/about`: about
+
+`/help`: FAQ?
+
+`/help/search`: advanced query tips
+
+`/search`: query and results page
+
+
+## More Ideas
+
+Things we *could* do, but maybe *shouldn't*:
+
+- journal-level metadata and summary. Could just link to fatcat.
+
+
+## APIs
+
+Might also expose public APIs on that domain:
+
+- search
+- citation matching
+- save-paper-now
+
+
+## Implementation
+
+For first iteration, going to use:
+
+- python3.7
+- elasticsearch-dsl from Python and page-load-per-query (not single-page-app)
+- fastapi (web framework)
+- jinja2 (HTML templating)
+- babel (i18n)
+- semantic-ui (CSS)
+- minimal or no javascript
diff --git a/proposals/2020-05-16_fatcat_indexing_pipeline.md b/proposals/2020-05-16_fatcat_indexing_pipeline.md
new file mode 100644
index 0000000..deafb65
--- /dev/null
+++ b/proposals/2020-05-16_fatcat_indexing_pipeline.md
@@ -0,0 +1,54 @@
+
+## High-Level
+
+Work-oriented: base input is arrays of expanded releases, all from the same
+work.
+
+Re-index pipeline would look at the fatcat changelog or existing release feed,
+and use the `work_id` to fetch all other releases.
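+
+As a rough sketch, the re-index fetch step could look like the following
+(assuming `fatcat-openapi-client`; the `get_work_releases` method and expand
+parameters here are assumptions, not confirmed names):
+
+```python
+from fatcat_openapi_client import DefaultApi
+
+def fetch_expanded_work(api: DefaultApi, work_id: str) -> dict:
+    """Gather all releases under one work, expanded with file and
+    container metadata, as input to the work-level indexing pipeline."""
+    stubs = api.get_work_releases(work_id)  # assumed endpoint
+    releases = [
+        api.get_release(stub.ident, expand="files,container")
+        for stub in stubs
+    ]
+    return {"work_id": work_id, "releases": releases}
+```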
+
+Batch indexing pipeline would use a new variant of `fatcat-export` which emits
+expanded releases (one per line), grouped (or sorted) by work id.
+
+Then, the pipeline looks like:
+
+- choose canonical release
+- choose best access
+- choose best fulltext file
+    => iterate releases and files
+    => soft prefer canonical release, file access, release_date, etc
+    => check via postgrest query that fulltext is available
+    => fetch raw fulltext
+- check if we expect a SIM copy to exist
+    => eg, using an issue db?
+    => if so, fetch petabox metadata and try to confirm, so we can create a URL
+    => if we don't have another fulltext source (?):
+    => fetch djvu file and extract the pages in question (or just 1 if unsure?)
+- output "heavy" object
+
+Next step is:
+
+- summarize biblio metadata
+- select one abstract per language
+- sanitize abstracts and fulltext content for indexing
+- compute counts, epistemological quality, etc
+
+The output of that goes to Kafka for indexing into ES.
+
+This indexing process is probably going to be both CPU and network intensive.
+In Python we will want multiprocessing, and maybe also async?
+
+## Implementation
+
+Existing tools/libraries:
+
+- fatcat-openapi-client
+- postgrest client
+- S3/minio/seaweed client
+- ftfy
+- language detection
+
+New tools needed (eventually):
+
+- strip latex
+- strip JATS or HTML
diff --git a/proposals/2020-06-04_work_schema.md b/proposals/2020-06-04_work_schema.md
new file mode 100644
index 0000000..97d60ac
--- /dev/null
+++ b/proposals/2020-06-04_work_schema.md
@@ -0,0 +1,108 @@
+
+## Top-Level
+
+- type: `_doc` (aka, no type, `include_type_name=false`)
+- key: keyword (same as `_id`)
+- `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits)
+- `doc_type`: keyword (work or page)
+- `doc_index_ts`: timestamp when document was indexed
+- `work_ident`: fatcat work ident (optional)
+
+- `biblio`: obj
+- `fulltext`: obj
+- `ia_sim`: obj
+- `abstracts`: nested
+    body
+    lang
+- `releases`: nested (TBD)
+- `access`
+- `tags`: array of keywords
+
+TODO:
+- summary fields to index "everything" into?
+
+## Biblio
+
+Mostly matches existing `fatcat_release` schema.
+
+- `release_id`
+- `release_revision`
+- `title`
+- `subtitle`
+- `original_title`
+- `release_date`
+- `release_year`
+- `withdrawn_status`
+- `language`
+- `country_code`
+- `volume` (etc)
+- `volume_int` (etc)
+- `first_page`
+- `first_page_int`
+- `pages`
+- `doi` (etc)
+- `number` (etc)
+
+NEW:
+- `preservation_status`
+
+[etc]
+
+- `license_slug`
+- `publisher` (etc)
+- `container_name` (etc)
+- `container_id`
+- `container_issnl`
+- `container_wikidata_qid`
+- `issns` (array)
+- `contrib_names`
+- `affiliations`
+- `creator_ids`
+
+TODO: should all external identifiers go under `releases` instead of `biblio`? Or should some be duplicated?
+
+## Fulltext
+
+- `status`: web, sim, shadow
+- `body`
+- `lang`
+- `file_mimetype`
+- `file_sha1`
+- `file_id`
+- `thumbnail_url`
+
+## Abstracts
+
+Nested object with:
+
+- body
+- lang
+
+For prototyping, perhaps just make it an object with `body` as an array.
+
+Only index one abstract per language.
+
+## SIM (Microfilm)
+
+Enough details to construct a link or do a lookup or whatever. Note that we
+might be doing CDL status lookups on SERP pages.
+
+- `issue_item`: str
+- `pub_collection`: str
+- `sim_pubid`: str
+- `first_page`: str
+
+
+Also pass through archive.org metadata here (collection-level and item-level).
+
+## Access
+
+Start with obj, but maybe later nested?
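+
+For illustration, a minimal `elasticsearch-dsl` sketch of this object (class
+and field names here are assumptions; the fields mirror the list below, and
+`Object` vs `Nested` is the open question above):
+
+```python
+from elasticsearch_dsl import Document, InnerDoc, Keyword, Object
+
+class Access(InnerDoc):
+    # status: direct, cdl, repository, publisher, loginwall, paywall, etc
+    status = Keyword()
+    mimetype = Keyword()
+    access_url = Keyword()
+    file_url = Keyword()
+    file_id = Keyword()
+    release_id = Keyword()
+
+class ScholarFulltextDoc(Document):
+    # start with a plain object; switch to Nested(Access) if we need
+    # per-copy queries across multiple access options
+    access = Object(Access)
+```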
+
+- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
+- `mimetype`
+- `access_url`
+- `file_url`
+- `file_id`
+- `release_id`
+
diff --git a/proposals/2020-10-20_kafka_update_pipeline.md b/proposals/2020-10-20_kafka_update_pipeline.md
new file mode 100644
index 0000000..597a1b0
--- /dev/null
+++ b/proposals/2020-10-20_kafka_update_pipeline.md
@@ -0,0 +1,63 @@
+
+Want to receive a continual stream of updates from both fatcat and SIM
+scanning; index the updated content; and push into elasticsearch.
+
+
+## Filtering and Affordances
+
+The `updated` and `fetched` timestamps are not immediately necessary or
+implemented, but they can be used to filter updates. For example, after
+re-loading from a bulk entity dump, we could "roll back" the update pipeline
+to only fatcat (work) updates after the changelog index that the bulk dump is
+stamped with.
+
+At least in theory, the `fetched` timestamp could be used to prevent re-updates
+of existing documents in the ES index.
+
+The `doc_index_ts` timestamp in the ES index could be used in a future
+fetch-and-reindex worker to select documents for re-indexing, or to delete
+old/stale documents (eg, after SIM issue re-indexing if there were spurious
+"page" type documents remaining).
+
+## Message Types
+
+Scholar Update Request JSON
+- `key`: str
+- `type`: str
+    - `fatcat_work`
+    - `sim_issue`
+- `updated`: datetime, UTC, of the event resulting in this request
+- `work_ident`: str (works)
+- `fatcat_changelog`: int (works)
+- `sim_item`: str (items)
+
+"Heavy Intermediate" JSON (existing schema)
+- key
+- `fetched`: Optional[datetime], UTC, when this doc was collected
+
+Scholar Fulltext ES JSON (existing schema)
+
+
+## Kafka Topics
+
+fatcat-ENV.work-ident-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.sim-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.update-docs
+    12x, short retention (2 months?)
+    key: doc ident
+
+## Workers
+
+scholar-fetch-docs-worker
+    consumes fatcat and/or sim update requests, individually
+    constructs heavy intermediate
+    publishes to update-docs topic
+
+scholar-index-docs-worker
+    consumes updated "heavy intermediate" documents, in batches
+    transforms to elasticsearch schema
+    updates elasticsearch
diff --git a/proposals/2021-01-18_crude_query_parse.md b/proposals/2021-01-18_crude_query_parse.md
new file mode 100644
index 0000000..2a7663b
--- /dev/null
+++ b/proposals/2021-01-18_crude_query_parse.md
@@ -0,0 +1,18 @@
+
+
+Thinking of simple ways to reduce query parse errors and handle more queries as
+expected. In particular:
+
+- handle slashes in query tokens (eg, "N/A" without quotes)
+- handle semicolons in queries, when they are not intended as filters
+- if a query "looks like" a raw citation string, detect that and do citation
+  parsing into a structured format, then do a query or fuzzy lookup from there
+
+
+## Questions/Thoughts
+
+Should we detect title lookups in addition to full citation lookups? Probably
+too complicated.
+
+Do we have a static list of colon-prefixes, or load from the schema mapping
+file itself?
diff --git a/proposals/2021-02-15_ui_updates.md b/proposals/2021-02-15_ui_updates.md
new file mode 100644
index 0000000..72e4743
--- /dev/null
+++ b/proposals/2021-02-15_ui_updates.md
@@ -0,0 +1,53 @@
+
+status: partially-implemented
+
+This documents a series of changes made in early 2021, before launch.
+
+## Default URLs and Access (done)
+
+Replace the current access link under the thumbnail with a box that can expand
+to show more access options: domain, rel, filetype, release (version), maybe
+wayback date.
+
+Labels over the thumbnail should show type (PDF, HTML), and maybe release stage
+(if different from primary release).
+
+"Blue Links" for each hit should change, eg:
+
+- if arxiv, arxiv.org
+- elif PMID or PMCID, PubMed
+- elif DOI, publisher (or whatever; follow the DOI)
+- elif microfilm, go to access
+- else fatcat landing page
+
+What about: JSTOR, DOAJ
+
+
+## Version Display (done)
+
+Instead of showing a grid, could keep style similar to what already exists: the
+single line of year/venue/status, then a line of identifiers in green (done)
+
+
+## Query Behaviors
+
+- "fail less": re-write more queries, potentially after ES has already returned a failure (done)
+- change the default of only showing fulltext hits?
+
+
+## Tooltips/Extras (done)
+
+- show date on mouse-over of the year field
+- link the container name to the fatcat container page
+
+
+## Clickable Queries
+
+Allow search filters by clicking on: author, year, container
+
+Filters should simply be added to the current query string. Not sure how to
+implement.
+
+
+## Responsive Design (done)
+
+There is a window width (tablet?) where we keep a fixed column width with
+margins, which results in small thumbnails.
diff --git a/proposals/2021_crude_query_parse.md b/proposals/2021_crude_query_parse.md
deleted file mode 100644
index 2a7663b..0000000
--- a/proposals/2021_crude_query_parse.md
+++ /dev/null
@@ -1,18 +0,0 @@
-
-
-Thinking of simple ways to reduce query parse errors and handle more queries as
-expected. In particular:
-
-- handle slashes in query tokens (eg, "N/A" without quotes)
-- handle semi-colons in queries, when they are not intended as filters
-- if query "looks like" a raw citation string, detect that and do citation
-  parsing in to a structured format, then do a query or fuzzy lookup from there
-
-
-## Questions/Thoughts
-
-Should we detect title lookups in addition to full citation lookups? Probably
-too complicated.
-
-Do we have a static list of colon-prefixes, or load from the schema mapping
-file itself?
diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
deleted file mode 100644
index deafb65..0000000
--- a/proposals/fatcat_indexing_pipeline.md
+++ /dev/null
@@ -1,54 +0,0 @@
-
-## High-Level
-
-Work-oriented: base input is arrays of expanded releases, all from the same
-work.
-
-Re-index pipeline would look at fatcat changelog or existing release feed, and
-use the `work_id` to fetch all other releases.
-
-Batch indexing pipeline would use a new variant of `fatcat-export` which is
-expanded releases (one-per-line), grouped (or sorted) by work id.
-
-Then, pipeline looks like:
-
-- choose canonical release
-- choose best access
-- choose best fulltext file
-    => iterate releases and files
-    => soft prefer canonical release, file access, release_date, etc
-    => check via postgrest query that fulltext is available
-    => fetch raw fulltext
-- check if we expect a SIM copy to exist
-    => eg, using an issue db?
-    => if so, fetch petabox metadata and try to confirm, so we can create a URL
-    => if we don't have another fulltext source (?):
-    => fetch djvu file and extract the pages in question (or just 1 if unsure?)
-- output "heavy" object - -Next step is: - -- summarize biblio metadata -- select one abstract per language -- sanitize abstracts and fulltext content for indexing -- compute counts, epistimological quality, etc - -The output of that goes to Kafka for indexing into ES. - -This indexing process is probably going to be both CPU and network intensive. -In python will want multiprocessing and maybe also async? - -## Implementation - -Existing tools/libraries: - -- fatcat-openapi-client -- postgrest client -- S3/minio/seaweed client -- ftfy -- language detection - -New needed (eventually): - -- strip latex -- strip JATS or HTML diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md deleted file mode 100644 index 597a1b0..0000000 --- a/proposals/kafka_update_pipeline.md +++ /dev/null @@ -1,63 +0,0 @@ - -Want to receive a continual stream of updates from both fatcat and SIM -scanning; index the updated content; and push into elasticsearch. - - -## Filtering and Affordances - -The `updated` and `fetched` timestamps are not immediately necessary or -implemented, but they can be used to filter updates. For example, after -re-loading from a build entity dump, could "roll back" update pipeline to only -fatcat (work) updates after the changelog index that the bulk dump is stamped -with. - -At least in theory, the `fetched` timestamp could be used to prevent re-updates -of existing documents in the ES index. - -The `doc_index_ts` timestamp in the ES index could be used in a future -fetch-and-reindex worker to select documents for re-indexing, or to delete -old/stale documents (eg, after SIM issue re-indexing if there were spurious -"page" type documents remaining). - -## Message Types - -Scholar Update Request JSON -- `key`: str -- `type`: str - - `fatcat_work` - - `sim_issue` -- `updated`: datetime, UTC, of event resulting in this request -- `work_ident`: str (works) -- `fatcat_changelog`: int (works) -- `sim_item`: str (items) - -"Heavy Intermediate" JSON (existing schema) -- key -- `fetched`: Optional[datetime], UTC, when this doc was collected - -Scholar Fulltext ES JSON (existing schema) - - -## Kafka Topics - -fatcat-ENV.work-ident-updates - 6x, long retention, key compaction - key: doc ident -scholar-ENV.sim-updates - 6x, long retention, key compaction - key: doc ident -scholar-ENV.update-docs - 12x, short retention (2 months?) 
- key: doc ident - -## Workers - -scholar-fetch-docs-worker - consumes fatcat and/or sim update requests, individually - constructs heavy intermediate - publishes to update-docs topic - -scholar-index-docs-worker - consumes updated "heavy intermediate" documents, in batches - transforms to elasticsearch schema - updates elasticsearch diff --git a/proposals/microfilm_indexing_pipeline.md b/proposals/microfilm_indexing_pipeline.md deleted file mode 100644 index 657aae2..0000000 --- a/proposals/microfilm_indexing_pipeline.md +++ /dev/null @@ -1,30 +0,0 @@ - -## High-Level - -- operate on an entire item -- check against issue DB and/or fatcat search - => if there is fatcat work-level metadata for this issue, skip -- fetch collection-level (journal) metadata -- iterate through djvu text file: - => convert to simple text - => filter out non-research pages using quick heuristics - => try looking up "real" page number from OCR work (in item metadata) -- generate "heavy" intermediate schema (per valid page): - => fatcat container metadata - => ia collection (journal) metadata - => item metadata - => page fulltext and any metadata - -- transform "heavy" intermediates to ES schema - -## Implementation - -Existing tools and libraries: - -- internetarchive python tool to fetch files and item metadata -- fatcat API client for container metadata lookup - -New tools or libraries needed: - -- issue DB or use fatcat search index to count releases by volume/issue -- djvu XML parser diff --git a/proposals/overview.md b/proposals/overview.md deleted file mode 100644 index fa8148c..0000000 --- a/proposals/overview.md +++ /dev/null @@ -1,38 +0,0 @@ - - -Can be multiple releases for each work: - -- required: most canonical published version ("version of record", what would be cited) - => or, most updated? -- optional: mostly openly accessible version -- optional: updated version - => errata, corrected version, or retraction -- optional: fulltext indexed version - => might be not the most updated, or no accessible - - -## Initial Plan - -Index all fatcat works in catalog. - -Always link to a born-digital copy if one is accessible. - -Always link to a SIM microfilm copy if one is available. - -Use best available fulltext for search. If structured, like TEI-XML, index the -body text separate from abstracts and references. - - -## Other Ideas - -Do fulltext indexing at the granularity of pages, or some other segments of -text within articles (paragraphs, chapters, sections). - -Fatcat already has all of Crossref, Pubmed, Arxiv, and several other -authoritative metadata sources. But today we are missing a good chunk of -content, particularly from institutional repositories and CS conferences (which -don't use identifiers). Also don't have good affiliation or citation count -coverage, and mixed/poor abstract coverage. - -Could use Microsoft Academic Graph (MAG) metadata corpus (or similar) to -bootstrap with better metadata coverage. diff --git a/proposals/web_interface.md b/proposals/web_interface.md deleted file mode 100644 index 416e6fc..0000000 --- a/proposals/web_interface.md +++ /dev/null @@ -1,69 +0,0 @@ - -Single domain (TBD, but eg ) will host a web -search interface. May also expose APIs on this host, or might use a separate -host for that. - -Content would not be hosted on this domain; all fulltext copies would be linked -to elsewhere. - -Style (eg, colors, font?) would be similar to , but may or -may not have regular top bar ( has this). 
There would -be no "write" or "modify" features on this site at all: users would not need to -log in. Metadata updates and features would all redirect to archive.org or -fatcat.wiki. - - -## Design and Features - -Will try to hew most closely to Pubmed in style, layout, and features. - -Only a single search interface (no separate "advanced" page). Custom query -parser. - -Filtering and sort via controls under search box. A button opens a box with -more settings. If these are persisted at all, only via cookies or local -storage. - -## URL Structure - -All pages can be prefixed with a two-character language specifier. Default -(with no prefix) is english. - -`/`: homepage, single-sentance, large search box, quick stats and info - -`/about`: about - -`/help`: FAQ? - -`/help/search`: advanced query tips - -`/search`: query and results page - - -## More Ideas - -Things we *could* do, but maybe *shouldn't*: - -- journal-level metadata and summary. Could just link to fatcat. - - -## APIs - -Might also expose as public APIs on that domain: - -- search -- citation matching -- save-paper-now - - -## Implementation - -For first iteration, going to use: - -- python3.7 -- elasticsearch-dsl from python and page-load-per-query (not single-page-app) -- fastapi (web framework) -- jinja2 (HTML templating) -- babel (i18n) -- semantic-ui (CSS) -- minimal or no javascript diff --git a/proposals/work_schema.md b/proposals/work_schema.md deleted file mode 100644 index 97d60ac..0000000 --- a/proposals/work_schema.md +++ /dev/null @@ -1,108 +0,0 @@ - -## Top-Level - -- type: `_doc` (aka, no type, `include_type_name=false`) -- key: keyword (same as `_id`) -- `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits) -- `doc_type`: keyword (work or page) -- `doc_index_ts`: timestamp when document indexed -- `work_ident`: fatcat work ident (optional) - -- `biblio`: obj -- `fulltext`: obj -- `ia_sim`: obj -- `abstracts`: nested - body - lang -- `releases`: nested (TBD) -- `access` -- `tags`: array of keywords - -TODO: -- summary fields to index "everything" into? - -## Biblio - -Mostly matches existing `fatcat_release` schema. - -- `release_id` -- `release_revision` -- `title` -- `subtitle` -- `original_title` -- `release_date` -- `release_year` -- `withdrawn_status` -- `language` -- `country_code` -- `volume` (etc) -- `volume_int` (etc) -- `first_page` -- `first_page_int` -- `pages` -- `doi` etc -- `number` (etc) - -NEW: -- `preservation_status` - -[etc] - -- `license_slug` -- `publisher` (etc) -- `container_name` (etc) -- `container_id` -- `container_issnl` -- `container_wikidata_qid` -- `issns` (array) -- `contrib_names` -- `affiliations` -- `creator_ids` - -TODO: should all external identifiers go under `releases` instead of `biblio`? Or some duplicated? - -## Fulltext - -- `status`: web, sim, shadow -- `body` -- `lang` -- `file_mimetype` -- `file_sha1` -- `file_id` -- `thumbnail_url` - -## Abstracts - -Nested object with: - -- body -- lang - -For prototyping, perhaps just make it an object with `body` as an array. - -Only index one abstract per language. - -## SIM (Microfilm) - -Enough details to construct a link or do a lookup or whatever. Note that might -be doing CDL status lookups on SERP pages. - -- `issue_item`: str -- `pub_collection`: str -- `sim_pubid`: str -- `first_page`: str - - -Also pass-through archive.org metadata here (collection-level and item-level) - -## Access - -Start with obj, but maybe later nested? 
-
-- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
-- `mimetype`
-- `access_url`
-- `file_url`
-- `file_id`
-- `release_id`
-