From f5a883642dd114ac2c29c72348bed05616189aa2 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 11 May 2020 19:12:13 -0700
Subject: start sketching proposals

---
 proposals/fatcat_indexing_pipeline.md    | 54 ++++++++++++++++++
 proposals/microfilm_indexing_pipeline.md | 30 ++++++++++
 proposals/overview.md                    | 38 +++++++++++++
 proposals/web_interface.md               | 69 +++++++++++++++++++++++
 proposals/work_schema.md                 | 96 ++++++++++++++++++++++++++++++++
 5 files changed, 287 insertions(+)
 create mode 100644 proposals/fatcat_indexing_pipeline.md
 create mode 100644 proposals/microfilm_indexing_pipeline.md
 create mode 100644 proposals/overview.md
 create mode 100644 proposals/web_interface.md
 create mode 100644 proposals/work_schema.md

diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
new file mode 100644
index 0000000..deafb65
--- /dev/null
+++ b/proposals/fatcat_indexing_pipeline.md
@@ -0,0 +1,54 @@
+
+## High-Level
+
+Work-oriented: the base input is an array of expanded releases, all from the
+same work.
+
+The re-index pipeline would follow the fatcat changelog (or an existing
+release feed) and use the `work_id` to fetch all other releases from the same
+work.
+
+The batch indexing pipeline would use a new variant of `fatcat-export` which
+outputs expanded releases (one per line), grouped (or sorted) by work id.
+
+The pipeline then looks like:
+
+- choose canonical release
+- choose best access
+- choose best fulltext file
+    => iterate over releases and files
+    => softly prefer canonical release, file access, release_date, etc
+    => check via postgrest query that fulltext is available
+    => fetch raw fulltext
+- check whether we expect a SIM copy to exist
+    => eg, using an issue db?
+    => if so, fetch petabox metadata and try to confirm, so we can create a URL
+    => if we don't have another fulltext source (?):
+    => fetch the djvu file and extract the pages in question (or just 1 if unsure?)
+- output "heavy" object
+
+The next steps are:
+
+- summarize biblio metadata
+- select one abstract per language
+- sanitize abstracts and fulltext content for indexing
+- compute counts, epistemological quality, etc
+
+The output of that goes to Kafka for indexing into ES.
+
+This indexing process is probably going to be both CPU and network intensive.
+In Python, we will want multiprocessing, and maybe also async I/O?
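+
+As a first sketch, the "choose canonical release" step could look something
+like the function below. The field names come from the existing fatcat
+release schema, but the ranking order itself is only a starting guess and
+will need tuning:
+
+```python
+from typing import Any, Dict, List, Optional
+
+Release = Dict[str, Any]  # one expanded release entity, as a dict
+
+
+def choose_canonical_release(releases: List[Release]) -> Optional[Release]:
+    """Pick one release to represent a work ("version of record" if possible)."""
+    if not releases:
+        return None
+
+    def rank(release: Release):
+        # False sorts before True, so each term asks "is this release worse?"
+        return (
+            release.get("release_stage") != "published",
+            release.get("withdrawn_status") is not None,
+            not (release.get("ext_ids") or {}).get("doi"),
+            # prefer earlier release dates; missing dates sort last
+            release.get("release_date") or "9999-12-31",
+            release.get("ident") or "",  # stable tie-break
+        )
+
+    return min(releases, key=rank)
+```
+
+The same tuple-comparator pattern would probably also work for the "choose
+best access" and "choose best fulltext file" steps, just with different terms.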
+
+## Implementation
+
+Existing tools/libraries:
+
+- fatcat-openapi-client
+- postgrest client
+- S3/minio/seaweed client
+- ftfy
+- language detection
+
+New tools needed (eventually):
+
+- strip LaTeX
+- strip JATS or HTML
diff --git a/proposals/microfilm_indexing_pipeline.md b/proposals/microfilm_indexing_pipeline.md
new file mode 100644
index 0000000..657aae2
--- /dev/null
+++ b/proposals/microfilm_indexing_pipeline.md
@@ -0,0 +1,30 @@
+
+## High-Level
+
+- operate on an entire item
+- check against the issue DB and/or fatcat search
+    => if there is fatcat work-level metadata for this issue, skip
+- fetch collection-level (journal) metadata
+- iterate through the djvu text file (see the sketch below):
+    => convert to simple text
+    => filter out non-research pages using quick heuristics
+    => try looking up the "real" page number from OCR work (in item metadata)
+- generate "heavy" intermediate schema (per valid page):
+    => fatcat container metadata
+    => ia collection (journal) metadata
+    => item metadata
+    => page fulltext and any metadata
+
+- transform "heavy" intermediates to ES schema
+
+## Implementation
+
+Existing tools and libraries:
+
+- `internetarchive` python tool, to fetch files and item metadata
+- fatcat API client, for container metadata lookups
+
+New tools or libraries needed:
+
+- issue DB, or use the fatcat search index to count releases by volume/issue
+- djvu XML parser
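+
+A possible starting point for the page iteration, assuming the usual
+archive.org `_djvu.xml` layout (one `OBJECT` element per page, with the OCR
+text in `WORD` elements); the filter heuristics and thresholds are made-up
+placeholders to tune against real items:
+
+```python
+import xml.etree.ElementTree as ET
+from typing import Iterator, Tuple
+
+
+def djvu_pages(xml_path: str) -> Iterator[Tuple[int, str]]:
+    """Stream (page_number, text) pairs from a *_djvu.xml file.
+
+    Uses iterparse so that large items are never fully loaded into memory.
+    """
+    page_num = 0
+    for _, elem in ET.iterparse(xml_path, events=("end",)):
+        if elem.tag == "OBJECT":
+            page_num += 1
+            words = [w.text for w in elem.iter("WORD") if w.text]
+            yield page_num, " ".join(words)
+            elem.clear()  # discard parsed nodes as we go
+
+
+def looks_like_research(page_text: str) -> bool:
+    """Cheap heuristic to skip covers, ads, and index pages."""
+    if len(page_text.split()) < 150:
+        return False
+    lower = page_text.lower()
+    return any(hint in lower for hint in ("abstract", "references", "introduction"))
+```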
diff --git a/proposals/overview.md b/proposals/overview.md
new file mode 100644
index 0000000..fa8148c
--- /dev/null
+++ b/proposals/overview.md
@@ -0,0 +1,38 @@
+
+
+There can be multiple releases for each work:
+
+- required: the most canonical published version ("version of record", what
+  would be cited)
+    => or, the most updated version?
+- optional: the most openly accessible version
+- optional: an updated version
+    => errata, a corrected version, or a retraction
+- optional: the fulltext-indexed version
+    => might not be the most updated, or not accessible
+
+
+## Initial Plan
+
+Index all fatcat works in the catalog.
+
+Always link to a born-digital copy if one is accessible.
+
+Always link to a SIM microfilm copy if one is available.
+
+Use the best available fulltext for search. If structured (like TEI-XML),
+index the body text separately from abstracts and references.
+
+
+## Other Ideas
+
+Do fulltext indexing at the granularity of pages, or some other segments of
+text within articles (paragraphs, chapters, sections).
+
+Fatcat already has all of Crossref, Pubmed, Arxiv, and several other
+authoritative metadata sources. But today we are missing a good chunk of
+content, particularly from institutional repositories and CS conferences
+(which often don't use identifiers). We also don't have good affiliation or
+citation count coverage, and mixed/poor abstract coverage.
+
+Could use the Microsoft Academic Graph (MAG) metadata corpus (or similar) to
+bootstrap with better metadata coverage.
diff --git a/proposals/web_interface.md b/proposals/web_interface.md
new file mode 100644
index 0000000..416e6fc
--- /dev/null
+++ b/proposals/web_interface.md
@@ -0,0 +1,69 @@
+
+A single domain (TBD) will host a web search interface. We may also expose
+APIs on this host, or might use a separate host for that.
+
+Content would not be hosted on this domain; all fulltext copies would be
+hosted elsewhere and linked to.
+
+Style (eg, colors, fonts) would be similar to fatcat.wiki, but may or may not
+have the regular top bar. There would be no "write" or "modify" features on
+this site at all: users would not need to log in. Metadata updates and
+features would all redirect to archive.org or fatcat.wiki.
+
+
+## Design and Features
+
+We will try to hew most closely to Pubmed in style, layout, and features.
+
+Only a single search interface (no separate "advanced" page), with a custom
+query parser.
+
+Filtering and sorting happen via controls under the search box. A button
+opens a box with more settings. If these are persisted at all, it is only via
+cookies or local storage.
+
+## URL Structure
+
+All pages can be prefixed with a two-character language specifier. The
+default (with no prefix) is English.
+
+`/`: homepage; a single sentence, a large search box, and quick stats and info
+
+`/about`: about page
+
+`/help`: FAQ?
+
+`/help/search`: advanced query tips
+
+`/search`: query and results page
+
+
+## More Ideas
+
+Things we *could* do, but maybe *shouldn't*:
+
+- journal-level metadata and summary pages. Could just link to fatcat.
+
+
+## APIs
+
+Might also expose public APIs on that domain:
+
+- search
+- citation matching
+- save-paper-now
+
+
+## Implementation
+
+For the first iteration, going to use:
+
+- python3.7
+- elasticsearch-dsl from python, with a page load per query (not a single-page app)
+- fastapi (web framework)
+- jinja2 (HTML templating)
+- babel (i18n)
+- semantic-ui (CSS)
+- minimal or no javascript
diff --git a/proposals/work_schema.md b/proposals/work_schema.md
new file mode 100644
index 0000000..1e0f272
--- /dev/null
+++ b/proposals/work_schema.md
@@ -0,0 +1,96 @@
+
+## Top-Level
+
+- type: _doc
+- key: keyword
+- key_type: keyword (work or page)
+- `work_id`
+- biblio: obj
+- fulltext: obj
+- sim: obj
+- abstracts: nested
+    => body
+    => lang
+- releases: nested (TBD)
+- access
+- tags: array of keywords
+
+TODO:
+
+- summary fields to index "everything" into?
+
+## Biblio
+
+Mostly matches the existing `fatcat_release` schema.
+
+- `release_id`
+- `release_revision`
+- `title`
+- `subtitle`
+- `original_title`
+- `release_date`
+- `release_year`
+- `withdrawn_status`
+- `language`
+- `country_code`
+- `volume` (etc)
+- `volume_int` (etc)
+- `first_page`
+- `first_page_int`
+- `pages`
+- `doi` (etc)
+- `number` (etc)
+
+NEW:
+
+- `preservation_status`
+
+[etc]
+
+- `license_slug`
+- `publisher` (etc)
+- `container_name` (etc)
+- `container_id`
+- `container_issnl`
+- `container_issn` (array)
+- `contrib_names`
+- `affiliations`
+- `creator_ids`
+
+## Fulltext
+
+- `status`: web, sim, shadow
+- `body`
+- `lang`
+- `file_mimetype`
+- `file_sha1`
+- `file_id`
+- `thumbnail_url`
+
+## Abstracts
+
+Nested objects with:
+
+- body
+- lang
+
+For prototyping, perhaps just make this an object with `body` as an array.
+
+Only index one abstract per language.
+
+## SIM (Microfilm)
+
+Enough details to construct a link, or to do a lookup or whatever. Note that
+we might be doing CDL status lookups on SERP pages.
+
+Also pass through archive.org metadata here (collection-level and item-level).
+
+## Access
+
+Start with an object, but maybe make it nested later?
+
+- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
+- `mimetype`
+- `access_url`
+- `file_url`
+- `file_id`
+- `release_id`
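+
+A provisional elasticsearch-dsl sketch of the layout above (the index name
+and exact field types are placeholder guesses, and several of the objects are
+left untyped for now):
+
+```python
+from elasticsearch_dsl import Document, InnerDoc, Keyword, Nested, Object, Text
+
+
+class Abstract(InnerDoc):
+    body = Text()
+    lang = Keyword()
+
+
+class ScholarWork(Document):
+    key = Keyword()
+    key_type = Keyword()  # "work" or "page"
+    work_id = Keyword()
+    tags = Keyword(multi=True)
+
+    biblio = Object()     # release_id, title, release_year, doi, etc
+    fulltext = Object()   # status, body, lang, file_sha1, etc
+    sim = Object()        # pass-through archive.org metadata
+    access = Object()     # status, access_url, mimetype, etc
+    abstracts = Nested(Abstract)
+    releases = Nested()   # TBD
+
+    class Index:
+        name = "scholar_work"  # placeholder
+```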