From f5a883642dd114ac2c29c72348bed05616189aa2 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 11 May 2020 19:12:13 -0700
Subject: start sketching proposals

---
 proposals/fatcat_indexing_pipeline.md    | 54 ++++++++++++++++++
 proposals/microfilm_indexing_pipeline.md | 30 ++++++++++
 proposals/overview.md                    | 38 +++++++++++++
 proposals/web_interface.md               | 69 +++++++++++++++++++++++
 proposals/work_schema.md                 | 96 ++++++++++++++++++++++++++++++++
 5 files changed, 287 insertions(+)
 create mode 100644 proposals/fatcat_indexing_pipeline.md
 create mode 100644 proposals/microfilm_indexing_pipeline.md
 create mode 100644 proposals/overview.md
 create mode 100644 proposals/web_interface.md
 create mode 100644 proposals/work_schema.md

diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
new file mode 100644
index 0000000..deafb65
--- /dev/null
+++ b/proposals/fatcat_indexing_pipeline.md
@@ -0,0 +1,54 @@
+
+## High-Level
+
+Work-oriented: the base input is an array of expanded releases, all from the
+same work.
+
+The re-index pipeline would follow the fatcat changelog (or an existing
+release feed) and use the `work_id` to fetch all other releases from the same
+work.
+
+The batch indexing pipeline would use a new variant of `fatcat-export` which
+outputs expanded releases (one per line), grouped (or sorted) by work id.
+
+The pipeline then looks like:
+
+- choose canonical release
+- choose best access
+- choose best fulltext file
+    => iterate over releases and files
+    => softly prefer canonical release, file access, release_date, etc
+    => check via postgrest query that fulltext is available
+    => fetch raw fulltext
+- check whether we expect a SIM copy to exist
+    => eg, using an issue db?
+    => if so, fetch petabox metadata and try to confirm, so we can create a URL
+    => if we don't have another fulltext source (?):
+    => fetch the djvu file and extract the pages in question (or just 1 if unsure?)
+- output "heavy" object
+
+The next steps are:
+
+- summarize biblio metadata
+- select one abstract per language
+- sanitize abstracts and fulltext content for indexing
+- compute counts, epistemological quality, etc
+
+The output of that goes to Kafka for indexing into ES.
+
+This indexing process is probably going to be both CPU and network intensive.
+In Python, we will want multiprocessing, and maybe also async I/O?
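+
+As a first sketch, the "choose canonical release" step could look something
+like the function below. The field names come from the existing fatcat
+release schema, but the ranking order itself is only a starting guess and
+will need tuning:
+
+```python
+from typing import Any, Dict, List, Optional
+
+Release = Dict[str, Any]  # one expanded release entity, as a dict
+
+
+def choose_canonical_release(releases: List[Release]) -> Optional[Release]:
+    """Pick one release to represent a work ("version of record" if possible)."""
+    if not releases:
+        return None
+
+    def rank(release: Release):
+        # False sorts before True, so each term asks "is this release worse?"
+        return (
+            release.get("release_stage") != "published",
+            release.get("withdrawn_status") is not None,
+            not (release.get("ext_ids") or {}).get("doi"),
+            # prefer earlier release dates; missing dates sort last
+            release.get("release_date") or "9999-12-31",
+            release.get("ident") or "",  # stable tie-break
+        )
+
+    return min(releases, key=rank)
+```
+
+The same tuple-comparator pattern would probably also work for the "choose
+best access" and "choose best fulltext file" steps, just with different terms.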
+
+## Implementation
+
+Existing tools/libraries:
+
+- fatcat-openapi-client
+- postgrest client
+- S3/minio/seaweed client
+- ftfy
+- language detection
+
+New tools needed (eventually):
+
+- strip LaTeX
+- strip JATS or HTML
diff --git a/proposals/microfilm_indexing_pipeline.md b/proposals/microfilm_indexing_pipeline.md
new file mode 100644
index 0000000..657aae2
--- /dev/null
+++ b/proposals/microfilm_indexing_pipeline.md
@@ -0,0 +1,30 @@
+
+## High-Level
+
+- operate on an entire item
+- check against the issue DB and/or fatcat search
+    => if there is fatcat work-level metadata for this issue, skip
+- fetch collection-level (journal) metadata
+- iterate through the djvu text file (see the sketch below):
+    => convert to simple text
+    => filter out non-research pages using quick heuristics
+    => try looking up the "real" page number from OCR work (in item metadata)
+- generate "heavy" intermediate schema (per valid page):
+    => fatcat container metadata
+    => ia collection (journal) metadata
+    => item metadata
+    => page fulltext and any metadata
+
+- transform "heavy" intermediates to ES schema
+
+## Implementation
+
+Existing tools and libraries:
+
+- `internetarchive` python tool, to fetch files and item metadata
+- fatcat API client, for container metadata lookups
+
+New tools or libraries needed:
+
+- issue DB, or use the fatcat search index to count releases by volume/issue
+- djvu XML parser
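+
+A possible starting point for the page iteration, assuming the usual
+archive.org `_djvu.xml` layout (one `OBJECT` element per page, with the OCR
+text in `WORD` elements); the filter heuristics and thresholds are made-up
+placeholders to tune against real items:
+
+```python
+import xml.etree.ElementTree as ET
+from typing import Iterator, Tuple
+
+
+def djvu_pages(xml_path: str) -> Iterator[Tuple[int, str]]:
+    """Stream (page_number, text) pairs from a *_djvu.xml file.
+
+    Uses iterparse so that large items are never fully loaded into memory.
+    """
+    page_num = 0
+    for _, elem in ET.iterparse(xml_path, events=("end",)):
+        if elem.tag == "OBJECT":
+            page_num += 1
+            words = [w.text for w in elem.iter("WORD") if w.text]
+            yield page_num, " ".join(words)
+            elem.clear()  # discard parsed nodes as we go
+
+
+def looks_like_research(page_text: str) -> bool:
+    """Cheap heuristic to skip covers, ads, and index pages."""
+    if len(page_text.split()) < 150:
+        return False
+    lower = page_text.lower()
+    return any(hint in lower for hint in ("abstract", "references", "introduction"))
+```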
diff --git a/proposals/overview.md b/proposals/overview.md
new file mode 100644
index 0000000..fa8148c
--- /dev/null
+++ b/proposals/overview.md
@@ -0,0 +1,38 @@
+
+
+There can be multiple releases for each work:
+
+- required: the most canonical published version ("version of record", what
+  would be cited)
+    => or, the most updated version?
+- optional: the most openly accessible version
+- optional: an updated version
+    => errata, a corrected version, or a retraction
+- optional: the fulltext-indexed version
+    => might not be the most updated, or not accessible
+
+
+## Initial Plan
+
+Index all fatcat works in the catalog.
+
+Always link to a born-digital copy if one is accessible.
+
+Always link to a SIM microfilm copy if one is available.
+
+Use the best available fulltext for search. If structured (like TEI-XML),
+index the body text separately from abstracts and references.
+
+
+## Other Ideas
+
+Do fulltext indexing at the granularity of pages, or some other segments of
+text within articles (paragraphs, chapters, sections).
+
+Fatcat already has all of Crossref, Pubmed, Arxiv, and several other
+authoritative metadata sources. But today we are missing a good chunk of
+content, particularly from institutional repositories and CS conferences
+(which often don't use identifiers). We also don't have good affiliation or
+citation count coverage, and mixed/poor abstract coverage.
+
+Could use the Microsoft Academic Graph (MAG) metadata corpus (or similar) to
+bootstrap with better metadata coverage.
diff --git a/proposals/web_interface.md b/proposals/web_interface.md
new file mode 100644
index 0000000..416e6fc
--- /dev/null
+++ b/proposals/web_interface.md
@@ -0,0 +1,69 @@
+
+A single domain (TBD) will host a web search interface. We may also expose
+APIs on this host, or might use a separate host for that.
+
+Content would not be hosted on this domain; all fulltext copies would be
+hosted elsewhere and linked to.
+
+Style (eg, colors, fonts) would be similar to fatcat.wiki, but may or may not
+have the regular top bar. There would be no "write" or "modify" features on
+this site at all: users would not need to log in. Metadata updates and
+features would all redirect to archive.org or fatcat.wiki.
+
+
+## Design and Features
+
+We will try to hew most closely to Pubmed in style, layout, and features.
+
+Only a single search interface (no separate "advanced" page), with a custom
+query parser.
+
+Filtering and sorting happen via controls under the search box. A button
+opens a box with more settings. If these are persisted at all, it is only via
+cookies or local storage.
+
+## URL Structure
+
+All pages can be prefixed with a two-character language specifier. The
+default (with no prefix) is English.
+
+`/`: homepage; a single sentence, a large search box, and quick stats and info
+
+`/about`: about page
+
+`/help`: FAQ?
+
+`/help/search`: advanced query tips
+
+`/search`: query and results page
+
+
+## More Ideas
+
+Things we *could* do, but maybe *shouldn't*:
+
+- journal-level metadata and summary pages. Could just link to fatcat.
+
+
+## APIs
+
+Might also expose public APIs on that domain:
+
+- search
+- citation matching
+- save-paper-now
+
+
+## Implementation
+
+For the first iteration, going to use:
+
+- python3.7
+- elasticsearch-dsl from python, with a page load per query (not a single-page app)
+- fastapi (web framework)
+- jinja2 (HTML templating)
+- babel (i18n)
+- semantic-ui (CSS)
+- minimal or no javascript
diff --git a/proposals/work_schema.md b/proposals/work_schema.md
new file mode 100644
index 0000000..1e0f272
--- /dev/null
+++ b/proposals/work_schema.md
@@ -0,0 +1,96 @@
+
+## Top-Level
+
+- type: _doc
+- key: keyword
+- key_type: keyword (work or page)
+- `work_id`
+- biblio: obj
+- fulltext: obj
+- sim: obj
+- abstracts: nested
+    => body
+    => lang
+- releases: nested (TBD)
+- access
+- tags: array of keywords
+
+TODO:
+
+- summary fields to index "everything" into?
+
+## Biblio
+
+Mostly matches the existing `fatcat_release` schema.
+
+- `release_id`
+- `release_revision`
+- `title`
+- `subtitle`
+- `original_title`
+- `release_date`
+- `release_year`
+- `withdrawn_status`
+- `language`
+- `country_code`
+- `volume` (etc)
+- `volume_int` (etc)
+- `first_page`
+- `first_page_int`
+- `pages`
+- `doi` (etc)
+- `number` (etc)
+
+NEW:
+
+- `preservation_status`
+
+[etc]
+
+- `license_slug`
+- `publisher` (etc)
+- `container_name` (etc)
+- `container_id`
+- `container_issnl`
+- `container_issn` (array)
+- `contrib_names`
+- `affiliations`
+- `creator_ids`
+
+## Fulltext
+
+- `status`: web, sim, shadow
+- `body`
+- `lang`
+- `file_mimetype`
+- `file_sha1`
+- `file_id`
+- `thumbnail_url`
+
+## Abstracts
+
+Nested objects with:
+
+- body
+- lang
+
+For prototyping, perhaps just make this an object with `body` as an array.
+
+Only index one abstract per language.
+
+## SIM (Microfilm)
+
+Enough details to construct a link, or to do a lookup or whatever. Note that
+we might be doing CDL status lookups on SERP pages.
+
+Also pass through archive.org metadata here (collection-level and item-level).
+
+## Access
+
+Start with an object, but maybe make it nested later?
+
+- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
+- `mimetype`
+- `access_url`
+- `file_url`
+- `file_id`
+- `release_id`
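+
+A provisional elasticsearch-dsl sketch of the layout above (the index name
+and exact field types are placeholder guesses, and several of the objects are
+left untyped for now):
+
+```python
+from elasticsearch_dsl import Document, InnerDoc, Keyword, Nested, Object, Text
+
+
+class Abstract(InnerDoc):
+    body = Text()
+    lang = Keyword()
+
+
+class ScholarWork(Document):
+    key = Keyword()
+    key_type = Keyword()  # "work" or "page"
+    work_id = Keyword()
+    tags = Keyword(multi=True)
+
+    biblio = Object()     # release_id, title, release_year, doi, etc
+    fulltext = Object()   # status, body, lang, file_sha1, etc
+    sim = Object()        # pass-through archive.org metadata
+    access = Object()     # status, access_url, mimetype, etc
+    abstracts = Nested(Abstract)
+    releases = Nested()   # TBD
+
+    class Index:
+        name = "scholar_work"  # placeholder
+```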