## In Progress

- update existing 1.5 mil longtail OA PDFs with container/ISSN-L

## Next Up

## Bugs

- did, somehow, end up with web.archive.org/web/None/ URLs (should remove)
- searching 'N/A' is a bug, because not quoted; auto-quote it?
- author (contrib) names not getting included in search (unless explicit)
- fatcat flask lookup ValueError should return 4xx (and message?)

## Next Schema Iteration (0.3.0)

Changes to SQL (and swagger):

- structured names in contribs (given/sur)
- `release_status` => `release_stage`
- `withdrawn_date`, `withdrawn_state`, and retraction as a release stage
- subtitle as a string field
    => but what about translation? `original_subtitle`? just combine them?
    => combine in elasticsearch 'title' field
- size on webcapture CDX lines (we fetch for sha256 anyways, so easy to calculate)
- `ark_id` release identifier
- `mag_id` (microsoft academic graph) release identifier
- releases: 'number' (eg, report numbers) and 'version' (for numbered variants) fields
- missing SQL indices: `ENTITY_edit.editgroup_id, ENTITY_edit.ident_id`

Changes to swagger only:

- edit URLs: `editgroup_id` in URL, not a query param
- changelog API endpoint needs an `expand=editors` option
- include 'created' in editgroup object (already in SQL)

## Next Full Release "Touch"

Will update all release entities (or at least all Crossref-derived entities).
Want to minimize edit counts, so will bundle a bunch of changes:

- structured contrib names (given, sur)
- reference linking (release-to-release), via crossref DOI refs
- subtitle as string, not array
- remove crossref alt ids that are just the DOI (?)

## Production Public Launch Blockers

- view edit revisions in webface
- audit fatcat metadata for CC-0
- guide updates for auth
- privacy policy, and link from: create account, create edit

## Production Tech Sanity

- postgresql replication
- haproxy somewhere/how
- logging iteration: larger journald buffers? point somewhere?

## Unsorted

- ability to "edit edits" (update in-progress edits)
- review bots:
    - tests
    - not actually processing work entities
    - filter out already reviewed
    - handle deletions, merges
    - examples of warnings, etc
- missing test coverage (python):
    - batch create work, fileset, webcapture
    - delete entity (for each entity type)
    - delete entity edits (for each entity type)
    - get entity edit (for each entity type)
    - get entity redirects (for each entity type)
    - get entity revision (for each entity type)
    - get release webcaptures
    - update editor (?)
    - update fileset, webcapture
    - release elastic transform (rich extra)
    - successful web entity edits (create fresh entities first)
    - editgroup web submit, accept, annotate
- API: ability to expand containers (and files, etc?) in releases-for-work
- API: /releases endpoint (and/or expansion) for releases-for-file (etc)
- cleanup ./notes/ directory
- links say "Download ..." but open in same page, not download
- workers (like entity updater) should use env vars more
- ansible: ISSN-L download/symlink
- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
- QA sentry has very little host info; also not URL of request
- elastic schemas:
    - release: drop revision?; container_id; creator_id
    - should `release_year` be of date type, instead of int?
    - files: domain list; mimetype; release count; url count; web/publisher/etc; size; has_md5/sha256/sha1; in_ia, in_shadow
- should elastic `release_year` be of date type, instead of int?
- webface: still need to collapse links by domain better, and also vs. www.x/x
- entity edit JSON objects could include `entity_type`
- refactor 'fatcatd' to 'fatcat-api'
- changelog elastic stuff (is there even a fatcat-export for this?)
- container count "enrich"
- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
- changelog elastic index (for stats)
- API: allow deletion of empty, un-accepted editgroups

## Ideas

- `poster` as a `release_type`
- "revert editgroup" mechanism (creates new editgroup)
- can guess some `release_status` of files by looking at wayback date vs. published date
- ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil unique DOIs; could import those other "work activities"? do they have identifiers?
- use https://github.com/codelucas/newspaper to extract fulltext+metadata from HTML crawls
- `fatcat-auth` tool should support more caveats, both when generating new or mutating existing tokens
- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce round trips and "hanging" editgroups (created but never edited)
- refactor API schema for some entity-generic methods (eg, history, edit operations) to take entity type as a URL path param. greatly reduce macro foolery and method count/complexity, and ease creation of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history
- investigate data quality by looking at, eg, most popular author strings, most popular titles, duplicated containers, etc

## Metadata Import

- 158 "NULL" publishers in journal metadata
- crossref: many ISBNs not getting copied; use python library to convert?
- remove 'first' from contrib crossref 'seq' (not helpful?)
- should probably check for 'jats:' in abstract before setting mimetype, even from crossref
- web.archive.org response not SHA1 match? => need /