## In Progress

- basic python tests for editgroup, annotation, submission changes
- python tests for new autoaccept behavior
- python tests for citation table storage efficiency changes
    => should there be a distinction between empty list and no references?
       yes, eg if expanded or not hidden
    => postgres manual checks that this is working
    => also benchmark (both speed and efficiency)

## Next Up

- "don't clobber" mode/flag for crossref import (and others?)
- update_file requires 'id'. should it be 'ident'?
    => something different about file vs. release
- guide updates for auth
- refactor webface views to use shared entity_view.html template
- handle 'wip' status entities in web UI
- elastic inserter should handle deletions and redirects; if state isn't
  active, delete the document
    => don't delete, just store state. but need to "blank" redirects and WIP
       so they don't show up in results
    => refactor inserter to be a class (eg, for command line use)
    => end-to-end test of this behavior?
- date handling is really pretty bad for releases; mangling those Jan1/Dec31
    => elastic schema should have a year field (integer)
- document: elastic query date syntax is like: date:[2018-10-01 TO 2018-12-31]
- elastic transform should only include authors, not editors (?)
- webcapture timestamp schema cleanup (both CDX and base)
    => dt.to_rfc3339_opts(SecondsFormat::Secs, true)
    => but this is mostly buried in serialization code?
- fake DOI (use in examples): 10.5555/12345678
- URL location duplication (especially IA/wayback)
    => eg, https://fatcat.wiki/file/2g4sz57j3bgcfpwkgz5bome3re
    => UNIQUE index on {release_rev, url}?
- shadow library manifest importer
- import from arabesque output (eg, specific crawls)
- elastic iteration
    => any_abstract broken?
    => blank author names? maybe in crossref import; fatcat-api and schema
       should both prevent
- handle very large author/reference lists (instead of dropping)
    => https://api.crossref.org/v1/works/http://dx.doi.org/10.1007/978-3-319-46095-6_7
    => 7000+ authors (!)

## Bugs (or at least need tests)

- autoaccept seems to have silently not actually merged editgroup

## Ideas

- more logins: orcid, wikimedia
- `fatcat-auth` tool should support more caveats, both when generating new or
  mutating existing tokens
- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
- consider dropping CORE identifier
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce a round
  trip and "hanging" editgroups (created but never edited)
- API: allow deletion of empty, un-accepted editgroups
- refactor API schema for some entity-generic methods (eg, history, edit
  operations) to take entity type as a URL path param. greatly reduce macro
  foolery and method count/complexity, and ease creation of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history
- investigate data quality by looking at, eg, most popular author strings, most
  popular titles, duplicated containers, etc

## Production blockers

- privacy policy, and link from: create account, create edit
- update /about page
- refactors and correctness in rust/TODO
- importers: don't insert wayback links with short timestamps

## Production Sanity

- fatcat-web is not Type=simple (systemd)
- postgresql replication
- pg_dump/load test
- haproxy somewhere/how
- logging iteration: larger journald buffers? point somewhere?

## Metadata Import

- web.archive.org response doesn't match SHA1?
    => need /id_/ thing
- XML etc in metadata
    => (python) tests for these!
    https://qa.fatcat.wiki/release/b3a2jvhvbvc6rlbdkpw4ukuzyi
    https://qa.fatcat.wiki/release/search?q=xmlns
    https://qa.fatcat.wiki/release/search?q=%26amp%3B
    https://qa.fatcat.wiki/release/search?q=%26gt%3B
- better/complete reltypes probably good (eg, list of IRs, academic domain)
- 'expand' in lookups (derp! for single hit lookups)
- include crossref-capitalized DOI in extra
- some "Elsevier " stuff as publisher
    => also title
    https://fatcat.wiki/release/uyjzaq3xjnd6tcrqy3vcucczsi
- crossref import: don't store citation unstructured if len() == 0:
    {"crossref": {"unstructured": ""}}
- cleaning/matching: https://ftfy.readthedocs.io/en/latest/
    => and try out beautifulsoup (https://stackoverflow.com/a/34532382/4682349)
- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- container import (extra?): lang, region, subject
- crossref: filter works
    => content-type whitelist
    => title length and title/slug blacklist
    => at least one author (?)
    => make this a method on Release object
    => or just set release_type as "stub"?
- special "alias" DOIs... in crossref metadata?

new importers:
- pubmed (medline) (filtered)
    => and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)

## Entity/Edit Lifecycle

- commenting and accepting editgroups
- editgroup state machine?

## Guide / Book / Style

- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html

## Fun Features

- "save paper now"
    => is it in GWB? if not, SPN
    => get hash + url from GWB, verify mimetype acceptable
    => is file in fatcat?
    => what about HBase? GROBID?
    => create edit, redirect user to editgroup submit page
- python client tool and library in pypi
    => or maybe rust?
- bibtex (etc) export

## Metadata Harvesting

- datacite ingest seems to have failed... got a non-HTTP-200 status code, but
  also "got 50 (161950 of 21084)"

## Schema / Entity Fields

- arxiv_id field (keep flip-flopping)
- original_title field (internationalization, "original language")
- `doi` field for containers (at least for "journal" type; maybe for "series"
  as well?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
  instead of release_status?
- 'part-of' relation for releases (release to release) and possibly containers
- `container_type` field for containers (journal, conference, book series, etc)

## Other / Backburner

- fileset/webcapture webface anything
- display abstracts better. no hashes or metadata; prefer plain or HTML,
  convert JATS if necessary
- switch from slog to simple pretty_env_log
- format returned datetimes with only second precision, not millisecond (RFC mode)
    => buried in model serialization internals
- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher
    => takes sha1 keys
    => checks fatcat API + hbase
    => if not matched yet, tries elastic search
    => simple ~exact match heuristic
    => proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
    => 12 hours? 24?
- haproxy for rate-limiting

better API docs:
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox

CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?
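
sketch for the "don't insert wayback links with short timestamps" blocker above: a possible importer-side check, not existing fatcat code. It assumes "short" means fewer than 14 digits (a full wayback timestamp being YYYYMMDDhhmmss) and that URLs look like `https://web.archive.org/web/<timestamp>[modifier]/<original-url>`. All function names here are hypothetical.

```python
import re

# Assumption: a "full" wayback timestamp is 14 digits (YYYYMMDDhhmmss).
FULL_TIMESTAMP_LEN = 14

# Matches wayback URLs with a numeric timestamp and an optional two-letter
# modifier such as "id_" (eg, /web/20170102030405id_/http://...).
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(?P<ts>\d+)(?:[a-z]{2}_)?/(?P<orig>.+)$"
)


def wayback_timestamp(url):
    """Return the timestamp digits of a wayback URL, or None if it isn't one."""
    m = WAYBACK_RE.match(url)
    return m.group("ts") if m else None


def has_short_wayback_timestamp(url):
    """True only for wayback URLs whose timestamp has fewer than 14 digits."""
    ts = wayback_timestamp(url)
    return ts is not None and len(ts) < FULL_TIMESTAMP_LEN


def filter_urls(urls):
    """Drop wayback URLs with short timestamps; keep everything else."""
    return [u for u in urls if not has_short_wayback_timestamp(u)]
```

an importer could run `filter_urls()` over a file's URL list just before building the entity; non-wayback URLs pass through untouched.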