-rw-r--r-- | TODO | 53 |
1 files changed, 21 insertions, 32 deletions
@@ -1,35 +1,28 @@
 
 ## In Progress
 
-- QA data checks
-    x dump: SQL and fatcat-export
-    => elastic transform and esbulk load
-    => 'container' metadata
-    => release in_* flags (updated kibana dashboard?)
-    => run crossref auto-import pipeline components
-    => wayback duplication and short datetimes
-    => re-run crossref non-bezerk; ensure no new entities
-- log Warning headers returned to user, as a QA check?
-    => guess this would be rust middleware
-
-from running tests:
-Jan 28 18:57:27.431 INFO POST http://localhost:9411/v0/creator/batch?autoaccept=True&description=test+description&extra=%7B%27q%27%3A+%27thing%27%2C+%27a%27%3A+75%7D 500 Internal Server Error (1 ms)
-Jan 28 18:57:27.438 INFO POST http://localhost:9411/v0/creator/batch?autoaccept=True&description=test+description&extra=%7B 500 Internal Server Error (3 ms)
+- attempt prod import (in QA)!
 
+## Prod Metadata Checks
+
+- longtail_oa flag getting set on GROBID imports
+- crossref citation not saving 'article-title' or 'unstructured', and 'author'
+  should be 'authors' (list)
+- crossref not saving 'language' (looks like iso code already)
+- grobid reference should be under extra (not nested): issue, volume, authors
 
 ## Next Up
 
+- serveral tweaks/fixes to webface (eg, container metadata schema changed)
 - container count "enrich"
 - changelog elastic stuff (is there even a fatcat-export for this?)
 - QA sentry has very little host info; also not URL of request
 - start prod crossref harvesting (from ~start of 2019)
 - 158 "NULL" publishers in journal metadata
-
-## Production import blockers
-
-- URL location duplication (especially IA/wayback)
-    => eg, https://fatcat.wiki/file/2g4sz57j3bgcfpwkgz5bome3re
-    => UNIQ index on {release_rev, url}?
+- should elastic release_year be of date type, instead of int?
+- QA/prod needs updated credentials
+- ansible: ISSN-L download/symlink
+- searching 'N/A' is a bug
 
 ## Production public launch blockers
 
@@ -80,10 +73,14 @@ Jan 28 18:57:27.438 INFO POST http://localhost:9411/v0/creator/batch?autoaccept=
 - web.archive.org response not SHA1 match? => need /<dt>id_/ thing
 - XML etc in metadata
     => (python) tests for these!
-    https://qa.fatcat.wiki/release/b3a2jvhvbvc6rlbdkpw4ukuzyi
     https://qa.fatcat.wiki/release/search?q=xmlns
-    https://qa.fatcat.wiki/release/search?q=%26amp%3B
-    https://qa.fatcat.wiki/release/search?q=%26gt%3B
+    https://qa.fatcat.wiki/release/search?q=%24gt
+- bad/weird titles
+    "[Blank page]", "blank page"
+    "Temporary Empty DOI 0"
+    "ADVERTISEMENT"
+    "Full title page with Editorial board (with Elsevier tree)"
+    "Advisory Board Editorial Board"
 - better/complete reltypes probably good (eg, list of IRs, academic domain)
 - 'expand' in lookups (derp! for single hit lookups)
 - include crossref-capitalized DOI in extra
@@ -91,18 +88,10 @@ Jan 28 18:57:27.438 INFO POST http://localhost:9411/v0/creator/batch?autoaccept=
     => also title https://fatcat.wiki/release/uyjzaq3xjnd6tcrqy3vcucczsi
 - crossref import: don't store citation unstructured if len() == 0:
     {"crossref": {"unstructured": ""}}
-- cleaning/matching: https://ftfy.readthedocs.io/en/latest/
-    => and try out beautifulsoup (https://stackoverflow.com/a/34532382/4682349)
+- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
 - manifest: multiple URLs per SHA1
 - crossref: relations ("is-preprint-of")
 - crossref: two phase: no citations, then matched citations (via DOI table)
-- container import (extra?): lang, region, subject
-- crossref: filter works
-    => content-type whitelist
-    => title length and title/slug blacklist
-    => at least one author (?)
-    => make this a method on Release object
-    => or just set release_type as "stub"?
 - special "alias" DOIs... in crossref metadata?
 
 new importers: