From ac1c97af86e4072cf898e46de61bea9a2bfe0b93 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 29 Jun 2020 22:02:52 -0700
Subject: update plan doc

---
 notes/plan.txt | 69 ++-------------------------------------------------------
 1 file changed, 2 insertions(+), 67 deletions(-)

diff --git a/notes/plan.txt b/notes/plan.txt
index 9a2d998..794349f 100644
--- a/notes/plan.txt
+++ b/notes/plan.txt
@@ -1,91 +1,26 @@
-x write proposals
-    => overview
-    => document-per-work schema
-    => URL structure
-    => fatcat indexing pipeline
-    => microfilm indexing pipeline
-x fastapi skeleton
-    => pipenv
-    => jinja2 templates
-x sketch out elasticsearch schema
-x issue db
-x release w/ or w/o sim pipeline
-    => start with work_ident
-    => fetch releases
-    => discover/match to ia_sim item (stub)
-x sim w/o release pipeline
-    => check if there are fatcat releases for issue
-    => otherwise iterate over entire issue generating pages
-x example corpus: fatcat papers w/ GROBID TEI-XML
-    => start with covid19 corpus/pipeline
-x example corpus: sim microfilm
-x check release w/ sim pipeline
-x release index pdftotext hack
-    => test on laptop
-- indexing pipeline skeleton
-    x postgrest access
-    x intermediate "heavy" schema
+- indexing pipeline
     => kafka topics and schemas
-    => minio/seaweedfs access
 - estimate fraction of SIM content with releases in fatcat ("backwards" fraction)
 bugs:
-x xml.etree.ElementTree.ParseError: unbound prefix: line 1, column 0
 . only some thumbnails showing?
     => maybe because found GROBID "before" pdftotext?
 . assert 'page_numbers' in issue_meta
-. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 984: invalid continuation byte
-    => while reading pdftotext
-x contribs not coming through
-x are abstracts being searched?
-x "indexed" links broken
-x abstract JATS not getting striped
-x default type filter as "papers", not "everything"
 - container_original_name (?)
-- still leaking HTML through abstracts
-    => let's do proper highlight escapes
-    => still happening after ES filter (!)
 refactors:
-x abstracts in ES schema; maybe don't really need abstract alias, do in query schema?
-    => just make "object" for now
 - fetch_sim_issue / fetch_sim
-- pass through thumbnail URL
 - first_page in sim_fulltext object
 - container metadata in sim pipeline, and pass through for indexing
 - UI tweaks
-    . "hits" spilling over out of side bar
-    . fatcat_ident links
-    . pmcid display
-    . for debugging, a link to search doc (like a tag?)
-    . need many more schema aliases for biblio fields (eg, title); doctype for doc_type
-    . fewer but longer highlights
-    . jinja2 less whitespace (some config flag?)
-    . query in the search bar (after a search)
-    . filters actually working
-    . mobile CSS fixes
-    . larger font size
-    . search error page
-    . i18n and zh examples
-    => change "indexed" tag to an icon (or "json"), and fix QA links
-    => mobile thumbnail could use top thumbnail margin? or all actually?
     => w3c validate
 - experiment: existing archive.org fulltext search, my style UI/UX
     => merging server-side is tricky... could do async and show a JS popup?
 - small ideas
-    x search ranking:
-        - title boost
-        - biblio
-        - stage boost
-        - have-fulltext boost
-        https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-rank-feature-query.html
-        https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-boosting-query.html
-        https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-bool-query.html
-        https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-function-score-query.html
     => query boost for language match
     => query helper to inject more works
     => 404 and 5xx handlers (web)
@@ -100,6 +35,6 @@ x abstracts in ES schema; maybe don't really need abstract alias, do in query sc
     => robots.txt?
 - later projects/proposals
-    => pass query through to in-book reading
     => query parser
+    => pass query through to in-book reading
     => re-OCR web PDFs with poor/missing OCR
--
cgit v1.2.3