author Bryan Newbold <bnewbold@archive.org> 2020-06-04 14:42:33 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-06-04 14:42:33 -0700
commit f71715517c7d933859ef9a5c5df3929f78c7a93d
tree 7347467957564d1835e3ae8175aae6c4680ed633 /notes/plan.txt
parent 368a441618426595d451eadd8179f6fa8ecfe3e9
add WIP notes to repo
Diffstat (limited to 'notes/plan.txt')
-rw-r--r-- notes/plan.txt 105
1 files changed, 105 insertions, 0 deletions
diff --git a/notes/plan.txt b/notes/plan.txt
new file mode 100644
index 0000000..9a2d998
--- /dev/null
+++ b/notes/plan.txt
@@ -0,0 +1,105 @@
+
+x write proposals
+ => overview
+ => document-per-work schema
+ => URL structure
+ => fatcat indexing pipeline
+ => microfilm indexing pipeline
+x fastapi skeleton
+ => pipenv
+ => jinja2 templates
+x sketch out elasticsearch schema
+x issue db
+x release w/ or w/o sim pipeline
+ => start with work_ident
+ => fetch releases
+ => discover/match to ia_sim item (stub)
+x sim w/o release pipeline
+ => check if there are fatcat releases for issue
+ => otherwise iterate over entire issue generating pages
+x example corpus: fatcat papers w/ GROBID TEI-XML
+ => start with covid19 corpus/pipeline
+x example corpus: sim microfilm
+x check release w/ sim pipeline
+x release index pdftotext hack
+ => test on laptop
+- indexing pipeline skeleton
+ x postgrest access
+ x intermediate "heavy" schema
+ => kafka topics and schemas
+ => minio/seaweedfs access
+- estimate fraction of SIM content with releases in fatcat ("backwards" fraction)
+
+bugs:
+x xml.etree.ElementTree.ParseError: unbound prefix: line 1, column 0
+. only some thumbnails showing?
+ => maybe because found GROBID "before" pdftotext?
+. assert 'page_numbers' in issue_meta
+. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 984: invalid continuation byte
+ => while reading pdftotext
+x contribs not coming through
+x are abstracts being searched?
+x "indexed" links broken
+x abstract JATS not getting stripped
+x default type filter as "papers", not "everything"
+- container_original_name (?)
+- still leaking HTML through abstracts
+ => let's do proper highlight escapes
+ => still happening after ES filter (!)
+
+refactors:
+x abstracts in ES schema; maybe don't really need abstract alias, do in query schema?
+ => just make "object" for now
+- fetch_sim_issue / fetch_sim
+- pass through thumbnail URL
+- first_page in sim_fulltext object
+- container metadata in sim pipeline, and pass through for indexing
+
+- UI tweaks
+ . "hits" spilling over out of side bar
+ . fatcat_ident links
+ . pmcid display
+ . for debugging, a link to search doc (like a tag?)
+ . need many more schema aliases for biblio fields (eg, title); doctype for doc_type
+ . fewer but longer highlights
+ . jinja2 less whitespace (some config flag?)
+ . query in the search bar (after a search)
+ . filters actually working
+ . mobile CSS fixes
+ . larger font size
+ . search error page
+ . i18n and zh examples
+ => change "indexed" tag to an icon (or "json"), and fix QA links
+ => mobile thumbnail could use top thumbnail margin? or all actually?
+ => w3c validate
+
+- experiment: existing archive.org fulltext search, my style UI/UX
+ => merging server-side is tricky... could do async and show a JS popup?
+
+- small ideas
+ x search ranking:
+ - title boost
+ - biblio
+ - stage boost
+ - have-fulltext boost
+ https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-rank-feature-query.html
+ https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-boosting-query.html
+ https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-bool-query.html
+ https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-function-score-query.html
+ => query boost for language match
+ => query helper to inject more works
+ => 404 and 5xx handlers (web)
+ => tags: OA, SIM, "lit review", DOAJ
+ => add page_numbers to issue_db
+ => title highlighting
+ => biorxiv/medrxiv note
+ => some "indexing hacks" stage?
+ => store snippet of sim_page text to show like an abstract?
+ => user guide page with examples
+ => example queries on front page?
+ => robots.txt?
+
+- later projects/proposals
+ => pass query through to in-book reading
+ => query parser
+ => re-OCR web PDFs with poor/missing OCR
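
The "search ranking" items in the notes above (title boost, stage boost, have-fulltext boost) could be combined along the lines of the linked `bool` and `function_score` query docs. The following is only a rough sketch under stated assumptions, not the project's actual query: the field names (`title`, `abstract`, `fulltext`, `release_stage`), the helper name `build_ranked_query`, and all boost weights are hypothetical and would need to match the real Elasticsearch schema.

```python
from typing import Any, Dict


def build_ranked_query(
    user_query: str,
    title_boost: float = 3.0,      # hypothetical weight
    stage_boost: float = 1.2,      # hypothetical weight
    fulltext_boost: float = 1.5,   # hypothetical weight
) -> Dict[str, Any]:
    """Build an Elasticsearch request body (as a plain dict) that boosts
    title matches, published-stage documents, and documents with fulltext.

    All field names here are assumptions, not the real schema.
    """
    return {
        "query": {
            "function_score": {
                # Base relevance: match the user query, weighting the title field.
                "query": {
                    "bool": {
                        "must": [
                            {
                                "multi_match": {
                                    "query": user_query,
                                    "fields": [
                                        f"title^{title_boost}",
                                        "abstract",
                                        "fulltext",
                                    ],
                                }
                            }
                        ]
                    }
                },
                # Each function multiplies the score only for documents
                # that match its filter; other documents are unaffected.
                "functions": [
                    {
                        "filter": {"term": {"release_stage": "published"}},
                        "weight": stage_boost,
                    },
                    {
                        "filter": {"exists": {"field": "fulltext"}},
                        "weight": fulltext_boost,
                    },
                ],
                "score_mode": "multiply",
                "boost_mode": "multiply",
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_ranked_query("coronavirus"), indent=2))
```

The dict can be passed as the request body of a search call. Keeping the boosts as function arguments makes it easy to experiment with weights before settling on a ranking.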