From f71715517c7d933859ef9a5c5df3929f78c7a93d Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 4 Jun 2020 14:42:33 -0700
Subject: add WIP notes to repo

---
 TODO.txt                         |  41 +++++++++++++++
 notes/fatcat_sim_intersection.md |  22 ++++++++
 notes/plan.txt                   | 105 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)
 create mode 100644 TODO.txt
 create mode 100644 notes/fatcat_sim_intersection.md
 create mode 100644 notes/plan.txt

diff --git a/TODO.txt b/TODO.txt
new file mode 100644
index 0000000..f972c6a
--- /dev/null
+++ b/TODO.txt
@@ -0,0 +1,41 @@
+
+content/pipeline:
+x helper to index based on a search query
+- parallelize SIM indexing
+
+UI/UX fixes:
+x all links in new tabs
+x "keyword" in front-page box; replace identifier with examples
+x default to "availability: fulltext"
+x update front-page thumb selection (-1 PLOS)
+x brief user guide
+x "indexed" -> "json" (tag)
+x fatcat tag with link; "metadata"?
+x OA facet broken; needs tagging?
+x vertical alignment of thumbnails
+x pagination
+x filter HTML form weirdness
+    => split off "hidden" form fields
+x textpipe to escape HTML better
+    => regression test
+x group pages within issues
+x container links broken?
+x tag/tags
+x color+link OA tags. or click to filter?
+- better labeling pre-prints
+
+cleanups:
+x make fmt -> black
+x flake8
+x mypy require annotations?
+
+ponder:
+x single paragraph on front page
+x "This fulltext search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive."
+- some space-holder for missing thumbnails
+- smaller author font size
+- "search inside" phrasing
+
+data quality:
+- handle sim_issue items with multiple issues in single item (eg, issue="3-4")
+
diff --git a/notes/fatcat_sim_intersection.md b/notes/fatcat_sim_intersection.md
new file mode 100644
index 0000000..43500ec
--- /dev/null
+++ b/notes/fatcat_sim_intersection.md
@@ -0,0 +1,22 @@
+
+investigate how many fatcat releases match to SIM:
+- dump archive.org SIM collection-level metadata
+- dump archive.org issue item-level metadata
+- releases with: in_sim, volume, issue, page, year (month?)
+    => 22m in_ia_sim
+    => 1.1m in_ia_sim preservation:none
+    => 20m in_ia_sim volume
+    => 20m in_ia_sim volume year
+    => 19m in_ia_sim volume pages
+    => 5m in_ia_sim volume year date
+    => 7m in_ia_sim volume issue
+    => 7m in_ia_sim volume issue pages
+    => 6m in_ia_sim volume issue pages first_page
+    => 5.3m in_ia_sim volume issue pages first_page in_web:false
+    => 0.7m in_ia_sim volume issue pages first_page preservation:none
+    => 2.5m in_ia_sim volume issue pages first_page date
+- how many (any?) SIM journals with no fatcat container
+- how many SIM journals/issues/years with ~no fatcat releases
+
+at least some (release_jpruczlec5gsjpbc2cbvwedsdy) have updated crossref
+metadata with issue numbers
diff --git a/notes/plan.txt b/notes/plan.txt
new file mode 100644
index 0000000..9a2d998
--- /dev/null
+++ b/notes/plan.txt
@@ -0,0 +1,105 @@
+
+x write proposals
+    => overview
+    => document-per-work schema
+    => URL structure
+    => fatcat indexing pipeline
+    => microfilm indexing pipeline
+x fastapi skeleton
+    => pipenv
+    => jinja2 templates
+x sketch out elasticsearch schema
+x issue db
+x release w/ or w/o sim pipeline
+    => start with work_ident
+    => fetch releases
+    => discover/match to ia_sim item (stub)
+x sim w/o release pipeline
+    => check if there are fatcat releases for issue
+    => otherwise iterate over entire issue generating pages
+x example corpus: fatcat papers w/ GROBID TEI-XML
+    => start with covid19 corpus/pipeline
+x example corpus: sim microfilm
+x check release w/ sim pipeline
+x release index pdftotext hack
+    => test on laptop
+- indexing pipeline skeleton
+    x postgrest access
+    x intermediate "heavy" schema
+    => kafka topics and schemas
+    => minio/seaweedfs access
+- estimate fraction of SIM content with releases in fatcat ("backwards" fraction)
+
+bugs:
+x xml.etree.ElementTree.ParseError: unbound prefix: line 1, column 0
+. only some thumbnails showing?
+    => maybe because found GROBID "before" pdftotext?
+. assert 'page_numbers' in issue_meta
+. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 984: invalid continuation byte
+    => while reading pdftotext
+x contribs not coming through
+x are abstracts being searched?
+x "indexed" links broken
+x abstract JATS not getting striped
+x default type filter as "papers", not "everything"
+- container_original_name (?)
+- still leaking HTML through abstracts
+    => let's do proper highlight escapes
+    => still happening after ES filter (!)
+
+refactors:
+x abstracts in ES schema; maybe don't really need abstract alias, do in query schema?
+    => just make "object" for now
+- fetch_sim_issue / fetch_sim
+- pass through thumbnail URL
+- first_page in sim_fulltext object
+- container metadata in sim pipeline, and pass through for indexing
+
+- UI tweaks
+    . "hits" spilling over out of side bar
+    . fatcat_ident links
+    . pmcid display
+    . for debugging, a link to search doc (like a tag?)
+    . need many more schema aliases for biblio fields (eg, title); doctype for doc_type
+    . fewer but longer highlights
+    . jinja2 less whitespace (some config flag?)
+    . query in the search bar (after a search)
+    . filters actually working
+    . mobile CSS fixes
+    . larger font size
+    . search error page
+    . i18n and zh examples
+    => change "indexed" tag to an icon (or "json"), and fix QA links
+    => mobile thumbnail could use top thumbnail margin? or all actually?
+    => w3c validate
+
+- experiment: existing archive.org fulltext search, my style UI/UX
+    => merging server-side is tricky... could do async and show a JS popup?
+
+- small ideas
+    x search ranking:
+        - title boost
+        - biblio
+        - stage boost
+        - have-fulltext boost
+        https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-rank-feature-query.html
+        https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-boosting-query.html
+        https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-bool-query.html
+        https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-function-score-query.html
+    => query boost for language match
+    => query helper to inject more works
+    => 404 and 5xx handlers (web)
+    => tags: OA, SIM, "lit review", DOAJ
+    => add page_numbers to issue_db
+    => title highlighting
+    => biorxiv/medrxiv note
+    => some "indexing hacks" stage?
+    => store snippet of sim_page text to show like an abstract?
+    => user guide page with examples
+    => example queries on front page?
+    => robots.txt?
+
+- later projects/proposals
+    => pass query through to in-book reading
+    => query parser
+    => re-OCR web PDFs with poor/missing OCR
-- 
cgit v1.2.3