3 files changed, 168 insertions, 0 deletions
diff --git a/TODO.txt b/TODO.txt
new file mode 100644
index 0000000..f972c6a
--- /dev/null
+++ b/TODO.txt
@@ -0,0 +1,41 @@
+
+content/pipeline:
+x helper to index based on a search query
+- parallelize SIM indexing
+
+UI/UX fixes:
+x all links in new tabs
+x "keyword" in front-page box; replace identifier with examples
+x default to "availability: fulltext"
+x update front-page thumb selection (-1 PLOS)
+x brief user guide
+x "indexed" -> "json" (tag)
+x fatcat tag with link; "metadata"?
+x OA facet broken; needs tagging?
+x vertical alignment of thumbnails
+x pagination
+x filter HTML form weirdness
+    => split off "hidden" form fields
+x textpipe to escape HTML better
+    => regression test
+x group pages within issues
+x container links broken?
+x tag/tags
+x color+link OA tags. or click to filter?
+- better labeling of pre-prints
+
+cleanups:
+x make fmt -> black
+x flake8
+x mypy require annotations?
+
+ponder:
+x single paragraph on front page
+x "This fulltext search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive."
+- some placeholder for missing thumbnails
+- smaller author font size
+- "search inside" phrasing
+
+data quality:
+- handle sim_issue items with multiple issues in single item (eg, issue="3-4")
+
diff --git a/notes/fatcat_sim_intersection.md b/notes/fatcat_sim_intersection.md
new file mode 100644
index 0000000..43500ec
--- /dev/null
+++ b/notes/fatcat_sim_intersection.md
@@ -0,0 +1,22 @@
+
+investigate how many fatcat releases match to SIM:
+- dump archive.org SIM collection-level metadata
+- dump archive.org issue item-level metadata
+- releases with: in_sim, volume, issue, page, year (month?)
+    => 22m   in_ia_sim
+    =>  1.1m in_ia_sim preservation:none
+    => 20m   in_ia_sim volume
+    => 20m   in_ia_sim volume year
+    => 19m   in_ia_sim volume pages
+    =>  5m   in_ia_sim volume year date
+    =>  7m   in_ia_sim volume issue
+    =>  7m   in_ia_sim volume issue pages
+    =>  6m   in_ia_sim volume issue pages first_page
+    =>  5.3m in_ia_sim volume issue pages first_page in_web:false
+    =>  0.7m in_ia_sim volume issue pages first_page preservation:none
+    =>  2.5m in_ia_sim volume issue pages first_page date
+- how many (any?) SIM journals with no fatcat container
+- how many SIM journals/issues/years with ~no fatcat releases
+
+at least some (release_jpruczlec5gsjpbc2cbvwedsdy) have updated crossref
+metadata with issue numbers
diff --git a/notes/plan.txt b/notes/plan.txt
new file mode 100644
index 0000000..9a2d998
--- /dev/null
+++ b/notes/plan.txt
@@ -0,0 +1,105 @@
+
+x write proposals
+    => overview
+    => document-per-work schema
+    => URL structure
+    => fatcat indexing pipeline
+    => microfilm indexing pipeline
+x fastapi skeleton
+    => pipenv
+    => jinja2 templates
+x sketch out elasticsearch schema
+x issue db
+x release w/ or w/o sim pipeline
+    => start with work_ident
+    => fetch releases
+    => discover/match to ia_sim item (stub)
+x sim w/o release pipeline
+    => check if there are fatcat releases for issue
+    => otherwise iterate over entire issue generating pages
+x example corpus: fatcat papers w/ GROBID TEI-XML
+    => start with covid19 corpus/pipeline
+x example corpus: sim microfilm
+x check release w/ sim pipeline
+x release index pdftotext hack
+    => test on laptop
+- indexing pipeline skeleton
+    x  postgrest access
+    x  intermediate "heavy" schema
+    => kafka topics and schemas
+    => minio/seaweedfs access
+- estimate fraction of SIM content with releases in fatcat ("backwards" fraction)
+
+bugs:
+x xml.etree.ElementTree.ParseError: unbound prefix: line 1, column 0
+. only some thumbnails showing?
+    => maybe because found GROBID "before" pdftotext?
+. assert 'page_numbers' in issue_meta
+. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 984: invalid continuation byte
+    => while reading pdftotext
+x contribs not coming through
+x are abstracts being searched?
+x "indexed" links broken
+x abstract JATS not getting stripped
+x default type filter as "papers", not "everything"
+- container_original_name (?)
+- still leaking HTML through abstracts
+    => let's do proper highlight escapes
+    => still happening after ES filter (!)
+
+refactors:
+x abstracts in ES schema; maybe don't really need abstract alias, do in query schema?
+    => just make "object" for now
+- fetch_sim_issue / fetch_sim
+- pass through thumbnail URL
+- first_page in sim_fulltext object
+- container metadata in sim pipeline, and pass through for indexing
+
+- UI tweaks
+    .  "hits" spilling over out of side bar
+    .  fatcat_ident links
+    .  pmcid display
+    .  for debugging, a link to search doc (like a tag?)
+    .  need many more schema aliases for biblio fields (eg, title); doctype for doc_type
+    .  fewer but longer highlights
+    .  jinja2 less whitespace (some config flag?)
+    .  query in the search bar (after a search)
+    .  filters actually working
+    .  mobile CSS fixes
+    .  larger font size
+    .  search error page
+    .  i18n and zh examples
+    => change "indexed" tag to an icon (or "json"), and fix QA links
+    => mobile thumbnail could use top thumbnail margin? or all actually?
+    => w3c validate
+
+- experiment: existing archive.org fulltext search, my style UI/UX
+    => merging server-side is tricky... could do async and show a JS popup?
+
+- small ideas
+    x  search ranking:
+        - title boost
+        - biblio
+        - stage boost
+        - have-fulltext boost
+            https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-rank-feature-query.html
+            https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-boosting-query.html
+            https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-bool-query.html
+            https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-function-score-query.html
+    => query boost for language match
+    => query helper to inject more works
+    => 404 and 5xx handlers (web)
+    => tags: OA, SIM, "lit review", DOAJ
+    => add page_numbers to issue_db
+    => title highlighting
+    => biorxiv/medrxiv note
+        => some "indexing hacks" stage?
+    => store snippet of sim_page text to show like an abstract?
+    => user guide page with examples
+    => example queries on front page?
+    => robots.txt?
+
+- later projects/proposals
+    => pass query through to in-book reading
+    => query parser
+    => re-OCR web PDFs with poor/missing OCR