notes/plan.txt


x write proposals
    => overview
    => document-per-work schema
    => URL structure
    => fatcat indexing pipeline
    => microfilm indexing pipeline
x fastapi skeleton
    => pipenv
    => jinja2 templates
x sketch out elasticsearch schema
x issue db
x release w/ or w/o sim pipeline
    => start with work_ident
    => fetch releases
    => discover/match to ia_sim item (stub)
x sim w/o release pipeline
    => check if there are fatcat releases for issue
    => otherwise iterate over entire issue generating pages
x example corpus: fatcat papers w/ GROBID TEI-XML
    => start with covid19 corpus/pipeline
x example corpus: sim microfilm
x check release w/ sim pipeline
x release index pdftotext hack
    => test on laptop
- indexing pipeline skeleton
    x  postgrest access
    x  intermediate "heavy" schema
    => kafka topics and schemas
    => minio/seaweedfs access
- estimate fraction of SIM content with releases in fatcat ("backwards" fraction)
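
sketch for the "backwards" fraction estimate above: sample SIM issues at random and
count how many have at least one fatcat release. Everything here is an assumption
for illustration: `has_releases` is a hypothetical callable (the real lookup would
presumably go through the issue db or fatcat), and this counts issues, not pages.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def estimate_release_fraction(
    issue_ids: Sequence[str],
    has_releases: Callable[[str], bool],
    sample_size: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (fraction, standard error) of sampled issues that have releases.

    `has_releases` is a hypothetical lookup supplied by the caller.
    """
    n = min(sample_size, len(issue_ids))
    if n == 0:
        raise ValueError("need at least one issue to sample")
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    sample = rng.sample(list(issue_ids), n)  # sampling without replacement
    hits = sum(1 for issue_id in sample if has_releases(issue_id))
    p = hits / n
    # Binomial standard error; ignores the finite-population correction
    stderr = math.sqrt(p * (1.0 - p) / n)
    return p, stderr
```

A few hundred sampled issues should be enough for a rough number; weighting by page
count per issue would be closer to a "fraction of content" than this per-issue count.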

bugs:
x xml.etree.ElementTree.ParseError: unbound prefix: line 1, column 0
. only some thumbnails showing?
    => maybe because found GROBID "before" pdftotext?
. assert 'page_numbers' in issue_meta
. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 984: invalid continuation byte
    => while reading pdftotext
x contribs not coming through
x are abstracts being searched?
x "indexed" links broken
x abstract JATS not getting stripped
x default type filter as "papers", not "everything"
- container_original_name (?)
- still leaking HTML through abstracts
    => let's do proper highlight escapes
    => still happening after ES filter (!)
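
sketch for the "proper highlight escapes" item above: ask ES to wrap hits in sentinel
characters instead of HTML tags, escape the whole fragment, then swap the sentinels
for <em>. The sentinel values and the function name are assumptions, not existing
code; the sentinels would be passed as pre_tags/post_tags in the highlight request.

```python
import html

# Hypothetical sentinels (Unicode private-use characters), assumed to be
# configured as the highlight pre_tags / post_tags in the search request.
HL_OPEN = "\ue000"
HL_CLOSE = "\ue001"


def render_highlight(fragment: str) -> str:
    """Escape all markup in a highlight fragment, then restore the hit markers."""
    # html.escape leaves the private-use sentinels untouched
    escaped = html.escape(fragment, quote=True)
    return escaped.replace(HL_OPEN, "<em>").replace(HL_CLOSE, "</em>")
```

The output would then be marked safe in the jinja2 template. An alternative worth
checking is the ES highlighter's own `encoder: html` option, which escapes the
snippet before inserting the highlight tags.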

refactors:
x abstracts in ES schema; maybe don't really need abstract alias, do in query schema?
    => just make "object" for now
- fetch_sim_issue / fetch_sim
- pass through thumbnail URL
- first_page in sim_fulltext object
- container metadata in sim pipeline, and pass through for indexing

- UI tweaks
    .  "hits" spilling over out of side bar
    .  fatcat_ident links
    .  pmcid display
    .  for debugging, a link to search doc (like a tag?)
    .  need many more schema aliases for biblio fields (eg, title); doctype for doc_type
    .  fewer but longer highlights
    .  jinja2 less whitespace (trim_blocks / lstrip_blocks Environment flags?)
    .  query in the search bar (after a search)
    .  filters actually working
    .  mobile CSS fixes
    .  larger font size
    .  search error page
    .  i18n and zh examples
    => change "indexed" tag to an icon (or "json"), and fix QA links
    => mobile thumbnail could use top thumbnail margin? or all actually?
    => w3c validate

- experiment: existing archive.org fulltext search, my style UI/UX
    => merging server-side is tricky... could do async and show a JS popup?

- small ideas
    x  search ranking:
        - title boost
        - biblio
        - stage boost
        - have-fulltext boost
            https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-rank-feature-query.html
            https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-boosting-query.html
            https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-bool-query.html
            https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-function-score-query.html
    => query boost for language match
    => query helper to inject more works
    => 404 and 5xx handlers (web)
    => tags: OA, SIM, "lit review", DOAJ
    => add page_numbers to issue_db
    => title highlighting
    => biorxiv/medrxiv note
        => some "indexing hacks" stage?
    => store snippet of sim_page text to show like an abstract?
    => user guide page with examples
    => example queries on front page?
    => robots.txt?
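
sketch for the search ranking boosts listed above (title, biblio, stage,
have-fulltext, plus the language-match idea), written as a plain bool query with
boosted "should" clauses. All field names (title, biblio_all, everything,
release_stage, has_fulltext, lang) and boost values are assumptions, not the actual
schema. This is not the rank_feature approach from the first link, which would need
rank_feature-typed fields in the mapping.

```python
from typing import Any, Dict, List, Optional


def build_ranked_query(q: str, lang: Optional[str] = None) -> Dict[str, Any]:
    """Build an Elasticsearch request body with ranking boosts.

    Field names and boost values are hypothetical placeholders.
    """
    # "should" clauses do not filter; they only raise the score of matching docs
    should: List[Dict[str, Any]] = [
        {"term": {"release_stage": {"value": "published", "boost": 2.0}}},
        {"term": {"has_fulltext": {"value": True, "boost": 1.5}}},
    ]
    if lang:
        # language-match boost, e.g. from the user's UI language
        should.append({"term": {"lang": {"value": lang, "boost": 1.2}}})
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": q,
                            # "field^N" multiplies that field's score by N
                            "fields": ["title^3", "biblio_all^2", "everything"],
                        }
                    }
                ],
                "should": should,
            }
        }
    }
```

Keeping the boosts in "should" means unpublished or metadata-only hits still show
up, just lower; the boost values would need tuning against real queries.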

- later projects/proposals
    => pass query through to in-book reading
    => query parser
    => re-OCR web PDFs with poor/missing OCR