blob: 794349fbd86748f42428bb3434d3d90f1c8a6354 (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
|
- indexing pipeline
=> kafka topics and schemas
- estimate fraction of SIM content with releases in fatcat ("backwards" fraction)
bugs:
. only some thumbnails showing?
=> maybe because found GROBID "before" pdftotext?
. assert 'page_numbers' in issue_meta
- container_original_name (?)
refactors:
- fetch_sim_issue / fetch_sim
- first_page in sim_fulltext object
- container metadata in sim pipeline, and pass through for indexing
- UI tweaks
=> w3c validate
- experiment: existing archive.org fulltext search, my style UI/UX
=> merging server-side is tricky... could do async and show a JS popup?
- small ideas
=> query boost for language match
=> query helper to inject more works
=> 404 and 5xx handlers (web)
=> tags: OA, SIM, "lit review", DOAJ
=> add page_numbers to issue_db
=> title highlighting
=> biorxiv/medrxiv note
=> some "indexing hacks" stage?
=> store snippet of sim_page text to show like an abstract?
=> user guide page with examples
=> example queries on front page?
=> robots.txt?
- later projects/proposals
=> query parser
=> pass query through to in-book reading
=> re-OCR web PDFs with poor/missing OCR
|