blob: 68618a4a431fade356805fbf2aa6b44a836b9552 (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
|
- have i18n stuff set 'Content-Language' in response header
- only add id_ to PDF replay (not other content types)
- add 'hdl' to schema (release extid)
human names:
https://www.librarian.net/stax/5222/what-you-learn-in-library-school-whats-in-a-name/
- refactor publisher domain/link lookup into python code (not jinja2 template logic)
- async elasticsearch: https://elasticsearch-py.readthedocs.io/en/v7.10.1/async.html
=> not until elasticsearch-dsl support exists
- adopt a better "remove tags" library in clean_str() (replace bs4)
https://stackoverflow.com/a/57648173/4682349
work in progress:
- SIM respect 'noindex' (verify)
- SIM fetch much less metadata (changes in API?)
copy editing:
- "how it works" page
- "Contribute" -> "How To Participate"
- web.archive.org not found -> resolved
workers:
- fetch worker: filter by changelog or 'updated' datetime
- work deletion: some bundle version/variant to allow deleting ES documents?
- SIM updater
x what is the process for updating issue DB? cronjob? at least have a makefile target
=> not really a problem now that SIM is ~done
SIM:
x SIM parallelism
x fatcat lookup: ISSN/ISSN-L
- ambiguity. sim_pubid only extra? are both ISSNs available?
- add page count to issn-db (?)
content/pipeline:
- add gzip to intermediate files pipeline commands
- makefile targets for bulk ingest
cleanups:
x "web assets" (CSS etc) in this repo or on *.archive.org in general
x have "json to IntermediateBundle" be a helper method, instead of multiple implementations
- better typing/annotation of work pipeline
- test coverage
- use settings.toml for defaults of CLI args
ponder:
- "search inside" phrasing
- "counts" target to summarize (to console)
- Onion-Location header
- <meta> description
- canonical links?
- jinja2: "if xyz is defined" better than "if xyz"
- "default" translation option (clear prefix, use browser default)
- should SERP page have an <h1>? "Search Results"? hidden?
data quality:
- handle sim_issue items with multiple issues in single item (eg, issue="3-4")
|