## Upstream Dumps
Open Library does monthly bulk dumps: <https://archive.org/details/ol_exports?sort=-publicdate>
Latest work dump: <https://openlibrary.org/data/ol_dump_works_latest.txt.gz>

TSV columns:

- type: type of record (/type/edition, /type/work etc.)
- key: unique key of the record (/books/OL1M etc.)
- revision: revision number of the record
- last_modified: last modified timestamp
- JSON: the complete record in JSON format

To inspect the JSON column of the first few records:

    zcat ol_dump_works_latest.txt.gz | cut -f5 | head | jq .

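
The five-column layout can also be read from Python. The following is a minimal sketch, assuming the dump is gzip-compressed, UTF-8 encoded, and has exactly the five tab-separated columns listed above; `DumpRecord`, `parse_dump_line` and `iter_dump` are hypothetical names introduced here for illustration.

```python
import gzip
import json
from typing import Iterator, NamedTuple


class DumpRecord(NamedTuple):
    """One row of the dump, mirroring the five TSV columns."""
    type: str
    key: str
    revision: int
    last_modified: str
    data: dict


def parse_dump_line(line: str) -> DumpRecord:
    # Split on the first four tabs only; everything after that is the JSON
    # column. Tabs inside JSON strings are escaped, so this split is safe.
    type_, key, revision, last_modified, raw_json = line.rstrip("\n").split("\t", 4)
    return DumpRecord(type_, key, int(revision), last_modified, json.loads(raw_json))


def iter_dump(path: str) -> Iterator[DumpRecord]:
    # Stream the file line by line so the whole dump never sits in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_dump_line(line)
```

Iterating over `iter_dump("ol_dump_works_latest.txt.gz")` should yield one record per line, with `record.data` holding the parsed JSON column.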
We are going to want:
- title (with "prefix"?)
- authors
- subtitle
- year
- identifier (work? edition?)
- isbn-13 (if available)
- borrowable or not
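
As a starting point for the list above, here is a rough sketch that pulls some of the wanted fields out of one parsed work record. It assumes (not confirmed in these notes) that a work record carries `key`, `title`, an optional `subtitle`, and an `authors` list whose entries look like `{"author": {"key": "/authors/..."}}`. The function name `work_summary` and the output field names are made up for illustration. Year, isbn-13 and borrowable status are left out because it is not clear from these notes where they are stored.

```python
from typing import Any, Dict, List


def work_summary(work: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a work record to the fields of interest (assumed schema)."""
    author_keys: List[str] = []
    for entry in work.get("authors", []):
        # Assumed shape: {"author": {"key": "/authors/OL...A"}, ...};
        # a plain string key is tolerated as well.
        author = entry.get("author") if isinstance(entry, dict) else None
        if isinstance(author, dict) and "key" in author:
            author_keys.append(author["key"])
        elif isinstance(author, str):
            author_keys.append(author)
    return {
        "identifier": work.get("key"),
        "title": work.get("title"),
        "subtitle": work.get("subtitle"),
        "author_keys": author_keys,
    }
```

Missing fields come back as `None` (or an empty list for authors), so incomplete records do not raise.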
## SOLR export
One time export: <https://archive.org/details/olsolr8-2021-04-12>
Start OL/SOLR, then export to jsonl:
```
$ time solrdump -rows 10000 -verbose -sort "key asc" \
-server http://localhost:8983/solr/openlibrary | \
jq -rc . | zstd -c9 -T0 > ol.jsonl.zst
```
* 35842305 docs in total, by type:
```
24438138 work
8425773 author
2978394 subject
```
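
The per-type counts can be reproduced from the exported JSON lines. This is a minimal sketch, assuming each exported document is one JSON object per line with a `type` field holding values such as `work`, `author` or `subject`; the function name `count_by_type` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_type(lines: Iterable[str], field: str = "type") -> Counter:
    """Count JSON-lines documents by the value of one field."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        # Documents without the field are grouped under "(missing)".
        counts[str(doc.get(field, "(missing)"))] += 1
    return counts
```

One way to use it is to call `count_by_type(sys.stdin)` in a small script fed by `zstd -dc ol.jsonl.zst`, which avoids a third-party zstd binding in Python.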