
## Upstream Dumps

Open Library publishes monthly bulk dumps: <https://archive.org/details/ol_exports?sort=-publicdate>

Latest works dump: <https://openlibrary.org/data/ol_dump_works_latest.txt.gz>

TSV columns:

    type - type of record (/type/edition, /type/work etc.)
    key - unique key of the record (/books/OL1M etc.)
    revision - revision number of the record
    last_modified - last modified timestamp
    JSON - the complete record in JSON format

To peek at the JSON column of the first few records:

    zcat ol_dump_works_latest.txt.gz | cut -f5 | head | jq .
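The same inspection can be done from Python when the other four columns are needed alongside the JSON. This is a minimal sketch, assuming the five-column layout listed above and that the JSON column contains no raw tab characters; `iter_dump_records` and `open_dump` are hypothetical helper names, not part of any Open Library tooling.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_dump_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per dump line: the four TSV columns plus the parsed JSON."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first four tabs only, so the JSON column stays in one piece.
        rec_type, key, revision, last_modified, raw_json = line.split("\t", 4)
        yield {
            "type": rec_type,
            "key": key,
            "revision": int(revision),
            "last_modified": last_modified,
            "record": json.loads(raw_json),
        }


def open_dump(path: str):
    """Open a gzipped dump as a text stream (hypothetical convenience wrapper)."""
    return gzip.open(path, "rt", encoding="utf-8")
```

Usage would look like `for rec in iter_dump_records(open_dump("ol_dump_works_latest.txt.gz")): ...`, which streams the file rather than loading it into memory.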

We are going to want:

- title (with "prefix"?)
- authors
- subtitle
- year
- identifier (work? edition?)
- isbn-13 (if available)
- borrowable or not
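A rough sketch of pulling some of these fields out of a single work record. The JSON field names used here (`title`, `subtitle`, `authors[].author.key`, `first_publish_date`) are assumptions about the work schema and should be checked against real dump rows; `work_summary` is a hypothetical name. ISBN-13 and borrowable status are not attempted, since they do not obviously live on the work record itself (ISBNs, as far as I know, are attached to editions).

```python
import re


def work_summary(record: dict) -> dict:
    """Extract a flat summary from one work JSON record (field names are assumed)."""
    author_keys = []
    for entry in record.get("authors") or []:
        author = entry.get("author") if isinstance(entry, dict) else None
        # Assumed shapes: {"author": {"key": "/authors/OL1A"}} or {"author": "/authors/OL1A"}
        if isinstance(author, dict):
            author = author.get("key")
        if author:
            author_keys.append(author)

    # Dates are free text in practice, so just look for a plausible 4-digit year.
    year_match = re.search(
        r"\b(1[0-9]{3}|20[0-9]{2})\b", record.get("first_publish_date") or ""
    )
    return {
        "identifier": record.get("key"),
        "title": record.get("title"),
        "subtitle": record.get("subtitle"),
        "author_keys": author_keys,
        "year": int(year_match.group(1)) if year_match else None,
    }
```

Missing fields come back as `None` (or an empty list for authors) rather than raising, which seems like the safer default for a bulk pass over records of uneven quality.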

## Solr Export

One-time export: <https://archive.org/details/olsolr8-2021-04-12>

Start the Open Library Solr instance, then export all docs to JSONL with `solrdump`:

```
$ time solrdump -rows 10000 -verbose -sort "key asc" \
    -server http://localhost:8983/solr/openlibrary | \
    jq -rc . | zstd -c9 -T0 > ol.jsonl.zst
```

The export contains 35842305 docs in total, broken down as follows:

```
24438138 work
8425773 author
2978394 subject
```
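A breakdown like the one above can be reproduced from the JSONL export with a short script. This is a sketch only: it assumes each Solr doc carries a `type` field with values such as `work`, `author`, and `subject`, which is inferred from the labels above rather than confirmed; `count_by_type` is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_type(lines: Iterable[str], field: str = "type") -> Counter:
    """Count JSONL docs by the value of one field (assumed to be `type`)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # Docs lacking the field are counted under a placeholder instead of dropped.
        counts[str(doc.get(field, "(missing)"))] += 1
    return counts
```

To run it over the compressed export without a third-party zstd binding, decompress on the shell side and pass `sys.stdin` as `lines`, e.g. `zstdcat ol.jsonl.zst | python count_types.py` (script name hypothetical).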