1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
|
Grouped Release Exports
=======================
Want to have rich "work" entity dumps, without duplicating with enriched
release dumps. The motivation for this is to do work-level processing of
releases. For example, transform into scholar.archive.org (indexed at work
level), bulk downloading fulltext (only need "best copy for entire work"),
citation analysis, etc.
One elegant way to do this would be having release exports be sorted by
`work_id`, with all release entities with the same `work_id` contiguous (but
not all on the same line).
Plan is to update fatcat-export tool (rust) to read in an ident list with
`work_id` included (and sorted by the `work_id`).
## Rust Implementation
Remembering that `fatcat-export` operates on lines, and uses channels and
worker threads for processing and serialization, we can't just assume that the
output lines will be in the same order as the input lines (eg, worker threads
can shift order). To preserve order, we will add a new dump mode which operates
on batches (`Vec`) of rows instead, and write the batches contiguously. The
batches themselves may end up out of order, but that is fine.
## SQL Ident Dump
Database timing (running 2020-08-04 on fatcat-qa):
COPY (SELECT id, rev_id, redirect_id FROM release_ident WHERE is_live=true) TO '/tmp/fatcat_ident_releases.tsv' WITH NULL '';
=> COPY 142635153
=> Time: 143264.483 ms (02:23.264)
COPY (
SELECT
release_ident.id,
release_ident.rev_id,
release_ident.redirect_id,
release_rev.work_ident_id
FROM
release_ident
LEFT JOIN release_rev ON release_ident.rev_id = release_rev.id
WHERE
release_ident.is_live=true
AND release_ident.redirect_id IS NULL
AND release_ident.rev_id IS NOT NULL
ORDER BY
release_rev.work_ident_id ASC NULLS LAST
) TO '/tmp/fatcat_ident_releases_by_work.tsv' WITH NULL '';
=> COPY 142635147
=> Time: 610540.112 ms (10:10.540)
Much slower, but not a blocking issue. Apparently postgresql will do a full
rowscan instead of using the existing `work_id` sorted index on `release_rev`
when going of a large fraction of the table (here almost the entire table)
because it thinks the read will be so much faster. I don't think this is true
with SSDs? Haven't actually confirmed this is the behavior, just assuming based
on postgresql docs.
|