From 4afefd1f2d5cfe71fa7a52ff89aec63af632c874 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 4 Aug 2020 18:21:54 -0700
Subject: proposal for work grouping

---
 proposals/20200804_grouped_release_exports.md | 60 +++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 proposals/20200804_grouped_release_exports.md

diff --git a/proposals/20200804_grouped_release_exports.md b/proposals/20200804_grouped_release_exports.md
new file mode 100644
index 00000000..d75ba687
--- /dev/null
+++ b/proposals/20200804_grouped_release_exports.md
@@ -0,0 +1,60 @@

Grouped Release Exports
=======================

We want rich "work" entity dumps without duplicating the content of the
enriched release dumps. The motivation is to enable work-level processing of
releases: for example, transforming into scholar.archive.org documents (which
are indexed at the work level), bulk-downloading fulltext (where only the
"best copy for the entire work" is needed), citation analysis, etc.

One elegant way to do this would be to sort release exports by `work_id`, so
that all release entities sharing a `work_id` are contiguous (though not all
on the same line).

The plan is to update the `fatcat-export` tool (Rust) to read an ident list
that includes `work_id` and is sorted by `work_id`.

## Rust Implementation

Remembering that `fatcat-export` operates on lines, and uses channels and
worker threads for processing and serialization, we can't assume that output
lines will be in the same order as input lines (worker threads can shuffle
the order). To preserve grouping, we will add a new dump mode which operates
on batches (`Vec`s of rows) instead of single lines, and writes each batch
out contiguously. The batches themselves may end up out of order relative to
each other, but that is fine. A rough sketch of this batching approach is
included at the end of this proposal.

## SQL Ident Dump

Database timing (run 2020-08-04 on fatcat-qa):

    COPY (SELECT id, rev_id, redirect_id FROM release_ident WHERE is_live=true) TO '/tmp/fatcat_ident_releases.tsv' WITH NULL '';
    => COPY 142635153
    => Time: 143264.483 ms (02:23.264)

    COPY (
        SELECT
            release_ident.id,
            release_ident.rev_id,
            release_ident.redirect_id,
            release_rev.work_ident_id
        FROM
            release_ident
        LEFT JOIN release_rev ON release_ident.rev_id = release_rev.id
        WHERE
            release_ident.is_live=true
            AND release_ident.redirect_id IS NULL
            AND release_ident.rev_id IS NOT NULL
        ORDER BY
            release_rev.work_ident_id ASC NULLS LAST
    ) TO '/tmp/fatcat_ident_releases_by_work.tsv' WITH NULL '';
    => COPY 142635147
    => Time: 610540.112 ms (10:10.540)

The sorted dump is roughly four times slower, but that is not a blocking
issue. Apparently PostgreSQL will do a full sequential scan and sort, instead
of using the existing `work_id`-sorted index on `release_rev`, when reading a
large fraction of the table (here, nearly the entire table), because it
estimates that the sequential read will be so much faster. I don't think that
estimate holds for SSDs. I haven't actually confirmed that this is the
behavior; I'm just assuming based on the PostgreSQL docs.
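One way to confirm, rather than assume: look at the query plan with
`EXPLAIN`, then lower `random_page_cost` (a standard PostgreSQL planner
setting whose default of 4.0 models spinning disks) and check whether the
plan switches from a sequential scan plus sort to an index scan. This is only
a sketch; it has not been run against fatcat-qa:

    -- Check which plan the sorted dump query actually gets:
    EXPLAIN
    SELECT
        release_ident.id,
        release_rev.work_ident_id
    FROM
        release_ident
    LEFT JOIN release_rev ON release_ident.rev_id = release_rev.id
    WHERE
        release_ident.is_live=true
    ORDER BY
        release_rev.work_ident_id ASC NULLS LAST;

    -- Tell the planner that random reads are cheap (typical for SSDs),
    -- then re-run the EXPLAIN above to see whether the plan changes:
    SET random_page_cost = 1.1;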
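## Appendix: Batch Dump Sketch

A minimal sketch of the batch-oriented dump mode described in the Rust
Implementation section above. This is illustrative, not the actual
`fatcat-export` code: it uses a single worker thread and `std::sync::mpsc`
(the real tool uses multiple worker threads), keeps fields as plain strings
instead of parsed UUIDs, and echoes rows instead of fetching and serializing
release entities. Input is assumed to be the work-sorted TSV produced by the
SQL dump above, with `work_ident_id` as the last column.

    use std::io::{self, BufRead, Write};
    use std::sync::mpsc;
    use std::thread;

    // One ident row (id, rev_id, redirect_id, work_ident_id), kept as
    // plain strings for this sketch.
    type IdentRow = Vec<String>;

    fn main() -> io::Result<()> {
        let (batch_tx, batch_rx) = mpsc::channel::<Vec<IdentRow>>();
        let (out_tx, out_rx) = mpsc::channel::<String>();

        // Worker: turns a whole batch into a single output chunk. The
        // real dump mode would fetch and serialize each release here.
        let worker = thread::spawn(move || {
            for batch in batch_rx {
                let mut chunk = String::new();
                for row in batch {
                    chunk.push_str(&row.join("\t"));
                    chunk.push('\n');
                }
                if out_tx.send(chunk).is_err() {
                    break;
                }
            }
        });

        // Writer: chunks may arrive in any order, but each is written
        // contiguously, so one work's releases never interleave with
        // another's.
        let writer = thread::spawn(move || {
            let stdout = io::stdout();
            let mut out = stdout.lock();
            for chunk in out_rx {
                out.write_all(chunk.as_bytes()).unwrap();
            }
        });

        // Reader: rows arrive sorted by work_id (last column); cut a new
        // batch whenever the work_id changes.
        let stdin = io::stdin();
        let mut batch: Vec<IdentRow> = Vec::new();
        let mut current_work: Option<String> = None;
        for line in stdin.lock().lines() {
            let row: IdentRow = line?.split('\t').map(String::from).collect();
            let work_id = row.last().cloned();
            if !batch.is_empty() && work_id != current_work {
                batch_tx.send(std::mem::take(&mut batch)).unwrap();
            }
            current_work = work_id;
            batch.push(row);
        }
        if !batch.is_empty() {
            batch_tx.send(batch).unwrap();
        }
        drop(batch_tx); // close the channel so the worker exits

        worker.join().unwrap();
        writer.join().unwrap();
        Ok(())
    }

Because each work's rows travel in a single batch, the grouping survives any
re-ordering between batches. This sketch emits one batch per work; a real
implementation could pack many works into each batch for throughput, as long
as no single work ever spans two batches.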