proposals/2021-09-09_dataset_ingest.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185


Dataset Ingest Pipeline
=======================

Sandcrawler currently has ingest support for individual files saved as `file`
entities in fatcat (xml and pdf ingest types) and HTML files with
sub-components saved as `webcapture` entities in fatcat (html ingest type).

This document describes extensions to this ingest system to flexibly support
groups of files, which may be represented in fatcat as `fileset` entities. The
new ingest type is `dataset`.

Compared to the existing ingest process, there are two major complications with
datasets:

- the ingest process often requires more than parsing HTML files, and will be
  specific to individual platforms and host software packages
- the storage backend and fatcat entity type is flexible: a dataset might be
  represented by a single file, multiple files combined in to a single .zip
  file, or mulitple separate files; the data may get archived in wayback or in
  an archive.org item

The new concepts of "strategy" and "platform" are introduced to accomodate
these complications.


## Ingest Strategies

The ingest strategy describes the fatcat entity type that will be output; the
storage backend used; and whether an enclosing file format is used. The
strategy to use can not be determined until the number and size of files is
known. It is a function of file count, total file size, and platform.

Strategy names are compact strings with the format
`{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset`
entity type indicates that metadata about multiple files is retained, but that
in the storage backend only a single enclosing file (eg, `.zip`) will be
stored.

The supported strategies are:

- `web-file`: single file of any type, stored in wayback, represented as fatcat `file`
- `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset`
- `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset`
- `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file`
- `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset`
- `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset`

"Bundle" files are things like .zip or .tar.gz. Not all .zip files are handled
as bundles! Only when the transfer from the hosting platform is via a "download
all as .zip" (or similar) do we consider a zipfile a "bundle" and index the
interior files as a fileset.

The term "bundle file" is used over "archive file" or "container file" to
prevent confusion with the other use of those terms in the context of fatcat
(container entities; archive; Internet Archive as an organiztion).

The motivation for supporting both `web` and `archiveorg` is that `web` is
somewhat simpler for small files, but `archiveorg` is better for larger groups
of files (say more than 20) and larger total size (say more than 1 GByte total,
or 128 MByte for any one file).

The motivation for supporting "bundled" filesets is that there is only a single
file to archive.


## Ingest Pseudocode

1. Determine `platform`, which may involve resolving redirects and crawling a landing page.

  a. TODO: do we always try crawling `base_url`? would simplify code flow, but results in extra SPN calls (slow). start with yes, always
  b. TODO: what if we trivially crawl directly to a non-HTML file? Bypass most of the below? `direct-file` strategy?
  c. `infer_platform(request, terminal_url, html_biblio)`

2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`.

3. Use strategy-specific methods to archive all files in platform manifest, and verify manifest metadata.

4. Summarize status and return structured result metadata.

Python APIs, as abstract classes (TODO):

    PlatformDatasetContext
        platform_name
        platform_domain
        platform_id
        manifest
        archiveorg_metadata
        web_base_url
    DatasetPlatformHelper
        match_request(request: Request, resource: Resource, html_biblio: Optional[BiblioMetadata]) -> bool
        process_request(?) -> ?
    StrategyArchiver
        process(manifest, archiveorg_metadata, web_metadata) -> ?
        check_existing(?) -> ?


## New Sandcrawler Code and Worker

    sandcrawler-ingest-fileset-worker@{1..12}

Worker consumes from ingest request topic, produces to fileset ingest results,
and optionally produces to file ingest results.

    sandcrawler-persist-ingest-fileset-worker@1

Simply writes fileset ingest rows in to SQL.

## New Fatcat Worker and Code Changes

    fatcat-import-ingest-fileset-worker

This importer should be modeled on file and web worker. Filters for `success`
with strategy of `*-fileset*`.

Existing `fatcat-import-ingest-file-worker` should be updated to allow
`dataset` single-file imports, with largely same behavior and semantics as
current importer.

TODO: Existing fatcat transforms, and possibly even elasticsearch schemas,
should be updated to include fileset status and `in_ia` flag for dataset type
releases.

TODO: Existing entity updates worker submits `dataset` type ingests to ingest
request topic.


## New SQL Tables

    CREATE TABLE IF NOT EXISTS ingest_fileset_result (
        ingest_type             TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
        base_url                TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
        updated                 TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        hit                     BOOLEAN NOT NULL,
        status                  TEXT CHECK (octet_length(status) >= 1),

        terminal_url            TEXT CHECK (octet_length(terminal_url) >= 1),
        terminal_dt             TEXT CHECK (octet_length(terminal_dt) = 14),
        terminal_status_code    INT,
        terminal_sha1hex        TEXT CHECK (octet_length(terminal_sha1hex) = 40),

        platform                TEXT CHECK (octet_length(platform) >= 1),
        platform_domain         TEXT CHECK (octet_length(platform_domain) >= 1),
        platform_id             TEXT CHECK (octet_length(platform_id) >= 1),
        ingest_strategy         TEXT CHECK (octet_length(ingest_strategy) >= 1),
        total_size              BIGINT,
        file_count              INT,
        item_name               TEXT CHECK (octet_length(item_name) >= 1),
        item_bundle_path        TEXT CHECK (octet_length(item_path_bundle) >= 1),

        manifest                JSONB,
        -- list, similar to fatcat fileset manifest, plus extra:
        --   status (str)
        --   path (str)
        --   size (int)
        --   md5 (str)
        --   sha1 (str)
        --   sha256 (str)
        --   mimetype (str)
        --   platform_url (str)
        --   terminal_url (str)
        --   terminal_dt (str)
        --   extra (dict) (?)

        PRIMARY KEY (ingest_type, base_url)
    );
    CREATE INDEX ingest_fileset_result_terminal_url_idx ON ingest_fileset_result(terminal_url);


## New Kafka Topic and JSON Schema

    
    sandcrawler-ENV.ingest-fileset-results 6x, no retention limit


## Implementation Plan

First implement ingest worker, including platform and strategy helpers, and
test those as simple stdin/stdout CLI tools in sandcrawler repo to validate
this proposal.

Second implement fatcat importer and test locally and/or in QA.

Lastly implement infrastructure, automation, and other "glue".