Dataset Ingest Pipeline
=======================

Sandcrawler currently has ingest support for individual files saved as `file`
entities in fatcat (xml and pdf ingest types) and HTML files with
sub-components saved as `webcapture` entities in fatcat (html ingest type).

This document describes extensions to this ingest system to flexibly support
groups of files, which may be represented in fatcat as `fileset` entities. The
new ingest type is `dataset`.

Compared to the existing ingest process, there are two major complications with
datasets:

- the ingest process often requires more than parsing HTML files, and will be
  specific to individual platforms and host software packages
- the storage backend and fatcat entity type is flexible: a dataset might be
  represented by a single file, multiple files combined into a single .zip
  file, or multiple separate files; the data may get archived in wayback or in
  an archive.org item

The new concepts of "strategy" and "platform" are introduced to accommodate
these complications.


## Ingest Strategies

The ingest strategy describes the fatcat entity type that will be output; the
storage backend used; and whether an enclosing file format is used. The
strategy to use cannot be determined until the number and size of files are
known. It is a function of file count, total file size, and platform.

Strategy names are compact strings with the format
`{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset`
entity type indicates that metadata about multiple files is retained, but that
in the storage backend only a single enclosing file (eg, `.zip`) will be
stored.

The supported strategies are:

- `web-file`: single file of any type, stored in wayback, represented as fatcat `file`
- `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset`
- `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset`
- `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file`
- `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset`
- `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset`

"Bundle" files are things like .zip or .tar.gz. Not all .zip files are handled
as bundles! Only when the transfer from the hosting platform is via a "download
all as .zip" (or similar) do we consider a zipfile a "bundle" and index the
interior files as a fileset.

The term "bundle file" is used over "archive file" or "container file" to
prevent confusion with the other use of those terms in the context of fatcat
(container entities; archive; Internet Archive as an organization).

The motivation for supporting both `web` and `archiveorg` is that `web` is
somewhat simpler for small files, but `archiveorg` is better for larger groups
of files (say more than 20) and larger total size (say more than 1 GByte total,
or 128 MByte for any one file).

The motivation for supporting "bundled" filesets is that there is only a single
file to archive.
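
As a concrete illustration, strategy selection might look roughly like the
sketch below. The function name and exact cutoffs (20 files, 1 GByte total,
128 MByte for any one file, taken from the guidance above) are starting points
rather than settled policy, and platform-specific overrides are not shown.

    def choose_strategy(file_count: int, total_size: int, largest_file: int,
                        bundle_available: bool = False) -> str:
        """Pick an ingest strategy from manifest stats (illustrative sketch).

        bundle_available: the platform offers a "download all as .zip" (or
        similar) transfer, so only a single enclosing file would be archived.
        """
        # larger groups of files or larger total sizes go to archive.org items
        if (
            file_count > 20
            or total_size > 1024 * 1024 * 1024       # ~1 GByte total
            or largest_file > 128 * 1024 * 1024      # ~128 MByte for any one file
        ):
            backend = "archiveorg"
        else:
            backend = "web"

        if file_count == 1:
            return f"{backend}-file"
        if bundle_available:
            return f"{backend}-fileset-bundled"
        return f"{backend}-fileset"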


## Ingest Pseudocode

1. Determine `platform`, which may involve resolving redirects and crawling a landing page.

  a. TODO: do we always try crawling `base_url`? would simplify code flow, but results in extra SPN calls (slow). start with yes, always
  b. TODO: what if we trivially crawl directly to a non-HTML file? Bypass most of the below? `direct-file` strategy?
  c. `infer_platform(request, terminal_url, html_biblio)`

2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`.

3. Use strategy-specific methods to archive all files in platform manifest, and verify manifest metadata.

4. Summarize status and return structured result metadata.
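
Sketched as code, the flow above might look something like the following. All
helper names and signatures here (`crawl_resource`, `parse_html_biblio`, the
`ARCHIVERS` registry, the result dict shape) are hypothetical, and are only
meant to tie the four steps to the interfaces listed below.

    def process_dataset_request(request: dict, spn_client) -> dict:
        """Hypothetical top-level flow for a 'dataset' ingest request."""
        # 1. crawl base_url (following redirects) and infer the platform
        resource = spn_client.crawl_resource(request["base_url"])   # assumed helper
        html_biblio = parse_html_biblio(resource)                   # assumed helper
        platform_helper = infer_platform(request, resource.terminal_url, html_biblio)

        # 2. platform-specific: fetch the file manifest, then pick a strategy
        ctx = platform_helper.process_request(request, resource, html_biblio)
        ingest_strategy = choose_strategy(
            file_count=len(ctx.manifest),
            total_size=sum(f["size"] for f in ctx.manifest),
            largest_file=max(f["size"] for f in ctx.manifest),
        )

        # 3. strategy-specific: archive all files and verify the manifest
        archiver = ARCHIVERS[ingest_strategy]                        # assumed registry
        archiver.process(ctx.manifest, ctx.archiveorg_metadata, ctx.web_base_url)

        # 4. summarize status and return structured result metadata
        # (error handling and non-success statuses omitted in this sketch)
        return {
            "ingest_type": "dataset",
            "base_url": request["base_url"],
            "status": "success",
            "platform": ctx.platform_name,
            "platform_domain": ctx.platform_domain,
            "platform_id": ctx.platform_id,
            "ingest_strategy": ingest_strategy,
            "file_count": len(ctx.manifest),
            "total_size": sum(f["size"] for f in ctx.manifest),
            "manifest": ctx.manifest,
        }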

Python APIs, as abstract classes (TODO):

    PlatformDatasetContext
        platform_name
        platform_domain
        platform_id
        manifest
        archiveorg_metadata
        web_base_url
    DatasetPlatformHelper
        match_request(request: Request, resource: Resource, html_biblio: Optional[BiblioMetadata]) -> bool
        process_request(?) -> ?
    StrategyArchiver
        process(manifest, archiveorg_metadata, web_metadata) -> ?
        check_existing(?) -> ?
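
One possible concrete spelling of these classes, using `dataclasses` and
`abc`. The `?` placeholders above are deliberately left open here as well, and
the `Request`/`Resource`/`BiblioMetadata` types would come from existing
sandcrawler code, so parameters are left untyped in this sketch.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass
    from typing import Any, Dict, List, Optional

    @dataclass
    class PlatformDatasetContext:
        platform_name: str
        platform_domain: str
        platform_id: str
        manifest: List[Dict[str, Any]]
        archiveorg_metadata: Optional[Dict[str, Any]] = None
        web_base_url: Optional[str] = None

    class DatasetPlatformHelper(ABC):
        @abstractmethod
        def match_request(self, request, resource, html_biblio) -> bool:
            """Does this helper recognize the request/landing page as its platform?"""
            ...

        @abstractmethod
        def process_request(self, request, resource, html_biblio) -> PlatformDatasetContext:
            """Fetch platform metadata and build the file manifest."""
            ...

    class StrategyArchiver(ABC):
        @abstractmethod
        def process(self, manifest, archiveorg_metadata, web_metadata):
            """Archive every file in the manifest and verify sizes/hashes."""
            ...

        @abstractmethod
        def check_existing(self, manifest):
            """Check whether this content was already archived previously."""
            ...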


## New Sandcrawler Code and Worker

    sandcrawler-ingest-fileset-worker@{1..12}

Worker consumes from the ingest request topic, produces to the fileset ingest
results topic, and optionally produces to the file ingest results topic.
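
An illustrative consume/produce loop, using `confluent-kafka` directly; the
real worker would presumably reuse sandcrawler's existing Kafka worker
infrastructure, and the request topic name below is an assumption (only the
results topic name comes from this proposal):

    import json

    from confluent_kafka import Consumer, Producer

    def run_fileset_ingest_worker(brokers: str, env: str, spn_client) -> None:
        consumer = Consumer({
            "bootstrap.servers": brokers,
            "group.id": f"sandcrawler-{env}-ingest-fileset-worker",
            "auto.offset.reset": "earliest",
        })
        producer = Producer({"bootstrap.servers": brokers})
        # request topic name is an assumption
        consumer.subscribe([f"sandcrawler-{env}.ingest-file-requests"])

        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            request = json.loads(msg.value())
            if request.get("ingest_type") != "dataset":
                continue
            result = process_dataset_request(request, spn_client)  # see flow sketch above
            producer.produce(
                f"sandcrawler-{env}.ingest-fileset-results",
                json.dumps(result).encode("utf-8"),
            )
            # single-file outcomes could optionally also be produced to the
            # file ingest results topic here
            producer.flush()

Filtering on `ingest_type == "dataset"` assumes dataset requests share the
existing ingest request topic; a dedicated request topic would work the same
way.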

    sandcrawler-persist-ingest-fileset-worker@1

Simply writes fileset ingest rows into SQL.

## New Fatcat Worker and Code Changes

    fatcat-import-ingest-fileset-worker

This importer should be modeled on the file and web workers. It filters for
`success` results with a strategy of `*-fileset*`.

Existing `fatcat-import-ingest-file-worker` should be updated to allow
`dataset` single-file imports, with largely the same behavior and semantics as
the current importer.

TODO: Existing fatcat transforms, and possibly even elasticsearch schemas,
should be updated to include fileset status and `in_ia` flag for dataset type
releases.

TODO: Existing entity updates worker should submit `dataset` type ingests to
the ingest request topic.


## New SQL Tables

    CREATE TABLE IF NOT EXISTS ingest_fileset_result (
        ingest_type             TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
        base_url                TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
        updated                 TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        hit                     BOOLEAN NOT NULL,
        status                  TEXT CHECK (octet_length(status) >= 1),

        terminal_url            TEXT CHECK (octet_length(terminal_url) >= 1),
        terminal_dt             TEXT CHECK (octet_length(terminal_dt) = 14),
        terminal_status_code    INT,
        terminal_sha1hex        TEXT CHECK (octet_length(terminal_sha1hex) = 40),

        platform                TEXT CHECK (octet_length(platform) >= 1),
        platform_domain         TEXT CHECK (octet_length(platform_domain) >= 1),
        platform_id             TEXT CHECK (octet_length(platform_id) >= 1),
        ingest_strategy         TEXT CHECK (octet_length(ingest_strategy) >= 1),
        total_size              BIGINT,
        file_count              INT,
        item_name               TEXT CHECK (octet_length(item_name) >= 1),
        item_bundle_path        TEXT CHECK (octet_length(item_bundle_path) >= 1),

        manifest                JSONB,
        -- list, similar to fatcat fileset manifest, plus extra:
        --   status (str)
        --   path (str)
        --   size (int)
        --   md5 (str)
        --   sha1 (str)
        --   sha256 (str)
        --   mimetype (str)
        --   platform_url (str)
        --   terminal_url (str)
        --   terminal_dt (str)
        --   extra (dict) (?)

        PRIMARY KEY (ingest_type, base_url)
    );
    CREATE INDEX ingest_fileset_result_terminal_url_idx ON ingest_fileset_result(terminal_url);
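
For the persist worker, writing a result row can be an upsert on the primary
key. A minimal sketch using psycopg2, assuming a database connection and a
result dict shaped like the columns above (only a subset of columns shown;
field names are illustrative):

    from psycopg2.extras import Json

    def persist_fileset_result(db_conn, result: dict) -> None:
        """Sketch: upsert one ingest_fileset_result row (subset of columns)."""
        with db_conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO ingest_fileset_result
                    (ingest_type, base_url, hit, status, platform, platform_domain,
                     platform_id, ingest_strategy, total_size, file_count, manifest)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                ON CONFLICT (ingest_type, base_url) DO UPDATE SET
                    updated = now(),
                    hit = EXCLUDED.hit,
                    status = EXCLUDED.status,
                    ingest_strategy = EXCLUDED.ingest_strategy,
                    manifest = EXCLUDED.manifest
                """,
                (
                    result["ingest_type"],
                    result["base_url"],
                    result.get("status") == "success",
                    result.get("status"),
                    result.get("platform"),
                    result.get("platform_domain"),
                    result.get("platform_id"),
                    result.get("ingest_strategy"),
                    result.get("total_size"),
                    result.get("file_count"),
                    Json(result["manifest"]) if result.get("manifest") else None,
                ),
            )
        db_conn.commit()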


## New Kafka Topic and JSON Schema

    sandcrawler-ENV.ingest-fileset-results 6x, no retention limit


## Implementation Plan

First, implement the ingest worker, including platform and strategy helpers,
and test them as simple stdin/stdout CLI tools in the sandcrawler repo to
validate this proposal.

Second, implement the fatcat importer and test it locally and/or in QA.

Lastly, implement infrastructure, automation, and other "glue".


## Example Entities

### ArchiveOrg: CAT dataset

<https://archive.org/details/CAT_DATASET>

`release_36vy7s5gtba67fmyxlmijpsaui`
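
For items that already exist on archive.org like this one, a platform helper
could build a manifest from the public metadata API
(`https://archive.org/metadata/{item}`). A rough sketch:

    import requests

    def archiveorg_item_manifest(item_name: str) -> list:
        """Sketch: list original (uploaded) files in an existing archive.org item."""
        obj = requests.get(f"https://archive.org/metadata/{item_name}").json()
        manifest = []
        for f in obj.get("files", []):
            # skip archive.org-generated derivatives; keep original uploads
            if f.get("source") != "original":
                continue
            manifest.append({
                "path": f["name"],
                "size": int(f["size"]) if f.get("size") else None,
                "md5": f.get("md5"),
                "sha1": f.get("sha1"),
                # note: "format" here is an archive.org label (eg "Text PDF"), not a mimetype
                "format": f.get("format"),
            })
        return manifest

A real helper would additionally need to filter out archive.org's own metadata
files (`*_meta.xml` and similar) before mapping the result onto the
`archiveorg-fileset` strategy.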

### ArchiveOrg: academictorrents mirror item

<https://archive.org/details/academictorrents_70e0794e2292fc051a13f05ea6f5b6c16f3d3635>

doi:10.1371/journal.pone.0120448

Single .rar file

### Dataverse (dataverse.rsu.lv)

<https://dataverse.rsu.lv/dataset.xhtml?persistentId=doi:10.48510/FK2/IJO02B>

Single Excel file

### Dataverse (dataverse.harvard.edu)

<https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CLSFKX&version=1.1>

doi:10.7910/DVN/CLSFKX

Multiple files; multiple versions?

API fetch: <https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/CLSFKX&version=1.1>

    .data.id
    .data.latestVersion.datasetPersistentId
    .data.latestVersion.versionNumber, .versionMinorNumber
    .data.latestVersion.files[]
        .dataFile
            .contentType (mimetype)
            .filename
            .filesize (int, bytes)
            .md5
            .persistentId
            .description
        .label (filename?)
        .version

Single file inside: <https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/CLSFKX/XWEHBB>

Download single file: <https://dataverse.harvard.edu/api/access/datafile/:persistentId/?persistentId=doi:10.7910/DVN/CLSFKX/XWEHBB> (redirects to AWS S3)
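
Putting these pieces together, a Dataverse platform helper might flatten the
API response into a manifest roughly like the sketch below (error handling,
pagination, and version selection are ignored; whether `label` or
`dataFile.filename` is the right filename is an open question noted above):

    import requests

    def dataverse_manifest(base_url: str, dataset_persistent_id: str) -> list:
        """Sketch: build a manifest from a Dataverse dataset's latestVersion file list."""
        resp = requests.get(
            f"{base_url}/api/datasets/:persistentId/",
            params={"persistentId": dataset_persistent_id},
        )
        resp.raise_for_status()
        version = resp.json()["data"]["latestVersion"]

        manifest = []
        for f in version["files"]:
            df = f["dataFile"]
            manifest.append({
                "path": f.get("label") or df.get("filename"),
                "size": df.get("filesize"),
                "md5": df.get("md5"),
                "mimetype": df.get("contentType"),
                # file-level persistentIds are optional, per-instance (see refs below)
                "platform_url": (
                    f"{base_url}/api/access/datafile/:persistentId/?persistentId={df['persistentId']}"
                    if df.get("persistentId")
                    else None
                ),
            })
        return manifest

For the example above, this would be called as
`dataverse_manifest("https://dataverse.harvard.edu", "doi:10.7910/DVN/CLSFKX")`.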

Dataverse refs:
- 'doi' and 'hdl' are the two persistentId styles
- file-level persistentIds are optional, on a per-instance basis: https://guides.dataverse.org/en/latest/installation/config.html#filepidsenabled