First round of production dataset ingest. Aiming to get one or two small
repositories entirely covered, and a few thousand datasets from all supported
platforms.

Planning to run sandcrawler in batch mode on `wbgrp-svc263`, expecting up to a
terabyte of content locally (on spinning disk). Successful results will be run
through fatcat import; for a subset of the unsuccessful ones, a small heritrix
crawl will be started.


## Ingest Generation

Summary:

    wc -l /srv/fatcat/tasks/ingest_dataset_*pilot.json
          2 /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json
       1702 /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
       2975 /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
      10000 /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
      10000 /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json

All of the ingest requests below were combined into a single large file:

    cat /srv/fatcat/tasks/ingest_dataset*pilot.json | shuf | pv -l | gzip > /srv/fatcat/tasks/ingest_dataset_combined.json.gz
    # 24.7k 0:00:00 [91.9k/s]

### Figshare

- sample 10k datasets (not other release types)
- want only "versioned" DOIs; use a regex on the DOI to ensure this

    ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.6084 type:dataset' \
        | rg '10\.6084/m9\.figshare\.\d+.v\d+' \
        | shuf -n10000 \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
    # Counter({'estimate': 505968, 'ingest_request': 50000, 'elasticsearch_release': 50000})

### Zenodo

- has DOIs (of course)
- want only "versioned" DOIs? how to skip?
- sample 10k

    ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.5281 type:dataset' \
        | rg '10\.5281/zenodo' \
        | shuf -n10000 \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json

### Goettingen Research Online

- <https://data.goettingen-research-online.de/>
- Dataverse instance, not Harvard-hosted
- ~1,400 datasets, ~10,500 files
- has DOIs
- `doi_prefix:10.25625`, then filter to DOIs with only one slash (dataset-level DOIs, not per-file DOIs)

    ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query 'doi_prefix:10.25625 type:dataset' \
        | rg -v '10\.25625/[a-z0-9]+/[a-z0-9]' \
        | shuf \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
    # Counter({'ingest_request': 12739, 'elasticsearch_release': 12739, 'estimate': 12739})
    # 1.7k 0:01:29 [  19 /s]

### Harvard Dataverse

- main Harvard Dataverse instance, with many "sub-dataverses"
- ~137,000 datasets, ~1,400,000 files
- 10k sample

    ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.7910 type:dataset' \
        | rg '10\.7910/dvn/[a-z0-9]{6}' \
        | rg -v '10\.7910/dvn/[a-z0-9]{6}/[a-z0-9]' \
        | shuf -n10000 \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
    # Counter({'estimate': 660979, 'ingest_request': 50000, 'elasticsearch_release': 50000})
    # 2.97k 0:03:26 [14.4 /s]

Note that this yielded fewer requests than expected (~3k rather than 10k), but moving on anyway.

### archive.org

A couple of hand-picked items.

"CAT" dataset
- item: <https://archive.org/details/CAT_DATASET>
- fatcat release (for paper): `release_36vy7s5gtba67fmyxlmijpsaui`

"The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing"
- item: <https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62>
- fatcat release (for paper): <https://fatcat.wiki/release/7owybd2hrvdmdpm4zpo7hkn2pu>


    {
        "ingest_type": "dataset",
        "ingest_request_source": "savepapernow",
        "base_url": "https://archive.org/details/CAT_DATASET",
        "release_stage": "published",
        "fatcat": {
            "release_ident": "36vy7s5gtba67fmyxlmijpsaui",
            "work_ident": "ycqtbhnfmzamheq2amztiwbsri"
        },
        "ext_ids": {},
        "link_source": "spn",
        "link_source_id": "36vy7s5gtba67fmyxlmijpsaui"
    }
    {
        "ingest_type": "dataset",
        "ingest_request_source": "savepapernow",
        "base_url": "https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62",
        "release_stage": "published",
        "fatcat": {
            "release_ident": "7owybd2hrvdmdpm4zpo7hkn2pu",
            "work_ident": "3xkz7iffwbdfhbwhnd73iu66cu"
        },
        "ext_ids": {},
        "link_source": "spn",
        "link_source_id": "7owybd2hrvdmdpm4zpo7hkn2pu"
    }

    # paste and then Ctrl-D:
    cat | jq . -c > /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json


## Ingest Command

On `wbgrp-svc263`.

In the current version of the tool, `skip_cleanup_local_files=True` is the
default, so downloaded files will stick around on local disk.

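Since files stick around, the scratch directory can be cleared out by hand
between runs. A minimal sketch, assuming the `/tmp/sandcrawler/` working
directory that shows up in the tracebacks further below:

    # hypothetical manual cleanup of leftover downloaded dataset files
    rm -rf /tmp/sandcrawler/*
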
Note that `--no-spn2` is passed, so we are expecting a lot of `no-capture` in the output.


    # first a small sample
    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | head -n5 \
        | pv -l \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results.ramp.json

    # ok, run the whole batch through
    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | pv -l \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results.json

Got an error:

    internetarchive.exceptions.AuthenticationError: No access_key or secret_key set! Have you run `ia configure`?

Did a hot patch to try to have the uploads happen under a session, with config taken from environment variables, but that didn't work:

    AttributeError: 'ArchiveSession' object has no attribute 'upload'

Going to work around this with config in the homedir for now.
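
A minimal sketch of that workaround, using the `ia configure` command that the
error message itself suggests (it writes an `ia.ini` config that the python
`internetarchive` library reads; the exact path varies by library version):

    # one-time setup on this host: prompts for archive.org credentials and
    # writes them to the ia.ini config file in the homedir
    ia configure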

Extract URLs for crawling: base URLs for `no-capture` results with no manifest, and per-file terminal URLs from the manifests of partially-captured results:

    cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
        | rg '"no-capture"' \
        | rg -v '"manifest"' \
        | jq 'select(.status == "no-capture")' -c \
        | jq .request.base_url -r \
        | pv -l \
        > /srv/sandcrawler/tasks/dataset_seedlist.base_url.txt

    cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
        | rg '"no-capture"' \
        | rg '"manifest"' \
        | jq 'select(.status == "no-capture")' -c \
        | rg '"web-' \
        | jq .manifest[].terminal_url -r \
        | pv -l \
        > /srv/sandcrawler/tasks/dataset_seedlist.manifest_terminal.txt

### Exceptions Encountered

    File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 193, in process
        internetarchive.upload
    [...]
    ConnectionResetError: [Errno 104] Connection reset by peer
    urllib3.exceptions.ProtocolError
    requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), 'https://s3.us.archive.org/zenodo.org-3275525/rhOverM_Asymptotic_GeometricUnits_CoM.h5')


    Traceback (most recent call last):
      File "./ingest_tool.py", line 208, in <module>
        main()
      File "./ingest_tool.py", line 204, in main
        args.func(args)
      File "./ingest_tool.py", line 57, in run_requests
        result = fileset_worker.process(request)
      File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 375, in process
        archive_result = strategy_helper.process(dataset_meta)
      File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 130, in process
        r.raise_for_status()
      File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status  
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ndownloader.figshare.com/files/5474201

Downloads sometimes just slowly time out, even after a day or more.


    Traceback (most recent call last):
      File "./ingest_tool.py", line 208, in <module>
        main()
      File "./ingest_tool.py", line 204, in main
        args.func(args)
      File "./ingest_tool.py", line 57, in run_requests
        result = fileset_worker.process(request)
      File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 381, in process
        archive_result = strategy_helper.process(dataset_meta)
      File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 155, in process
        file_meta = gen_file_metadata_path(local_path, allow_empty=True)
      File "/srv/sandcrawler/src/python/sandcrawler/misc.py", line 89, in gen_file_metadata_path
        mimetype = magic.Magic(mime=True).from_file(path)
      File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/magic/__init__.py", line 111, in from_file
        with _real_open(filename):
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sandcrawler/figshare.com-7925396-v1/HG02070.dedup.realigned.recalibrated.hc.g.vcf.gz'


    Traceback (most recent call last):
      File "./ingest_tool.py", line 208, in <module>
        main()
      File "./ingest_tool.py", line 204, in main
        args.func(args)
      File "./ingest_tool.py", line 57, in run_requests
        result = fileset_worker.process(request)
      File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 314, in process
        dataset_meta = platform_helper.process_request(request, resource, html_biblio)
      File "/srv/sandcrawler/src/python/sandcrawler/fileset_platforms.py", line 208, in process_request
        obj_latest = obj["data"]["latestVersion"]
    KeyError: 'latestVersion'

Fixed the above, trying again:

    git log | head -n1
    # commit ffdc901fa067db55fe6cfeb8d0c3807d29df092c

    Wed Dec 15 21:57:42 UTC 2021

    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | shuf \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results4.json

Zenodo seems really slow, let's try filtering those out:

    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | rg -v 10.5281 \
        | shuf \
        | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results5.json
    # 3.76k 15:12:53 [68.7m/s]

    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | rg -v 10.5281 \
        | shuf \
        | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results6.json

## Fatcat Import

    wc -l ingest_dataset_combined_results*.json
         126 ingest_dataset_combined_results2.json
         153 ingest_dataset_combined_results3.json
         275 ingest_dataset_combined_results4.json
        3762 ingest_dataset_combined_results5.json
        7736 ingest_dataset_combined_results6.json
         182 ingest_dataset_combined_results.json
           5 ingest_dataset_combined_results.ramp.json
       12239 total

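    # dedupe to one result per release: jq emits "<release_ident>\t<json>"; sort plus
    # `uniq --check-chars 26` keep one line per 26-char release ident, and records whose
    # JSON contains backslash escapes (mangled by the TSV round-trip) are dropped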
    cat ingest_dataset_combined_results*.json \
        | rg '^\{' \
        | jq '[.request.fatcat.release_ident, . | tostring] | @tsv' -r \
        | sort \
        | uniq --check-chars 26 \
        | cut -f2 \
        | rg -v '\\\\' \
        | pv -l \
        > uniq_ingest_dataset_combined_results.json
    # 9.48k 0:00:06 [1.54k/s]

    cat uniq_ingest_dataset_combined_results.json | jq .status -r | sort | uniq -c | sort -nr
       7941 no-capture
        374 platform-404
        369 terminal-bad-status
        348 success-file
        172 success
         79 platform-scope
         77 error-platform-download
         47 empty-manifest
         27 platform-restricted
         20 too-many-files
         12 redirect-loop
          6 error-archiveorg-upload
          3 too-large-size
          3 mismatch
          1 no-platform-match

    cat uniq_ingest_dataset_combined_results.json \
        | rg '"success' \
        | jq 'select(.status == "success") | .' -c \
        > uniq_ingest_dataset_combined_results.success.json

    cat uniq_ingest_dataset_combined_results.json \
        | rg '"success' \
        | jq 'select(.status == "success-file") | .' -c \
        > uniq_ingest_dataset_combined_results.success-file.json

On fatcat QA instance:

    git log | head -n1
    # commit cca680e2cc4768a4d45e199f6256a433b25b4075

    head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
        | ./fatcat_import.py ingest-fileset-results -
    # Counter({'total': 10, 'skip': 10, 'skip-single-file': 10, 'insert': 0, 'update': 0, 'exists': 0})

    head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
        | ./fatcat_import.py ingest-file-results -
    # Counter({'total': 10, 'skip': 10, 'skip-ingest-type': 10, 'insert': 0, 'update': 0, 'exists': 0})

Need to update fatcat file worker to support single-file filesets... was that the plan?

    head /tmp/uniq_ingest_dataset_combined_results.success.json \
        | ./fatcat_import.py ingest-fileset-results -
    # Counter({'total': 10, 'skip': 10, 'skip-no-access-url': 10, 'insert': 0, 'update': 0, 'exists': 0})

    # Counter({'total': 10, 'insert': 10, 'skip': 0, 'update': 0, 'exists': 0})


## Summary

As a follow-up, it may be worth doing another manual round of ingest requests.
After that, it would be good to fill in "glue" code so that this can be done
with kafka workers, with re-tries/dumps driven by the sandcrawler SQL database.
Then we can start scaling up ingest: the ingest tool, "bulk mode" processing,
heritrix crawls from `no-capture` dumps, etc, similar to the bulk file ingest
process.

For scaling, let's do a "full" ingest request generation over all datasets, and
crawl the base URLs with heritrix in fast/direct mode. Expect this to be tens
of millions of URLs, mostly DOIs (doi.org URLs), which should crawl quickly.

Then do bulk downloading with the ingest worker (perhaps on misc-vm or aitio),
uploading large datasets to archive.org but not doing SPN web requests. Feed
the resulting huge file seedlist into a heritrix crawl to download the web files.

Will need to add support for more specific platforms.


### Huge Bulk Ingest Prep

On prod instance:

    ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query type:dataset \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_dataset_bulk.2022-01-05.json.gz
    # Expecting 11264787 release objects in search queries
    # TIMEOUT ERROR
    # 6.07M 19:13:02 [87.7 /s] (partial)

As a follow-up, should do a full batch (not a partial one). For now the search
index is too unreliable (read timeouts).

    zcat ingest_dataset_bulk.2022-01-05.partial.json.gz \
        | jq .base_url -r \
        | sort -u \
        | shuf \
        | awk '{print "F+ " $1}' \
        > ingest_dataset_bulk.2022-01-05.partial.schedule
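    # "F+ <URL>" is (as far as I recall) the force-fetch directive format for the heritrix action directory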

## Retries (2022-01-12)

This is after having done a bunch of crawling.

    cat ingest_dataset_combined_results6.json \
        | rg '"no-capture"' \
        | jq 'select(.status == "no-capture")' -c \
        | jq .request -c \
        | pv -l \
        > ingest_dataset_retry.json
    # 6.51k 0:00:01 [3.55k/s]

    cat /srv/sandcrawler/tasks/ingest_dataset_retry.json \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_retry_results.json