status: work-in-progress

NOTE: as of December 2022, the implementation of these features has not been
merged to the main branch. Development stalled in December 2021.

Trawling for Unstructured Scholarly Web Content
===============================================

## Background and Motivation

A long-term goal for sandcrawler has been the ability to pick through
unstructured web archive content (or even non-web collections), identify
potential in-scope research outputs, extract metadata for those outputs, and
merge the content into a catalog (fatcat).

This process requires integration of many existing tools (HTML and PDF
extraction; fuzzy bibliographic metadata matching; machine learning to identify
in-scope content; etc), as well as high-level curation, targeting, and
evaluation by human operators. The goal is to augment and improve the
productivity of human operators as much as possible.

This process will be similar to "ingest", which is where we start with a
specific URL and have some additional context about the expected result (eg,
content type, external identifier). Some differences with trawling are that we
start with a collection or context (instead of a single URL); have little or
no context about the content we are looking for; and may even be creating a new
catalog entry, as opposed to matching to a known existing entry.


## Architecture

The core operation is to take a resource and run a flowchart of processing
steps on it, resulting in an overall status and possible related metadata. The
common case is that the resource is a PDF or HTML coming from wayback (with
contextual metadata about the capture), but we should be flexible enough to
support more content types in the future, and should try to support plain
files with no context as well.

Some relatively simple wrapper code handles fetching resources and summarizing
status/counts.
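
A minimal sketch of that core loop, assuming injected fetch/process callables
(none of these names are existing sandcrawler APIs):

    from collections import Counter
    from typing import Any, Callable, Dict, Iterable

    def trawl_records(
        records: Iterable[Dict[str, Any]],
        fetch: Callable[[Dict[str, Any]], Dict[str, Any]],
        process: Callable[[Dict[str, Any]], Dict[str, Any]],
    ) -> Dict[str, int]:
        """Run the fetch/process flow over CDX-like records and tally statuses."""
        counts: Counter = Counter()
        for record in records:
            # pre-fetch filtering (eg, mimetype or SURT-prefix checks) would go here
            if record.get("mimetype") not in ("application/pdf", "text/html"):
                counts["skip-cdx"] += 1
                continue
            try:
                resource = fetch(record)  # eg, a wayback fetch, supplied by the caller
            except Exception:
                counts["wayback-error"] += 1
                continue
            # eg, GROBID/pdf_meta for PDFs, html_biblio extraction for HTML
            result = process(resource)
            counts[result.get("status", "success")] += 1
        return dict(counts)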

Outside of the scope of sandcrawler, new fatcat code (importer or similar) will
be needed to handle trawl results. It will probably make sense to pre-filter
(with `jq` or `rg`) before passing results to fatcat.

At this stage, trawl workers will probably be run manually. Some successful
outputs (like GROBID, HTML metadata) would be written to existing kafka topics
to be persisted, but there would not be any specific `trawl` SQL tables or
automation.

It will probably be helpful to have some kind of wrapper script that can run
sandcrawler trawl processes, then filter and pipe the output into a fatcat
importer, all from a single invocation, while reporting results.
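
For the pre-filter step specifically, a rough sketch of what might sit between
trawl output (JSON lines) and a fatcat importer, assuming the result schema
below (the importer interface itself is out of scope here):

    import json
    import sys

    def filter_hits(in_file, out_file):
        """Pass through only records that look like in-scope, successfully processed hits."""
        for line in in_file:
            record = json.loads(line)
            if record.get("hit") and record.get("status") == "success":
                out_file.write(json.dumps(record) + "\n")

    if __name__ == "__main__":
        # eg: <trawl output> | python filter_hits.py > hits.json
        filter_hits(sys.stdin, sys.stdout)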

TODO:
- for HTML imports, do we fetch the full webcapture stuff and return that?


## Methods of Operation

### `cdx_file`

An existing CDX file is provided on-disk locally.

### `cdx_api`

Simplified variants: `cdx_domain`, `cdx_surt`

Uses the CDX API to download records matching the configured filters, then processes the resulting file.

Saves the CDX file intermediate result somewhere locally (working or tmp
directory), with timestamp in the path, to make re-trying with `cdx_file` fast
and easy.
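
A sketch of what the `cdx_domain` variant might look like against the public
wayback CDX API (the filter choices here are illustrative; in practice they
would come from the request's `resource_filters`):

    import datetime
    import requests

    def fetch_domain_cdx(domain: str, workdir: str = "/tmp") -> str:
        """Download CDX rows for a domain and save them to a timestamped local file."""
        params = {
            "url": domain,
            "matchType": "domain",
            "filter": "mimetype:application/pdf",  # illustrative pre-filter
            "limit": "100000",
        }
        resp = requests.get(
            "https://web.archive.org/cdx/search/cdx", params=params, timeout=120
        )
        resp.raise_for_status()
        timestamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
        path = f"{workdir}/trawl-{domain}-{timestamp}.cdx"
        with open(path, "w") as f:
            f.write(resp.text)
        return path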


### `archiveorg_web_collection`

Uses `cdx_collection.py` (or similar) to fetch a full CDX list by iterating
over the items in the collection, then processes it.

Saves the CDX file intermediate result somewhere locally (working or tmp
directory), with timestamp in the path, to make re-trying with `cdx_file` fast
and easy.
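
A hedged sketch of assembling that CDX list with the `internetarchive`
library, assuming the collection's web items carry `.cdx.gz` files alongside
their WARCs (true for typical crawl items, but worth verifying per collection):

    import os
    from internetarchive import get_item, search_items

    def fetch_collection_cdx(collection: str, workdir: str) -> list:
        """Download the .cdx.gz files from every item in an archive.org web collection."""
        paths = []
        for result in search_items(f"collection:{collection}"):
            item = get_item(result["identifier"])
            for f in item.files:
                if f["name"].endswith(".cdx.gz"):
                    dest = os.path.join(workdir, f["name"])
                    item.get_file(f["name"]).download(file_path=dest)
                    paths.append(dest)
        return paths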

### Others

- `archiveorg_file_collection`: fetch file list via archive.org metadata, then process each file

## Schema

Per-resource results:

    hit (bool)
        indicates whether resource seems in scope and was processed successfully
        (roughly, status 'success' and an in-scope content_scope)
    status (str)
        success: fetched resource, ran processing, passed filters
        skip-cdx: filtered before even fetching resource
        skip-resource: filtered after fetching resource
        wayback-error (etc): problem fetching
    content_scope (str)
        filtered-{filtertype}
        article (etc)
        landing-page
    resource_type (str)
        pdf, html
    file_meta{}
    cdx{}
    revisit_cdx{}

    # below are resource_type specific
    grobid
    pdf_meta
    pdf_trio
    html_biblio
    (other heuristics and ML)
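
As a non-authoritative sketch, the per-resource result could be typed roughly as:

    from typing import Any, Dict, Optional, TypedDict

    class TrawlResourceResult(TypedDict, total=False):
        hit: bool
        status: str         # success, skip-cdx, skip-resource, wayback-error, ...
        content_scope: str  # filtered-{filtertype}, article, landing-page, ...
        resource_type: str  # pdf, html
        file_meta: Dict[str, Any]
        cdx: Dict[str, Any]
        revisit_cdx: Dict[str, Any]
        # resource_type-specific fields
        grobid: Optional[Dict[str, Any]]
        pdf_meta: Optional[Dict[str, Any]]
        pdf_trio: Optional[Dict[str, Any]]
        html_biblio: Optional[Dict[str, Any]]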

High-level request:

    trawl_method: str
    cdx_file_path
    default_filters: bool
    resource_filters[]
        scope: str
            surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status
        value: any
        values[]: any
        min: any
        max: any
    biblio_context{}: set of expected/default values
        container_id
        release_type
        release_stage
        url_rel

High-level summary / results:

    status
    request{}: the entire request object
    counts
        total_resources
        status{}
        content_scope{}
        resource_type{}
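
And the request/summary side, again only a sketch of the fields listed above:

    from dataclasses import dataclass, field
    from typing import Any, Dict, List, Optional

    @dataclass
    class ResourceFilter:
        # scope: surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status
        scope: str
        value: Optional[Any] = None
        values: Optional[List[Any]] = None
        min: Optional[Any] = None
        max: Optional[Any] = None

    @dataclass
    class TrawlRequest:
        trawl_method: str  # eg, cdx_file, cdx_api, archiveorg_web_collection
        cdx_file_path: Optional[str] = None
        default_filters: bool = True
        resource_filters: List[ResourceFilter] = field(default_factory=list)
        # expected/default values: container_id, release_type, release_stage, url_rel
        biblio_context: Dict[str, Any] = field(default_factory=dict)

    @dataclass
    class TrawlSummary:
        status: str
        request: TrawlRequest
        # total_resources, plus per-status / content_scope / resource_type counts
        counts: Dict[str, Any] = field(default_factory=dict)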

## Example Corpuses

All PDFs (`application/pdf`) in web.archive.org from before the year 2000.
Starting point would be a CDX list.

Spidering crawls starting from a set of OA journal homepage URLs.

Archive-It partner collections from research universities, particularly of
their own .edu domains. Starting point would be an archive.org collection, from
which WARC files or CDX lists can be accessed.

General archive.org PDF collections, such as
[ERIC](https://archive.org/details/ericarchive) or
[Document Cloud](https://archive.org/details/documentcloud).

Specific journal or publisher URL patterns. Starting point could be a domain,
hostname, SURT prefix, and/or URL regex.

Heuristic patterns over full web.archive.org CDX index. For example, .edu
domains with user directories and a `.pdf` in the file path ("tilde" username
pattern).
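
As a concrete (hypothetical) illustration of that last heuristic, applied to
SURT-keyed CDX lines already on disk:

    import re

    # eg matches: "edu,stanford,cs)/~jdoe/pubs/paper.pdf 20041015... application/pdf ..."
    TILDE_PDF = re.compile(r"^edu,[^ )]*\)/~[^/ ]+/[^ ]*\.pdf")

    def edu_tilde_pdf_lines(cdx_lines):
        """Yield CDX lines matching the .edu user-directory ("tilde") PDF heuristic."""
        for line in cdx_lines:
            if TILDE_PDF.match(line):
                yield line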

Random samples of entire Wayback corpus. For example, random samples filtered
by date, content type, TLD, etc. This would be true "trawling" over the entire
corpus.


## Other Ideas

Could have a web archive spidering mode: starting from a seed, fetch multiple
captures, then extract outlinks from those, up to some number of hops. An
example application would be following links to research group webpages or
author homepages, and trying to extract PDF links from CVs, etc.
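
A rough shape for that spidering idea, ignoring wayback specifics, politeness,
and deduplication beyond a simple seen-set (`requests` over live HTTP is used
here just for illustration; a real version would fetch wayback captures):

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    import requests

    class LinkExtractor(HTMLParser):
        """Collect absolute outlink URLs from <a href=...> tags."""

        def __init__(self, base_url: str):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def spider(seed_urls, max_hops: int = 2):
        """Hop-limited outlink walk; PDFs found along the way are candidate hits."""
        seen = set(seed_urls)
        frontier = list(seed_urls)
        pdf_candidates = []
        for _hop in range(max_hops):
            next_frontier = []
            for url in frontier:
                try:
                    resp = requests.get(url, timeout=30)
                except requests.RequestException:
                    continue
                if "pdf" in resp.headers.get("content-type", ""):
                    pdf_candidates.append(url)
                    continue
                parser = LinkExtractor(url)
                parser.feed(resp.text)
                for link in parser.links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
        return pdf_candidates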