                 |
                 _
                | |
  __,   ,_    __, | |   _   ,   __,           _
 /  |  /  |  /  | |/ \_|/  / \_/  |  |   |  |/
 \_/|_/   |_/\_/|_/\_/ |__/ \/ \_/|_/ \_/|_/|__/
                                  |\
                                  |/
A simple python3 script to summarize Heritrix3 web crawl logs for a particular
style of crawl: fetching large numbers of files associated with a persistent
identifier. For example, crawling tens of millions of Open Access PDFs (via
direct link or landing page URL) associated with a DOI.

Output is a (large) sqlite3 database file. Combine with `sqlite-notebook` to
generate HTML reports:

    https://github.com/bnewbold/sqlite-notebook
The simplest usage is to specify a seed-url/identifier mapping, a crawl log,
and an output database file name:

    ./arabesque.py everything examples/crawl.log examples/seed_doi.tsv output.sqlite3

Then generate an HTML report:

    sqlite-notebook.py examples/report_template.md output.sqlite3 > report.html
The core feature of this script is to resolve HTTP redirect chains. In
"backward" mode, all terminal responses (HTTP 200) that are in-scope (by
mimetype) are resolved back to their original seed URL. There may be multiple
in-scope terminal responses per seed (e.g., via embeds or other URL-extraction
beans). In "forward" mode, redirects are resolved to a single terminal
response (if there is one), which may carry a 4xx, 5xx, or other failure
status code.
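The forward-resolution step can be sketched as follows. This is a simplified,
hypothetical illustration (the `redirects` map and the `resolve_forward` name
are assumptions for this sketch, not arabesque's actual internals), assuming
redirect hops have already been extracted from the crawl log into a
source-to-target URL map:

```python
def resolve_forward(seed_url, redirects, max_hops=20):
    """Follow redirect hops from a seed URL to its terminal URL.

    `redirects` is a dict mapping a redirecting URL to its target.
    Returns (final_url, breadcrumbs), where breadcrumbs records one 'R'
    per redirect hop; gives up on loops or chains longer than max_hops.
    """
    url = seed_url
    breadcrumbs = []
    seen = {url}
    while url in redirects and len(breadcrumbs) < max_hops:
        nxt = redirects[url]
        if nxt in seen:  # redirect loop; stop here
            break
        breadcrumbs.append("R")
        seen.add(nxt)
        url = nxt
    return url, "".join(breadcrumbs)
```

A seed that redirects twice (e.g. DOI -> landing page -> PDF) would resolve to
the PDF URL with breadcrumbs "RR"; a URL with no recorded redirect resolves to
itself with empty breadcrumbs.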
The result is a single summary table with the following SQL schema:

    CREATE TABLE IF NOT EXISTS crawl_result
        (initial_url text NOT NULL,
         identifier text,
         initial_domain text,
         breadcrumbs text,
         final_url text,
         final_domain text,
         final_timestamp text,
         final_status_code text,
         final_sha1 text,
         final_mimetype text,
         final_was_dedupe bool,
         hit bool);
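As an illustration of working with this table, the sketch below loads the
schema into an in-memory sqlite3 database and computes a seed-level hit count;
the two inserted rows are invented for the example:

```python
import sqlite3

# The crawl_result schema from above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_result
    (initial_url text NOT NULL,
     identifier text,
     initial_domain text,
     breadcrumbs text,
     final_url text,
     final_domain text,
     final_timestamp text,
     final_status_code text,
     final_sha1 text,
     final_mimetype text,
     final_was_dedupe bool,
     hit bool);
"""

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)

# Two made-up rows: one successful PDF fetch, one 404 after a redirect.
rows = [
    ("http://doi.org/10.123/abc", "10.123/abc", "doi.org", "RR",
     "https://journal.example/paper.pdf", "journal.example",
     "20200101120000", "200", "da39a3ee5e6b4b0d3255bfef95601890afd80709",
     "application/pdf", 0, 1),
    ("http://doi.org/10.123/xyz", "10.123/xyz", "doi.org", "R",
     "https://journal.example/missing", "journal.example",
     "20200101120100", "404", None, "text/html", 0, 0),
]
db.executemany(
    "INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", rows)

# Seed-level hit rate: hits out of total seeds.
hits, total = db.execute(
    "SELECT SUM(hit), COUNT(*) FROM crawl_result").fetchone()
```

The same SELECT is the kind of aggregate an HTML report template would run
over the output database.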
There aren't many tests, but those that exist can be run with:

    pytest-3 arabesque.py