This crawl report is auto-generated from a SQLite database file, which should be included alongside it.
identifiers  uris  domains
480          583   163
QUERY: SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT initial_domain) AS domains FROM crawl_result;
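Any query in this report can be re-run against the database. The following is a minimal sketch using Python's built-in sqlite3 module. The `run_query` and `demo_connection` helpers and the file name `crawl.sqlite` are hypothetical; only the `crawl_result` table and its column names come from the queries in this report. The demo rows are made up so the sketch runs without the real database.

```python
import sqlite3

# Summary query, copied from the report (only reformatted).
SUMMARY_QUERY = """
SELECT COUNT(DISTINCT identifier) AS identifiers,
       COUNT(DISTINCT initial_url) AS uris,
       COUNT(DISTINCT initial_domain) AS domains
FROM crawl_result;
"""


def run_query(conn, sql):
    """Run one report query and return (column_names, rows)."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


def demo_connection():
    """In-memory stand-in with made-up rows (not data from this crawl)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crawl_result "
        "(identifier TEXT, initial_url TEXT, initial_domain TEXT)"
    )
    conn.executemany(
        "INSERT INTO crawl_result VALUES (?, ?, ?)",
        [
            ("id1", "http://a.example/1.pdf", "a.example"),
            ("id1", "http://a.example/2.pdf", "a.example"),
            ("id2", "http://b.example/1.pdf", "b.example"),
        ],
    )
    return conn


if __name__ == "__main__":
    # For the real data, use sqlite3.connect("crawl.sqlite") instead
    # (hypothetical path; substitute the file shipped with the report).
    conn = demo_connection()
    columns, rows = run_query(conn, SUMMARY_QUERY)
    print(columns, rows)
```

Passing the SQL text unchanged keeps the output comparable with the figures printed in this report.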
FTP seed URLs
ftp_urls
0
QUERY: SELECT COUNT(*) as ftp_urls FROM crawl_result WHERE initial_url LIKE 'ftp://%';
identifiers  uris  unique_sha1
63           166   166
QUERY: SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT final_sha1) as unique_sha1 FROM crawl_result WHERE hit=1;
De-duplication percentage (i.e., the percentage of hits whose content had been crawled and identified previously):
percent
47.59036144578313
QUERY: SELECT 100. * AVG(final_was_dedupe) as percent FROM crawl_result WHERE hit=1;
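As a worked check of the figure above: if `final_was_dedupe` is a 0/1 flag (an assumption, inferred from the use of AVG), then 100 * AVG is the percentage of hit rows with the flag set. With the 166 hit rows reported elsewhere in this report, 47.59...% corresponds to 79 de-duplicated hits, since 79 / 166 ≈ 0.4759. The count of 79 is inferred from the percentage, not read from the database. The sketch below rebuilds that ratio in an in-memory table; `dedupe_percent` is a hypothetical helper.

```python
import sqlite3


def dedupe_percent(flags):
    """Apply the report's AVG query to a list of 0/1 dedupe flags."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crawl_result (hit INTEGER, final_was_dedupe INTEGER)"
    )
    conn.executemany(
        "INSERT INTO crawl_result VALUES (1, ?)", [(f,) for f in flags]
    )
    (percent,) = conn.execute(
        "SELECT 100. * AVG(final_was_dedupe) AS percent "
        "FROM crawl_result WHERE hit=1"
    ).fetchone()
    conn.close()
    return percent


# 79 deduplicated hits out of 166 (79 is inferred, see the note above).
flags = [1] * 79 + [0] * 87
percent = dedupe_percent(flags)
print(percent)
```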
Top mimetypes for successful hits (these are usually filtered to a fixed list in post-processing):
final_mimetype            COUNT(*)
application/pdf           161
application/octet-stream  5
QUERY: SELECT final_mimetype, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_mimetype ORDER BY COUNT(*) DESC LIMIT 10;
Most popular breadcrumbs (a measure of how hard the crawler had to work):
breadcrumbs  COUNT(*)
-            125
R            39
L            2
QUERY: SELECT breadcrumbs, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY breadcrumbs ORDER BY COUNT(*) DESC LIMIT 10;
FTP vs. HTTP hits (200 is HTTP, 226 is FTP):
final_status_code  COUNT(*)
200                166
QUERY: SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_status_code LIMIT 10;
Top initial domains:
initial_domain           COUNT(*)  percent
www.nature.com           22        3.7735849056603774
www.medicaljournals.se   21        3.6020583190394513
ajpgi.physiology.org     14        2.4013722126929675
jn.physiology.org        12        2.058319039451115
naukaru.ru               12        2.058319039451115
www.physiology.org       12        2.058319039451115
web.mit.edu              11        1.8867924528301887
www.nada.kth.se          11        1.8867924528301887
medicaljournals.se       10        1.7152658662092624
www.jstage.jst.go.jp     10        1.7152658662092624
www.site.uottawa.ca      10        1.7152658662092624
www.tandfonline.com      10        1.7152658662092624
academic.oup.com         9         1.5437392795883362
iopscience.iop.org       9         1.5437392795883362
www.amjbot.org           9         1.5437392795883362
www.efmaefm.org          9         1.5437392795883362
ajpcell.physiology.org   8         1.3722126929674099
ajpheart.physiology.org  8         1.3722126929674099
content.iospress.com     8         1.3722126929674099
link.springer.com        8         1.3722126929674099
QUERY: SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result) as percent FROM crawl_result GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
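The percent column divides each domain's row count by the total number of rows in `crawl_result`, obtained through a scalar subquery. The values above are consistent with 583 rows in total, for example 22 / 583 ≈ 3.77%. Below is a small sketch of the same pattern on made-up rows. The `top_domains` helper and the demo domains are hypothetical, and the extra `initial_domain` tie-breaker in ORDER BY is an addition for deterministic output, not part of the report's query.

```python
import sqlite3

TOP_DOMAINS = """
SELECT initial_domain, COUNT(*),
       100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result) AS percent
FROM crawl_result
GROUP BY initial_domain
ORDER BY COUNT(*) DESC, initial_domain  -- tie-breaker added for this sketch
LIMIT 20;
"""


def top_domains(domains):
    """Load one row per domain occurrence and run the share-of-total query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crawl_result (initial_domain TEXT)")
    conn.executemany(
        "INSERT INTO crawl_result VALUES (?)", [(d,) for d in domains]
    )
    rows = conn.execute(TOP_DOMAINS).fetchall()
    conn.close()
    return rows


# Made-up sample: three rows for one domain, one row for another.
sample_rows = top_domains(["a.example", "a.example", "b.example", "a.example"])
print(sample_rows)

# Arithmetic check against the first row of the table above.
nature_share = 100 * 22 / 583
print(nature_share)
```

The leading `100.` (a float literal) matters: without it, SQLite would perform integer division and every share below 100% would come out as 0.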
Top successful domains, where hits were found (grouped by initial_domain, per the query below):
initial_domain              COUNT(*)  percent
www.physiology.org          12        7.228915662650603
www.jstage.jst.go.jp        10        6.024096385542169
content.iospress.com        8         4.819277108433735
digital.library.unt.edu     7         4.216867469879518
files.eccomasproceedia.org  7         4.216867469879518
link.springer.com           7         4.216867469879518
www.scielo.br               7         4.216867469879518
www.termedia.pl             7         4.216867469879518
ijpsr.com                   6         3.6144578313253013
uvadoc.uva.es               6         3.6144578313253013
www.jafs.com.pl             6         3.6144578313253013
hal.archives-ouvertes.fr    5         3.0120481927710845
iopscience.iop.org          5         3.0120481927710845
www.cambridge.org           5         3.0120481927710845
digitool.library.mcgill.ca  4         2.4096385542168677
www.ejgm.co.uk              4         2.4096385542168677
www.pnas.org                4         2.4096385542168677
aaltodoc.aalto.fi           3         1.8072289156626506
citeseerx.ist.psu.edu       3         1.8072289156626506
digital.csic.es             3         1.8072289156626506
QUERY: SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result WHERE hit=1) AS percent FROM crawl_result WHERE hit=1 GROUP BY initial_domain ORDER BY COUNT(*) DESC LIMIT 20;
Top non-successful, final domains where crawl paths terminated before a successful hit (but crawl did run):
final_domain                 COUNT(*)
www.medicaljournals.se       21
www.nature.com               21
ajpgi.physiology.org         14
jn.physiology.org            12
naukaru.ru                   12
web.mit.edu                  11
www.nada.kth.se              11
medicaljournals.se           10
www.site.uottawa.ca          10
www.tandfonline.com          10
academic.oup.com             9
www.amjbot.org               9
www.efmaefm.org              9
ajpcell.physiology.org       8
ajpheart.physiology.org      8
pdfs.journals.lww.com        8
www.osti.gov                 8
ajpregu.physiology.org       7
pubs.rsna.org                7
download.atlantis-press.com  6
QUERY: SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NOT NULL GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
Top uncrawled, initial domains, where the crawl didn't even attempt to run:
initial_domain  COUNT(*)
QUERY: SELECT initial_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NULL GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
Top blocked, final domains:
final_domain                COUNT(*)
140.115.82.191              1
classes.maxwell.syr.edu     1
drona.csa.iisc.ernet.in     1
lamar.colostate.edu         1
linux46.ma.utexas.edu       1
mathro.fpms.ac.be           1
pdl.cmu.edu                 1
sammelpunkt.philo.at        1
suma.ldc.usb.ve             1
virtualmentor.ama-assn.org  1
www.cais.ntu.edu.sg         1
www.cse.ucla.edu            1
www.ece.stevens-tech.edu    1
www.lance.colostate.edu     1
www2.asanet.org             1
QUERY: SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND (final_status_code='-61' OR final_status_code='-2') GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
Top rate-limited, final domains:
final_domain                 COUNT(*)
www.researchgate.net         6
openknowledge.worldbank.org  1
QUERY: SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code='429' GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;
Top failure status codes:
final_status_code  COUNT(*)
404                112
301                85
403                61
302                60
-6                 36
303                21
-2                 15
429                7
503                7
200                5
QUERY: SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=0 GROUP BY final_status_code ORDER BY count(*) DESC LIMIT 10;
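The failure sections above partition non-hit rows by `final_status_code`: NULL means the crawl never ran, '-61' or '-2' is counted as blocked, and '429' as rate-limited. The sketch below restates those WHERE clauses as one function, which may help when reading individual rows. The `classify` function and its category labels are hypothetical names chosen to follow the section headings; this is not code from the report generator.

```python
from collections import Counter


def classify(hit, final_status_code):
    """Bucket one crawl_result row using the report's WHERE conditions."""
    if hit:
        return "hit"
    if final_status_code is None:
        return "uncrawled"  # hit=0 AND final_status_code IS NULL
    code = str(final_status_code)  # the queries compare codes as strings
    if code in ("-61", "-2"):
        return "blocked"
    if code == "429":
        return "rate-limited"
    return "other failure"


# Made-up rows as (hit, final_status_code); not data from this crawl.
sample = [(1, "200"), (0, "404"), (0, "-2"), (0, "429"), (0, None), (0, "200")]
buckets = Counter(classify(hit, code) for hit, code in sample)
print(buckets)
```

Note that a 200 status does not by itself imply a hit: the failure table above lists five non-hit rows with status 200, which is why `hit` is checked first.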
A handful of random success lines:
QUERY: SELECT identifier, initial_url, breadcrumbs, final_url, final_sha1, final_mimetype FROM crawl_result WHERE hit=1 ORDER BY random() LIMIT 10;
A handful of random non-success lines:
QUERY: SELECT identifier, initial_url, breadcrumbs, final_url, final_status_code, final_mimetype FROM crawl_result WHERE hit=0 ORDER BY random() LIMIT 25;