Crawl QA Report

This crawl report is auto-generated from a SQLite database file, which should be included alongside it.

Seedlist Stats

identifiers uris domains
480 583 163
QUERY: SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT initial_domain) AS domains FROM crawl_result;
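
Each QUERY line in this report can be re-run against the database with any SQLite client. The following is a minimal Python sketch of doing so. The in-memory database, its three rows, and the helper name run_report_query are made-up stand-ins for illustration; only the table name, column names, and query text come from the report.

```python
import sqlite3

# Hypothetical stand-in for the real database file; only the columns
# used by the seedlist query are created here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawl_result (identifier TEXT, initial_url TEXT, initial_domain TEXT)"
)
conn.executemany(
    "INSERT INTO crawl_result VALUES (?, ?, ?)",
    [
        ("id-1", "http://a.example/1.pdf", "a.example"),
        ("id-1", "http://b.example/1.pdf", "b.example"),
        (None, "http://a.example/2.pdf", "a.example"),  # row without an identifier
    ],
)

# Query text copied from the report.
SEEDLIST_QUERY = (
    "SELECT COUNT(DISTINCT identifier) as identifiers, "
    "COUNT(DISTINCT initial_url) as uris, "
    "COUNT(DISTINCT initial_domain) AS domains FROM crawl_result;"
)


def run_report_query(conn, sql):
    """Return (column_names, rows) for one report query."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


columns, rows = run_report_query(conn, SEEDLIST_QUERY)
print(" ".join(columns))
for row in rows:
    print(" ".join(str(v) for v in row))
```

COUNT(DISTINCT ...) skips NULL values and collapses repeats, so the identifiers figure can be lower than the uris figure when some rows have no identifier or several URLs share one.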

FTP seed URLs

ftp_urls
0
QUERY: SELECT COUNT(*) as ftp_urls FROM crawl_result WHERE initial_url LIKE 'ftp://%';

Successful Hits

identifiers uris unique_sha1
63 166 166
QUERY: SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT final_sha1) as unique_sha1 FROM crawl_result WHERE hit=1;

De-duplication percentage (i.e., the fraction of hits whose content had already been crawled and identified previously):

percent
47.59036144578313
QUERY: SELECT 100. * AVG(final_was_dedupe) as percent FROM crawl_result WHERE hit=1;
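
The percentage works because averaging a 0/1 flag gives the fraction of rows where the flag is set. The sketch below checks that on a tiny made-up table, assuming final_was_dedupe is stored as 0 or 1 (the report does not state the column type), and then works back from the reported figure to the implied number of de-duplicated hits among the 166 hits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl_result (hit INTEGER, final_was_dedupe INTEGER)")
# Hypothetical rows: 4 hits, 1 of them de-duplicated, plus 1 non-hit
# that the WHERE clause must ignore.
conn.executemany(
    "INSERT INTO crawl_result VALUES (?, ?)",
    [(1, 1), (1, 0), (1, 0), (1, 0), (0, 1)],
)

(percent,) = conn.execute(
    "SELECT 100. * AVG(final_was_dedupe) as percent FROM crawl_result WHERE hit=1;"
).fetchone()
print(percent)

# Working back from the reported figure: 47.59...% of 166 hits.
REPORTED_PERCENT = 47.59036144578313
TOTAL_HITS = 166
implied_dedupe_hits = round(REPORTED_PERCENT / 100 * TOTAL_HITS)
print(implied_dedupe_hits)
```

Under that assumption the reported percentage corresponds to 79 of the 166 hits.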

Top mimetypes for successful hits (these are usually filtered to a fixed list in post-processing):

final_mimetype COUNT(*)
application/pdf 161
application/octet-stream 5
QUERY: SELECT final_mimetype, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_mimetype ORDER BY COUNT(*) DESC LIMIT 10;

Most popular breadcrumbs (a measure of how hard the crawler had to work):

breadcrumbs COUNT(*)
- 125
R 39
L 2
QUERY: SELECT breadcrumbs, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY breadcrumbs ORDER BY COUNT(*) DESC LIMIT 10;
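
The breadcrumb strings look like Heritrix-style hop paths, where each letter is one hop (R for a redirect, L for a link) and "-" means the seed URL itself was fetched directly. The report does not define them, so treat that reading as an assumption. Under it, the table above can be summarized by hop count:

```python
from collections import Counter

# Assumed meaning of the hop letters; not defined by the report itself.
HOP_NAMES = {"R": "redirect", "L": "link"}


def hop_count(breadcrumbs):
    """Number of hops in a breadcrumb string; '-' is taken to mean none."""
    return 0 if breadcrumbs == "-" else len(breadcrumbs)


# Counts copied from the breadcrumbs table above.
breadcrumb_counts = {"-": 125, "R": 39, "L": 2}

hits_by_hops = Counter()
for crumbs, n in breadcrumb_counts.items():
    hits_by_hops[hop_count(crumbs)] += n

direct_share = 100.0 * hits_by_hops[0] / sum(hits_by_hops.values())
print(dict(hits_by_hops), round(direct_share, 1))
```

Read this way, 125 of the 166 hits (about 75%) needed no extra hop, and the remaining 41 needed exactly one.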

FTP vs. HTTP hits (200 is HTTP, 226 is FTP):

final_status_code COUNT(*)
200 166
QUERY: SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_status_code LIMIT 10;
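
If a protocol label is preferred over the raw status code, the same grouping can be written with a CASE expression. This is only a sketch on made-up rows; it assumes final_status_code is stored as text (other queries in this report compare it to quoted strings), and the labels 'http', 'ftp', and 'other' are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl_result (hit INTEGER, final_status_code TEXT)")
# Hypothetical rows: two HTTP hits, one FTP hit, one non-hit.
conn.executemany(
    "INSERT INTO crawl_result VALUES (?, ?)",
    [(1, "200"), (1, "200"), (1, "226"), (0, "404")],
)

protocol_rows = conn.execute(
    """
    SELECT CASE final_status_code
             WHEN '200' THEN 'http'
             WHEN '226' THEN 'ftp'
             ELSE 'other'
           END AS protocol,
           COUNT(*)
    FROM crawl_result
    WHERE hit=1
    GROUP BY protocol
    ORDER BY protocol;
    """
).fetchall()
print(protocol_rows)
```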

Domain Summary

Top initial domains:

initial_domain COUNT(*) percent
www.nature.com 22 3.7735849056603774
www.medicaljournals.se 21 3.6020583190394513
ajpgi.physiology.org 14 2.4013722126929675
jn.physiology.org 12 2.058319039451115
naukaru.ru 12 2.058319039451115
www.physiology.org 12 2.058319039451115
web.mit.edu 11 1.8867924528301887
www.nada.kth.se 11 1.8867924528301887
medicaljournals.se 10 1.7152658662092624
www.jstage.jst.go.jp 10 1.7152658662092624
www.site.uottawa.ca 10 1.7152658662092624
www.tandfonline.com 10 1.7152658662092624
academic.oup.com 9 1.5437392795883362
iopscience.iop.org 9 1.5437392795883362
www.amjbot.org 9 1.5437392795883362
www.efmaefm.org 9 1.5437392795883362
ajpcell.physiology.org 8 1.3722126929674099
ajpheart.physiology.org 8 1.3722126929674099
content.iospress.com 8 1.3722126929674099
link.springer.com 8 1.3722126929674099
QUERY: SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result) as percent FROM crawl_result GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
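
The percent column is each domain's row count divided by the total number of rows in crawl_result. The total is not printed in this section; the sketch below assumes it is 583, which is the value implied by 22 rows being 3.7735...%, and uses it to cross-check the table and to compute how much of the seedlist the top 20 domains cover.

```python
# Counts copied from the "Top initial domains" table above.
top_domain_counts = [22, 21, 14, 12, 12, 12, 11, 11, 10, 10,
                     10, 10, 9, 9, 9, 9, 8, 8, 8, 8]

# Assumption: total row count inferred from the percent column.
TOTAL_ROWS = 583


def percent_of_total(count, total=TOTAL_ROWS):
    """Same arithmetic as the query: 100. * COUNT(*) / total rows."""
    return 100.0 * count / total


top20_rows = sum(top_domain_counts)
top20_share = percent_of_total(top20_rows)
print(top20_rows, round(top20_share, 2))
```

With that total, the 20 listed domains account for 223 rows, a little over 38% of the seedlist, so the crawl is spread across a long tail of domains.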

Top successful domains, where hits were found (grouped by initial domain, per the query below):

initial_domain COUNT(*) percent
www.physiology.org 12 7.228915662650603
www.jstage.jst.go.jp 10 6.024096385542169
content.iospress.com 8 4.819277108433735
digital.library.unt.edu 7 4.216867469879518
files.eccomasproceedia.org 7 4.216867469879518
link.springer.com 7 4.216867469879518
www.scielo.br 7 4.216867469879518
www.termedia.pl 7 4.216867469879518
ijpsr.com 6 3.6144578313253013
uvadoc.uva.es 6 3.6144578313253013
www.jafs.com.pl 6 3.6144578313253013
hal.archives-ouvertes.fr 5 3.0120481927710845
iopscience.iop.org 5 3.0120481927710845
www.cambridge.org 5 3.0120481927710845
digitool.library.mcgill.ca 4 2.4096385542168677
www.ejgm.co.uk 4 2.4096385542168677
www.pnas.org 4 2.4096385542168677
aaltodoc.aalto.fi 3 1.8072289156626506
citeseerx.ist.psu.edu 3 1.8072289156626506
digital.csic.es 3 1.8072289156626506
QUERY: SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result WHERE hit=1) AS percent  FROM crawl_result WHERE hit=1 GROUP BY initial_domain ORDER BY COUNT(*) DESC LIMIT 20;

Top non-successful final domains, where crawl paths terminated before a successful hit (but the crawl did run):

final_domain COUNT(*)
www.medicaljournals.se 21
www.nature.com 21
ajpgi.physiology.org 14
jn.physiology.org 12
naukaru.ru 12
web.mit.edu 11
www.nada.kth.se 11
medicaljournals.se 10
www.site.uottawa.ca 10
www.tandfonline.com 10
academic.oup.com 9
www.amjbot.org 9
www.efmaefm.org 9
ajpcell.physiology.org 8
ajpheart.physiology.org 8
pdfs.journals.lww.com 8
www.osti.gov 8
ajpregu.physiology.org 7
pubs.rsna.org 7
download.atlantis-press.com 6
QUERY: SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NOT NULL GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;

Top uncrawled, initial domains, where the crawl didn't even attempt to run:

initial_domain COUNT(*)
QUERY: SELECT initial_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NULL GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20;
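
The two non-success queries above split rows with hit=0 by whether final_status_code is NULL: a non-NULL code means the crawl ran and ended somewhere, while NULL means nothing was fetched. A sketch of the same three-way split in one query, on made-up rows (the outcome labels are illustrative, not part of the report):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawl_result (initial_domain TEXT, hit INTEGER, final_status_code TEXT)"
)
# Hypothetical rows: one hit, two crawled failures, one never crawled.
conn.executemany(
    "INSERT INTO crawl_result VALUES (?, ?, ?)",
    [
        ("a.example", 1, "200"),
        ("b.example", 0, "404"),
        ("b.example", 0, "403"),
        ("c.example", 0, None),
    ],
)

outcome_rows = conn.execute(
    """
    SELECT CASE
             WHEN hit=1 THEN 'hit'
             WHEN final_status_code IS NULL THEN 'uncrawled'
             ELSE 'crawled, no hit'
           END AS outcome,
           COUNT(*)
    FROM crawl_result
    GROUP BY outcome
    ORDER BY outcome;
    """
).fetchall()
outcomes = dict(outcome_rows)
print(outcomes)
```

The NULL test has to be written as IS NULL; a comparison such as final_status_code = NULL never matches in SQL. An empty result for the uncrawled query, as in this report, means every seed URL was at least attempted.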

Top blocked, final domains:

final_domain COUNT(*)
140.115.82.191 1
classes.maxwell.syr.edu 1
drona.csa.iisc.ernet.in 1
lamar.colostate.edu 1
linux46.ma.utexas.edu 1
mathro.fpms.ac.be 1
pdl.cmu.edu 1
sammelpunkt.philo.at 1
suma.ldc.usb.ve 1
virtualmentor.ama-assn.org 1
www.cais.ntu.edu.sg 1
www.cse.ucla.edu 1
www.ece.stevens-tech.edu 1
www.lance.colostate.edu 1
www2.asanet.org 1
QUERY: SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND (final_status_code='-61' OR final_status_code='-2') GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;

Top rate-limited, final domains:

final_domain COUNT(*)
www.researchgate.net 6
openknowledge.worldbank.org 1
QUERY: SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code='429' GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20;

Status Summary

Top failure status codes:

final_status_code COUNT(*)
404 112
301 85
403 61
302 60
-6 36
303 21
-2 15
429 7
503 7
200 5
QUERY: SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=0 GROUP BY final_status_code ORDER BY count(*) DESC LIMIT 10;
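
Grouping the codes by class makes the failure profile easier to read. The sketch below does that for the table above. It assumes the negative values are crawler-internal fetch errors (as in Heritrix-style status codes) rather than HTTP responses; the report does not define them.

```python
from collections import Counter

# Counts copied from the "Top failure status codes" table above.
failure_counts = {
    "404": 112, "301": 85, "403": 61, "302": 60, "-6": 36,
    "303": 21, "-2": 15, "429": 7, "503": 7, "200": 5,
}


def status_class(code):
    """Map a status code string to a coarse class such as '4xx'."""
    value = int(code)
    if value < 0:
        # Assumption: negative codes are crawler-side errors, not HTTP.
        return "crawler error"
    return f"{value // 100}xx"


by_class = Counter()
for code, n in failure_counts.items():
    by_class[status_class(code)] += n

for name, n in by_class.most_common():
    print(name, n)
```

Of the 409 rows covered by the ten listed codes, 180 ended on a 4xx response and 166 on a 3xx redirect, with 51 crawler-side errors, 7 server errors, and 5 rows that returned 200 but were still not counted as hits.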

Example Results

A handful of random success lines:

identifier initial_url breadcrumbs final_url final_sha1 final_mimetype
10.1017/s0022149x00006660 https://www.cambridge.org/core/services/aop-cambridge-core/content/view/A291CBD43AD6F7FA0F44E6592E214060/S0022149X00006660a.pdf/div-class-title-jhl-volume-54-issue-4-cover-and-back-matter-div.pdf - https://www.cambridge.org/core/services/aop-cambridge-core/content/view/A291CBD43AD6F7FA0F44E6592E214060/S0022149X00006660a.pdf/div-class-title-jhl-volume-54-issue-4-cover-and-back-matter-div.pdf W7UGJ7XAIILAEZFHH73FZ7XH5XRUENOZ application/pdf
10.7712/100016.2380.8613 https://files.eccomasproceedia.org/papers/eccomas-congress-2016/8613.pdf?mtime=20170308165111 - https://files.eccomasproceedia.org/papers/eccomas-congress-2016/8613.pdf?mtime=20170308165111 FM5ZQWTUQ2N7T7SXFNLCVA6N5RWQRTI6 application/pdf
(none) https://aaltodoc.aalto.fi/bitstream/handle/123456789/17665/A1_hakonen_pertti_j_1987.pdf;jsessionid=F5E9AAC28EEB3F2E2ECA2997AA0A194B?sequence=1 R https://aaltodoc.aalto.fi/bitstream/handle/123456789/17665/A1_hakonen_pertti_j_1987.pdf;jsessionid=F5E9AAC28EEB3F2E2ECA2997AA0A194B?sequence=1 4OUP6PQQ6CISN26ZSYSI7YK4QZG2VBCH application/pdf
(none) https://hal.archives-ouvertes.fr/hal-01578692/document - https://hal.archives-ouvertes.fr/hal-01578692/document 6USL3UAMYQSKX2CLZXZ3N7YA7RBE4MAZ application/pdf
(none) http://www.jafs.com.pl/pdf-80904-17172?filename=Effect - http://www.jafs.com.pl/pdf-80904-17172?filename=Effect WHHSO2BB3AYSYOMNWAQLFJXA6RSDK4SZ application/pdf
10.1109/lcomm.2012.120312.121675 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.337.8390&rep=rep1&type=pdf - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.337.8390&rep=rep1&type=pdf LP4ZFJ36GN6N7PKWSCLXFSQQFHTEZD3O application/pdf
(none) http://www.jafs.com.pl/pdf-77058-14511?filename=Effects - http://www.jafs.com.pl/pdf-77058-14511?filename=Effects YCHB676GBGVZH5O5CAH7EM2USTRVH5VL application/pdf
(none) https://content.iospress.com/download/information-services-and-use/isu851?id=information-services-and-use%2Fisu851 - https://content.iospress.com/download/information-services-and-use/isu851?id=information-services-and-use%2Fisu851 NFITUUUWEGUOI6OWWBVI45Z5JQQV4QBI application/pdf
10.1007/bf02907787 https://link.springer.com/content/pdf/10.1007%2FBF02907787.pdf - https://link.springer.com/content/pdf/10.1007%2FBF02907787.pdf GF4XYUGTDKK4JL7FFLTJXMJJAZLCPQZ2 application/pdf
10.2172/73948 https://digital.library.unt.edu/ark:/67531/metadc704352/m2/1/high_res_d/73948.pdf - https://digital.library.unt.edu/ark:/67531/metadc704352/m2/1/high_res_d/73948.pdf KKSZMZOTULQNXFHQKO4VGMXWI36NIZKH application/pdf
QUERY: SELECT identifier, initial_url, breadcrumbs, final_url, final_sha1, final_mimetype FROM crawl_result WHERE hit=1 ORDER BY random() LIMIT 10;
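
ORDER BY random() draws a different sample on every run, so these example lines change each time the report is regenerated. If a repeatable sample is ever wanted, one option (not something this report does) is to fetch the rows in a fixed order and sample on the client with a fixed seed. A sketch on made-up rows, with the helper name sample_rows chosen for illustration:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl_result (identifier TEXT, hit INTEGER)")
# Hypothetical rows: 20 identifiers, odd-numbered ones marked as hits.
conn.executemany(
    "INSERT INTO crawl_result VALUES (?, ?)",
    [(f"id-{i}", i % 2) for i in range(20)],
)


def sample_rows(conn, hit, k, seed):
    """Return up to k rows with the given hit flag, chosen reproducibly."""
    rows = conn.execute(
        "SELECT identifier FROM crawl_result WHERE hit=? ORDER BY rowid",
        (hit,),
    ).fetchall()
    rng = random.Random(seed)  # private generator, so the seed is local
    return rng.sample(rows, min(k, len(rows)))


first = sample_rows(conn, hit=1, k=5, seed=42)
second = sample_rows(conn, hit=1, k=5, seed=42)
print(first == second)
```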

A handful of random non-success lines:

identifier initial_url breadcrumbs final_url final_status_code final_mimetype
10.1109/78.661335 http://www-sccm.stanford.edu/Students/vanderveen/SPtrans98b.ps.gz - http://www-sccm.stanford.edu/Students/vanderveen/SPtrans98b.ps.gz -6 application/octet-stream
10.1109/mobhoc.2009.5336965 http://www.cis.umassd.edu/%7Exbai/pubs/J-DirectionalCoverage.pdf - http://www.cis.umassd.edu/%7Exbai/pubs/J-DirectionalCoverage.pdf 404 text/html
10.2340/00015555-1505 https://www.medicaljournals.se/acta/content_files/download.php?doi=10.2340/00015555-1505 - https://www.medicaljournals.se/acta/content_files/download.php?doi=10.2340/00015555-1505 403 text/html
10.1016/s0166-3542(01)00195-4 http://dissertations.ub.rug.nl/FILES/faculties/science/2001/b.w.a.van.der.strate/c1.pdf - http://dissertations.ub.rug.nl/FILES/faculties/science/2001/b.w.a.van.der.strate/c1.pdf -6 application/octet-stream
10.1145/996566.996624 http://www2.dac.com/41st/41acceptedpapers.nsf/0c4c09c6ffa905c487256b7b007afb72/b23ec16f6e1fc42c87256e54007a1f0a/$file/13_3.pdf - http://www2.dac.com/41st/41acceptedpapers.nsf/0c4c09c6ffa905c487256b7b007afb72/b23ec16f6e1fc42c87256e54007a1f0a/$file/13_3.pdf 404 text/html
10.1080/07438141.2011.627625 http://www.tandfonline.com/doi/pdf/10.1080/07438141.2011.627625?needAccess=true - http://www.tandfonline.com/doi/pdf/10.1080/07438141.2011.627625?needAccess=true 302 text/html
10.1152/physiolgenomics.00296.2005 http://physiolgenomics.physiology.org/content/physiolgenomics/26/1/91.full.pdf - http://physiolgenomics.physiology.org/content/physiolgenomics/26/1/91.full.pdf 301 application/octet-stream
10.1111/j.1540-6261.2006.01064.x http://www.efmaefm.org/efmsympo2005/accepted_papers/06-Neil_Brisley_paper.pdf - http://www.efmaefm.org/efmsympo2005/accepted_papers/06-Neil_Brisley_paper.pdf 404 text/html
10.1109/18.923725 http://web.mit.edu/bchen/www/pubs/it01-chen.pdf - http://web.mit.edu/bchen/www/pubs/it01-chen.pdf 404 text/html
10.2991/iccia.2012.347 http://download.atlantis-press.com/php/download_paper.php?id=4295 - http://download.atlantis-press.com/php/download_paper.php?id=4295 301 text/html
10.1126/science.1164647 https://www.orgchem.science.ru.nl/pubs/10.1126_1668.pdf - https://www.orgchem.science.ru.nl/pubs/10.1126_1668.pdf 403 text/html
10.1080/000155500750012298 https://medicaljournals.se/acta/content_files/download.php?doi=10.1080/000155500750012298 - https://medicaljournals.se/acta/content_files/download.php?doi=10.1080/000155500750012298 403 text/html
10.1109/icpr.1996.546998 http://www.ee.ed.ac.uk/~sasg/Papers/96_papers/ICPR96_whn.ps - http://www.ee.ed.ac.uk/~sasg/Papers/96_papers/ICPR96_whn.ps -6 application/octet-stream
10.1137/s106482750241565x http://www.seas.upenn.edu/~biros/papers/lnks/paper.pdf R https://www.seas.upenn.edu/~biros/papers/lnks/paper.pdf 404 text/html
10.2340/00015555-1046 https://www.medicaljournals.se/acta/content_files/download.php?doi=10.2340/00015555-1046 - https://www.medicaljournals.se/acta/content_files/download.php?doi=10.2340/00015555-1046 403 text/html
10.2991/sschd-16.2016.23 http://download.atlantis-press.com/php/download_paper.php?id=25860593 R https://download.atlantis-press.com/php/download_paper.php?id=25860593 302 application/octet-stream
10.1152/jn.2001.85.6.2613 http://www.nada.kth.se/~anfa/smalllargeforce.pdf - http://www.nada.kth.se/~anfa/smalllargeforce.pdf 403 text/html
10.1152/jn.00416.2002 http://jn.physiology.org/content/jn/89/1/12.full.pdf - http://jn.physiology.org/content/jn/89/1/12.full.pdf 301 application/octet-stream
10.1152/physiolgenomics.00086.2011 http://physiolgenomics.physiology.org/content/physiolgenomics/43/21/1241.full.pdf - http://physiolgenomics.physiology.org/content/physiolgenomics/43/21/1241.full.pdf 301 application/octet-stream
10.3732/ajb.1300036 http://www.amjbot.org/content/100/10/2016.full.pdf - http://www.amjbot.org/content/100/10/2016.full.pdf 404 text/html
10.2139/ssrn.1458963 http://www.efmaefm.org/0EFMAMEETINGS/EFMA%20ANNUAL%20MEETINGS/2010-Aarhus/EFMA2010_0074_fullpaper.pdf - http://www.efmaefm.org/0EFMAMEETINGS/EFMA%20ANNUAL%20MEETINGS/2010-Aarhus/EFMA2010_0074_fullpaper.pdf 503 text/html
10.1152/ajpgi.00160.2012 http://ajpgi.physiology.org/content/ajpgi/304/10/G897.full.pdf - http://ajpgi.physiology.org/content/ajpgi/304/10/G897.full.pdf 301 application/octet-stream
10.1080/09853111.2007.9736326 https://www.tandfonline.com/doi/pdf/10.1080/09853111.2007.9736326?needAccess=true R https://www.tandfonline.com/doi/pdf/10.1080/09853111.2007.9736326?needAccess=true&cookieSet=1 302 text/html
10.1152/japplphysiol.00624.2004 http://jap.physiology.org/content/jap/99/2/665.full.pdf - http://jap.physiology.org/content/jap/99/2/665.full.pdf 301 application/octet-stream
10.4304/jnw.4.6.436-444 http://academypublisher.net/jnw/vol04/no06/jnw0406436444.pdf - http://academypublisher.net/jnw/vol04/no06/jnw0406436444.pdf -6 application/octet-stream
QUERY: SELECT identifier, initial_url, breadcrumbs, final_url, final_status_code, final_mimetype FROM crawl_result WHERE hit=0 ORDER BY random() LIMIT 25;