author | Bryan Newbold <bnewbold@archive.org> | 2019-02-28 12:01:12 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-02-28 12:01:12 -0800 |
commit | dd50d0bb50549598da2e9ed595a12f154636ae2e (patch) | |
tree | 7220a89aac25fbd77acdd516f8f4845ee3e44852 | |
parent | b5f283913367d28578ad403b850e3a5c2e380887 (diff) | |
import arabesque script
-rw-r--r-- | README.md | 327
-rwxr-xr-x | arabesque.py | 665
-rw-r--r-- | report_template.md | 108
-rw-r--r-- | test.sqlite | bin | 0 -> 57344 bytes
-rw-r--r-- | test.tsv | 10
5 files changed, 1110 insertions, 0 deletions
diff --git a/README.md b/README.md new file mode 100644 index 0000000..2ae80e8 --- /dev/null +++ b/README.md @@ -0,0 +1,327 @@ + +Going to look something like: + + zcat DOI-LANDING-CRAWL-2018-06-full_crawl_logs/DOI-LANDING-CRAWL-2018-06.$SHARD.us.archive.org.crawl.log.gz | tr -cd '[[:print:]]\n\r\t' | rg '//doi.org/' | /fast/scratch/unpaywall/make_doi_list.py > doi_list.$SHARD.txt + + zcat /fast/unpaywall-munging/DOI-LANDING-CRAWL-2018-06/DOI-LANDING-CRAWL-2018-06-full_crawl_logs/DOI-LANDING-CRAWL-2018-06.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD.db + + cat /fast/unpaywall-munging/DOI-LANDING-CRAWL-2018-06/doi_list.$SHARD.txt | pv | /fast/scratch/unpaywall/make_output.py redirectmap.$SHARD.db > doi_index.$SHARD.tsv + +Let's start with: + + mkdir UNPAYWALL-PDF-CRAWL-2018-07 + ia download UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs + +export SHARD=wbgrp-svc279 # running +export SHARD=wbgrp-svc280 # running +export SHARD=wbgrp-svc281 # running +export SHARD=wbgrp-svc282 # running +zcat UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs/UNPAYWALL-PDF-CRAWL-2018-07.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD.db +zcat UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs/UNPAYWALL-PDF-CRAWL-2018-07-PATCH.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD-PATCH.db + +### Design + +If possible, we'd like something that will work with as many crawls as +possible. Want to work with shards, then merge outputs. + +Output: JSON and/or sqlite rows with: + +- identifier (optional?) +- initial-uri (indexed) +- breadcrumbs +- final-uri +- final-http-status +- final-sha1 +- final-mimetype-normalized +- final-was-dedupe (boolean) +- final-cdx (string, if would be extracted) + +This will allow filtering on various fields, checking success stats, etc. + +Components: + +- {identifier, initial-uri} input (basically, seedlist) +- full crawl logs +- raw CDX, indexed by final-uri +- referer map + +Process: + +- use full crawl logs to generate a referer map; this is a dict with keys as + URI, and value as {referer URI, status, breadcrumb, was-dedupe, mimetype}; + the referer may be null. database can be whatever. +- iterate through CDX, filtering by HTTP status and mimetype (including + revists). for each potential, lookup in referer map. if mimetype is + confirmed, then iterate through full referer chain, and print a final line + which is all-but-identifier +- iterate through identifier/URI list, inserting identifier columns + +Complications: + +- non-PDF terminals: error codes, or HTML only (failed to find PDF) +- multiple terminals per seed; eg, multiple PDFs, or PDF+postscript+HTML or + whatever + +Process #2: + +- use full crawl logs to generate a bi-directional referer map: sqlite3 table + with uri, referer-uri both indexed. also {status, breadcrumb, was-dedupe, + mimetype} rows +- iterate through CDX, selecting successful "terminal" lines (mime, status). + use referer map to iterate back to an initial URI, and generate a row. lookup + output table by initial-uri; if an entry already exists, behavior is + flag-dependent: overwrite if "better", or add a second line +- in a second pass, update rows with identifier based on URI. if rows not + found/updated, then do a "forwards" lookup to a terminal condition, and write + that status. note that these rows won't have CDX. + +More Complications: + +- handling revisits correctly... 
raw CDX probably not actually helpful for PDF + case, only landing/HTML case +- given above, should probably just (or as a mode) iterate over only crawl logs + in "backwards" stage +- fan-out of "forward" redirect map, in the case of embeds and PDF link + extraction +- could pull out first and final URI domains for easier SQL stats/reporting +- should include final datetime (for wayback lookups) + +NOTE/TODO: journal-level dumps of fatcat metadata would be cool... could +roll-up release dumps as an alternative to hitting elasticsearch? or just hit +elasticsearch and both dump to sqlite and enrich elastic doc? should probably +have an indexed "last updated" timestamp in all elastic docs + +### Crawl Log Notes + +Fields: + + 0 timestamp (ISO8601) of log line + 1 status code (HTTP or negative) + 2 size in bytes (content only) + 3 URI of this download + 4 discovery breadcrumbs + 5 "referer" URI + 6 mimetype (as reported?) + 7 worker thread + 8 full timestamp (start of network fetch; this is dt?) + 9 SHA1 + 10 source tag + 11 annotations + 12 partial CDX JSON + +### External Prep for, Eg, Unpaywall Crawl + + export LC_ALL=C + sort -S 8G -u seedlist.shard > seedlist.shard.sorted + + zcat unpaywall_20180621.pdf_meta.tsv.gz | awk '{print $2 "\t" $1}' | sort -S 8G -u > unpaywall_20180621.seed_id.tsv + + join -t $'\t' unpaywall_20180621.seed_id.tsv unpaywall_crawl_patch_seedlist.split_3.schedule.sorted > seed_id.shard.tsv + +TODO: why don't these sum/match correctly? + + bnewbold@orithena$ wc -l seed_id.shard.tsv unpaywall_crawl_patch_seedlist.split_3.schedule.sorted + 880737 seed_id.shard.tsv + 929459 unpaywall_crawl_patch_seedlist.split_3.schedule.sorted + + why is: + http://00ec89c.netsolhost.com/brochures/200605_JAWMA_Hg_Paper_Lee_Hastings.pdf + in unpaywall_crawl_patch_seedlist, but not unpaywall_20180621.pdf_meta? + + # Can't even filter on HTTP 200, because revisits are '-' + #zcat UNPAYWALL-PDF-CRAWL-2018-07.cdx.gz | rg 'wbgrp-svc282' | rg ' 200 ' | rg '(pdf)|(revisit)' > UNPAYWALL-PDF-CRAWL-2018-07.svc282.filtered.cdx + + zcat UNPAYWALL-PDF-CRAWL-2018-07.cdx.gz | rg 'UNPAYWALL-PDF-CRAWL-2018-07-PATCH' | rg 'wbgrp-svc282' | rg '(pdf)|( warc/revisit )|(postscript)|( unk )' > UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx + +TODO: spaces in URLs, like 'https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf' + +### Revisit Notes + +Neither CDX nor crawl logs seem to have revisits actually point to final +content, they just point to the revisit record in the (crawl-local) WARC. + +### sqlite3 stats + + select count(*) from crawl_result; + + select count(*) from crawl_result where identifier is null; + + select breadcrumbs, count(*) from crawl_result group by breadcrumbs; + + select final_was_dedupe, count(*) from crawl_result group by final_was_dedupe; + + select final_http_status, count(*) from crawl_result group by final_http_status; + + select final_mimetype, count(*) from crawl_result group by final_mimetype; + + select * from crawl_result where final_mimetype = 'text/html' and final_http_status = '200' order by random() limit 5; + + select count(*) from crawl_result where final_uri like 'https://academic.oup.com/Govern%'; + + select count(distinct identifier) from crawl_result where final_sha1 is not null; + +### testing shard notes + +880737 `seed_id` lines +21776 breadcrumbs are null (no crawl logs line); mostly normalized URLs? +24985 "first" URIs with no identifier; mostly normalized URLs? 
+ +backward: Counter({'skip-cdx-scope': 807248, 'inserted': 370309, 'skip-map-scope': 2913}) +forward (dirty): Counter({'inserted': 509242, 'existing-id-updated': 347218, 'map-uri-missing': 15556, 'existing-complete': 8721, '_normalized-seed-uri': 5520}) + +874131 identifier is not null +881551 breadcrumbs is not null +376057 final_mimetype is application/pdf +370309 final_sha1 is not null +332931 application/pdf in UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx + +summary: + 370309/874131 42% got a PDF + 264331/874131 30% some domain dead-end + 196747/874131 23% onlinelibrary.wiley.com + 33879/874131 4% www.nature.com + 11074/874131 1% www.tandfonline.com + 125883/874131 14% blocked, 404, other crawl failures + select count(*) from crawl_result where final_http_status >= '400' or final_http_status < '200'; + 121028/874131 14% HTTP 200, but not pdf + 105317/874131 12% academic.oup.com; all rate-limited or cookie fail + 15596/874131 1.7% didn't even try crawling (null final status) + +TODO: +- add "success" flag (instead of "final_sha1 is null") +- + + http://oriental-world.org.ua/sites/default/files/Archive/2017/3/4.pdf 10.15407/orientw2017.03.021 - http://oriental-world.org.ua/sites/default/files/Archive/2017/3/4.pdf 403 ¤ application/pdf 0 ¤ + +Iterated: + +./arabesque.py backward UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx map.sqlite out.sqlite +Counter({'skip-cdx-scope': 813760, 'inserted': 370435, 'skip-map-scope': 4620, 'skip-tiny-octetstream-': 30}) + +./arabesque.py forward unpaywall_20180621.seed_id.shard.tsv map.sqlite out.sqlite +Counter({'inserted': 523594, 'existing-id-updated': 350009, '_normalized-seed-uri': 21371, 'existing-complete': 6638, 'map-uri-missing': 496}) + +894029 breadcrumbs is not null +874102 identifier is not null +20423 identifier is null +496 breadcrumbs is null +370435 final_sha1 is not null + +### URL/seed non-match issues! + +Easily fixable: +- capitalization of domains +- empty port number, like `http://genesis.mi.ras.ru:/~razborov/hadamard.ps` + +Encodable: +- URL encoding + http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11[1].pdf + http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11%5B1%5D.pdf +- whitespace in URL (should be url-encoded) + https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf + https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A%EF%BF%BD%EF%BF%BDcase.pdf +- tricky hidden unicode + http://goldhorde.ru/wp-content/uploads/2017/03/ЗО-1-2017-206-212.pdf + http://goldhorde.ru/wp-content/uploads/2017/03/%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD-1-2017-206-212.pdf + +Harder/Custom? +- paths including "/../" or "/./" are collapsed +- port number 80, like `http://fermet.misis.ru:80/jour/article/download/724/700` +- aos2.uniba.it:8080papers + +- fragments stripped by crawler: 'https://www.termedia.pl/Journal/-85/pdf-27083-10?filename=BTA#415-06-str307-316.pdf' + +### Debugging "redirect terminal" issue + +Some are redirect loops; fine. + +Some are from 'cookieSet=1' redirects, like 'http://journals.sagepub.com/doi/pdf/10.1177/105971230601400206?cookieSet=1'. 
This comes through like: + + sqlite> select * from crawl_result where initial_uri = 'http://adb.sagepub.com/cgi/reprint/14/2/147.pdf'; + initial_uri identifier breadcrumbs final_uri final_http_status final_sha1 final_mimetype final_was_dedupe final_cdx + http://adb.sagepub.com/cgi/reprint/14/2/147.pdf 10.1177/105971230601400206 R http://journals.sagepub.com/doi/pdf/10.1177/105971230601400206 302 ¤ text/html 0 ¤ + +Using 'http' (note: this is not an OA article): + + http://adb.sagepub.com/cgi/reprint/14/2/147.pdf + https://journals.sagepub.com/doi/pdf/10.1177/105971230601400206 + https://journals.sagepub.com/doi/pdf/10.1177/105971230601400206?cookieSet=1 + http://journals.sagepub.com/action/cookieAbsent + +Is heritrix refusing to do that second redirect? In some cases it will do at least the first, like: + + http://pubs.rsna.org/doi/pdf/10.1148/radiographics.11.1.1996385 + http://pubs.rsna.org/doi/pdf/10.1148/radiographics.11.1.1996385?cookieSet=1 + http://pubs.rsna.org/action/cookieAbsent + +I think the vast majority of redirect terminals are when we redirect to a page +that has already been crawled. This is a bummer because we can't find the +redirect target in the logs. + +Eg, academic.oup.com sometimes redirects to cookieSet, then cookieAbsent; other +times it redirects to Governer. It's important to distinguish between these. + +### Scratch + +What are actual advantages/use-cases of CDX mode? +=> easier CDX-to-WARC output mode +=> sending CDX along with WARCs as an index + +Interested in scale-up behavior: full unpaywall PDF crawls, and/or full DOI landing crawls +=> eatmydata + + + zcat UNPAYWALL-PDF-CRAWL-2018-07-PATCH* | time /fast/scratch/unpaywall/arabesque.py referrer - UNPAYWALL-PDF-CRAWL-2018-07-PATCH.map.sqlite + [snip] + ... referrer 5542000 + Referrer map complete. + 317.87user 274.57system 21:20.22elapsed 46%CPU (0avgtext+0avgdata 22992maxresident)k + 24inputs+155168464outputs (0major+802114minor)pagefaults 0swaps + + bnewbold@ia601101$ ls -lathr + -rw-r--r-- 1 bnewbold bnewbold 1.7G Dec 12 12:33 UNPAYWALL-PDF-CRAWL-2018-07-PATCH.map.sqlite + +Scaling! + + 16,736,800 UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.us.archive.org.crawl.log + 17,215,895 unpaywall_20180621.seed_id.tsv + +Oops; need to shard the seed_id file. + +Ugh, this one is a little derp because I didn't sort correctly. Let's say close enough though... + + 4318674 unpaywall_crawl_seedlist.svc282.tsv + 3901403 UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.seed_id.tsv + + +/fast/scratch/unpaywall/arabesque.py everything CORE-UPSTREAM-CRAWL-2018-11.combined.log core_2018-03-01_metadata.seed_id.tsv CORE-UPSTREAM-CRAWL-2018-11.out.sqlite + + Counter({'inserted': 3226191, 'skip-log-scope': 2811395, 'skip-log-prereq': 108932, 'skip-tiny-octetstream-': 855, 'skip-map-scope': 2}) + Counter({'existing-id-updated': 3221984, 'inserted': 809994, 'existing-complete': 228909, '_normalized-seed-uri': 17287, 'map-uri-missing': 2511, '_redirect-recursion-limit': 221, 'skip-bad-seed-uri': 17}) + +time /fast/scratch/unpaywall/arabesque.py everything UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.us.archive.org.crawl.log UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.seed_id.tsv UNPAYWALL-PDF-CRAWL-2018-07.out.sqlite + + Everything complete!
+ Counter({'skip-log-scope': 13476816, 'inserted': 2536452, 'skip-log-prereq': 682460, 'skip-tiny-octetstream-': 41067}) + Counter({'existing-id-updated': 1652463, 'map-uri-missing': 1245789, 'inserted': 608802, 'existing-complete': 394349, '_normalized-seed-uri': 22573, '_redirect-recursion-limit': 157}) + + real 63m42.124s + user 53m31.007s + sys 6m50.535s + +### Performance + +Before tweaks: + + real 2m55.975s + user 2m6.772s + sys 0m12.684s + +After: + + real 1m51.500s + user 1m44.600s + sys 0m3.496s + diff --git a/arabesque.py b/arabesque.py new file mode 100755 index 0000000..a4b3e07 --- /dev/null +++ b/arabesque.py @@ -0,0 +1,665 @@ +#!/usr/bin/env python3 + +""" +This is a multi-function script for generating a particular type of crawl +report output: a table of identifiers, seed URLs, and crawl results, taking in +to account long redirect chains. + +Commands/modes: +- referrer <input.log> <output-map.sqlite> +- backward_cdx <input.cdx> <input-map.sqlite> <output.sqlite> +- backward <input.log> <input-map.sqlite> <output.sqlite> +- forward <input.seed_identifiers> <output.sqlite> +- everything <input.log> <input.cdx> <input.seed_identifiers> <output.sqlite> + +Design docs in README_pdf_crawl.md + +TODO: +- open map in read-only when appropriate +- some kind of stats dump command? (querying sqlite) +- should referrer map be UNIQ? +- forward outputs get generated multiple times? +- try: https://pypi.org/project/urlcanon/ +- https://www.talisman.org/~erlkonig/misc/lunatech%5Ewhat-every-webdev-must-know-about-url-encoding/ + +BUG: +BAD LOG LINE: 2018-07-27T12:26:24.783Z 200 24 http://www.phywe-es.com/robots.txt 15+LREELLLLLRLELLRLLLRLLLLRLELRRLLLLLLLLLLLLLLLLLLLLP http://www.phywe-es.com/index.php/fuseaction/download/lrn_file/versuchsanleitungen/P2522015/tr/P2522015.pdf text/html #296 20180727122622741+438 sha1:YR6M6GSJYJGMLBBEGCVHLRZO6SISSJAS - unsatisfiableCharsetInHeader:ISO 8859-1 {"contentSize":254,"warcFilename":"UNPAYWALL-PDF-CRAWL-2018-07-20180727122113315-14533-11460~wbgrp-svc282.us.archive.org~8443.warc.gz","warcFileOffset":126308355} +""" + +import sys +import json +import time +import urllib +import urllib3 +import sqlite3 +import argparse +import collections + +CrawlLine = collections.namedtuple('CrawlLine', [ + 'log_time', + 'status_code', + 'size_bytes', + 'url', + 'breadcrumbs', + 'referrer_url', + 'mimetype', + 'worker_thread', + 'timestamp', + 'sha1', + 'source_tag', + 'annotations', + 'cdx_json']) + +FullCdxLine = collections.namedtuple('FullCdxLine', [ + 'surt', + 'datetime', + 'url', + 'mimetype', + 'status_code', # will be '-' for warc/revist + 'sha1', + 'unused1', + 'unused2', + 'c_size', + 'offset', + 'warc']) + +ReferrerRow = collections.namedtuple('ReferrerRow', [ + 'url', + 'referrer_url', + 'status_code', + 'breadcrumbs', + 'mimetype', + 'is_dedupe']) + +NORMAL_MIMETYPE = ( + 'application/pdf', + 'application/postscript', + 'text/html', + 'text/xml', + 'warc/revisit', + 'application/octet-stream', +) + +FULLTEXT_MIMETYPES = ( + "application/pdf", + "application/postscript", + "application/octet-stream", +) + +def normalize_mimetype(raw): + raw = raw.lower() + raw = raw.replace('"', '').replace("'", '').replace(',', '') + for norm in NORMAL_MIMETYPE: + if raw.startswith(norm): + return norm + + # Special cases + if raw.startswith('application/xml'): + return 'text/xml' + if raw.startswith('application/x-pdf'): + return 'application/pdf' + if raw in ('unk', 'unknown', 'other'): + return 'application/octet-stream' + return raw + +def normalize_url(raw): + """ + 
This is a surprisingly complex function that cleans up URLs. + + Particular issues it fixes: + - lower-cases scheme and host/domain + - removes blank or redundant port numbers + - removes fragments (anchor tags) + - URL escapes characters in path, particularly whitespace + + Some of these we maybe shouldn't do, but heritrix does by default + + TODO: heritrix removes fragments, but maybe we shouldn't? + """ + try: + u = urllib3.util.parse_url(raw.strip()) + except: + return None + + if u.path is None: + return None + port = u.port + if (port == 80 and u.scheme == 'http') or (port == 443 and u.scheme == 'https'): + port = None + + # "Dot Segment Removal" per RFC 3986 + # via https://stackoverflow.com/a/40536710/4682349 + segments = u.path.split('/') + segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]] + resolved = [] + for segment in segments: + if segment in ('../', '..'): + if resolved[1:]: + resolved.pop() + elif segment not in ('./', '.'): + resolved.append(segment) + path = ''.join(resolved) + + # add any missing slash if there is a query or fragment + if (u.query or u.fragment) and not path: + path = '/' + + # URL encode the path segment for HTTP URLs + if u.scheme in ('http', 'https'): + path = urllib.parse.quote(path, safe='/%~+:();$!,') + result = urllib3.util.Url(u.scheme, u.auth, u.host.lower(), port, path, u.query, None) + return result.url.replace(' ', '%20') + +def test_normalize_url(): + + assert (normalize_url('HTTP://ASDF.com/a/../b') == 'http://asdf.com/b') + assert (normalize_url('HTTP://ASDF.com/a/./b') == 'http://asdf.com/a/b') + assert (normalize_url('HTTP://ASDF.com:/') == 'http://asdf.com/') + assert (normalize_url('HTTP://ASDF.com:80/') == 'http://asdf.com/') + assert (normalize_url('HTTP://ASDF.com:80papers/123.pdf') == None) + assert (normalize_url('HTTP://ASDF.com/a.pdf#123') == 'http://asdf.com/a.pdf') + assert (normalize_url('HTTPs://ASDF.com:443/') == 'https://asdf.com/') + assert (normalize_url('HTTP://ASDF.com/a/../b') == 'http://asdf.com/b') + assert (normalize_url('HTTP://ASDF.com/first second') == 'http://asdf.com/first%20second') + assert (normalize_url('HTTP://ASDF.com/first%20second') == 'http://asdf.com/first%20second') + assert (normalize_url('Ftp://ASDF.com/a/../b') == 'ftp://asdf.com/b') + + #assert (normalize_url('http://goldhorde.ru/wp-content/uploads/2017/03/ЗО-1-2017-206-212.pdf') == + # 'http://goldhorde.ru/wp-content/uploads/2017/03/%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD-1-2017-206-212.pdf') + assert (normalize_url('http://goldhorde.ru/wp-content/uploads/2017/03/ЗО-1-2017-206-212.pdf') == + 'http://goldhorde.ru/wp-content/uploads/2017/03/%D0%97%D0%9E-1-2017-206-212.pdf') + + assert (normalize_url('http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11[1].pdf') == + 'http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11%5B1%5D.pdf') + #assert (normalize_url('https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf') == + # 'https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A%EF%BF%BD%EF%BF%BDcase.pdf') + assert (normalize_url('https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf') == + 'https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A%20case.pdf') + assert (normalize_url('http://mariel.inesc.pt/~lflb/ma98.pdf') == + 'http://mariel.inesc.pt/~lflb/ma98.pdf') + assert (normalize_url('http://ijpsr.com?action=download_pdf&postid=9952') == + 'http://ijpsr.com/?action=download_pdf&postid=9952') + assert 
(normalize_url('http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-0518(199702)35:3<587::AID-POLA25>3.0.CO;2-J/pdf') == + 'http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-0518(199702)35:3%3C587::AID-POLA25%3E3.0.CO;2-J/pdf') + assert (normalize_url('http://ntj.tax.org/wwtax/ntjrec.nsf/175d710dffc186a385256a31007cb40f/5e1815e49ceb7d318525796800526cf8/$FILE/A04_Cole.pdf') == + 'http://ntj.tax.org/wwtax/ntjrec.nsf/175d710dffc186a385256a31007cb40f/5e1815e49ceb7d318525796800526cf8/$FILE/A04_Cole.pdf') + assert (normalize_url('http://www.nature.com:80/') == + 'http://www.nature.com/') + assert (normalize_url('http://www.nature.com:80/polopoly_fs/1.22367!/menu/main/topColumns/topLeftColumn/pdf/547389a.pdf') == + 'http://www.nature.com/polopoly_fs/1.22367!/menu/main/topColumns/topLeftColumn/pdf/547389a.pdf') + assert (normalize_url('http://pdfs.journals.lww.com/transplantjournal/2012/11271/Extra_Pulmonary_Nocardiosis_and_Perigraft_Abscess,.1668.pdf?token=method|ExpireAbsolute;source|Journals;ttl|1503135985283;payload|mY8D3u1TCCsNvP5E421JYK6N6XICDamxByyYpaNzk7FKjTaa1Yz22MivkHZqjGP4kdS2v0J76WGAnHACH69s21Csk0OpQi3YbjEMdSoz2UhVybFqQxA7lKwSUlA502zQZr96TQRwhVlocEp/sJ586aVbcBFlltKNKo+tbuMfL73hiPqJliudqs17cHeLcLbV/CqjlP3IO0jGHlHQtJWcICDdAyGJMnpi6RlbEJaRheGeh5z5uvqz3FLHgPKVXJzdGlb2qsojlvlytk14LkMXSI/t5I2LVgySZVyHeaTj/dJdRvauPu3j5lsX4K1l3siV;hash|9tFBJUOSJ1hYPXrgBby2Xg==') == + 'http://pdfs.journals.lww.com/transplantjournal/2012/11271/Extra_Pulmonary_Nocardiosis_and_Perigraft_Abscess,.1668.pdf?token=method|ExpireAbsolute;source|Journals;ttl|1503135985283;payload|mY8D3u1TCCsNvP5E421JYK6N6XICDamxByyYpaNzk7FKjTaa1Yz22MivkHZqjGP4kdS2v0J76WGAnHACH69s21Csk0OpQi3YbjEMdSoz2UhVybFqQxA7lKwSUlA502zQZr96TQRwhVlocEp/sJ586aVbcBFlltKNKo+tbuMfL73hiPqJliudqs17cHeLcLbV/CqjlP3IO0jGHlHQtJWcICDdAyGJMnpi6RlbEJaRheGeh5z5uvqz3FLHgPKVXJzdGlb2qsojlvlytk14LkMXSI/t5I2LVgySZVyHeaTj/dJdRvauPu3j5lsX4K1l3siV;hash|9tFBJUOSJ1hYPXrgBby2Xg==') + +def parse_crawl_line(line): + # yup, it's just whitespace, and yup, there's a JSON blob at the end that + # "hopefully" contains no whitespace + line = line.strip().split() + if len(line) != 13: + return None + # FTP success; need to munge mimetype + if line[3].startswith('ftp://') and line[1] == "226" and line[6] == "application/octet-stream": + if line[3].lower().endswith('.pdf'): + line[6] = "application/pdf" + elif line[3].lower().endswith('.ps'): + line[6] = "application/postscript" + + # mimetype + line[6] = normalize_mimetype(line[6]) + # SHA1 + line[9] = line[9].replace('sha1:', '') + return CrawlLine(*line) + +def parse_full_cdx_line(line): + line = line.strip().split(' ') + assert len(line) == 11 + # mimetype + line[3] = normalize_mimetype(line[3]) + return FullCdxLine(*line) + +def lookup_referrer_row(cursor, url): + #print("Lookup: {}".format(cdx.url)) + raw = list(cursor.execute('SELECT * from referrer WHERE url=? LIMIT 1', [url])) + if not raw: + return None + raw = list(raw[0]) + if not raw[1] or raw[1] == '-': + raw[1] = None + return ReferrerRow(*raw) + +def lookup_all_referred_rows(cursor, url): + #print("Lookup: {}".format(cdx.url)) + # TODO: should this SORT BY? 
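+    # note: without an ORDER BY, matching rows come back in unspecified order; forward() picks among them by breadcrumbs/mimetype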
+ result = list(cursor.execute('SELECT * from referrer WHERE referrer=?', [url])) + if not result: + return None + for i in range(len(result)): + raw = list(result[i]) + if not raw[1] or raw[1] == '-': + raw[1] = None + result[i] = ReferrerRow(*raw) + return result + +def create_out_table(db): + # "eat my data" style database, for speed + # NOTE: don't drop indexes here, because we often reuse DB + db.executescript(""" + PRAGMA main.page_size = 4096; + PRAGMA main.cache_size = 20000; + PRAGMA main.locking_mode = EXCLUSIVE; + PRAGMA main.synchronous = OFF; + PRAGMA main.journal_mode = MEMORY; + + CREATE TABLE IF NOT EXISTS crawl_result + (initial_url text NOT NULL, + identifier text, + initial_domain text, + breadcrumbs text, + final_url text, + final_domain text text, + final_timestamp text, + final_status_code text, + final_sha1 text, + final_mimetype text, + final_was_dedupe bool, + hit bool); + """) + +def referrer(log_file, map_db): + """ + TODO: this would probably be simpler, and much faster, as a simple sqlite3 import from TSV + """ + print("Mapping referrers from crawl logs") + # "eat my data" style database, for speed + map_db.executescript(""" + PRAGMA main.page_size = 4096; + PRAGMA main.cache_size = 20000; + PRAGMA main.locking_mode = EXCLUSIVE; + PRAGMA main.synchronous = OFF; + PRAGMA main.journal_mode = MEMORY; + + CREATE TABLE IF NOT EXISTS referrer + (url text, + referrer text, + status_code text, + breadcrumbs text, + mimetype text, + is_dedupe bool); + DROP INDEX IF EXISTS referrer_url; + DROP INDEX IF EXISTS referrer_referrer; + """) + c = map_db.cursor() + i = 0 + for raw in log_file: + line = parse_crawl_line(raw) + if not line: + print("BAD LOG LINE: {}".format(raw.strip())) + continue + if line.url.startswith('dns:') or line.url.startswith('whois:'): + #print("skipping: {}".format(line.url)) + continue + is_dedupe = 'duplicate:digest' in line.annotations + # insert {url, referrer, status_code, breadcrumbs, mimetype, is_dedupe} + c.execute("INSERT INTO referrer VALUES (?,?,?,?,?,?)", + (line.url, line.referrer_url, line.status_code, line.breadcrumbs, line.mimetype, is_dedupe)) + i = i+1 + if i % 5000 == 0: + print("... referrer {}".format(i)) + map_db.commit() + + map_db.commit() + print("Building indices (this can be slow)...") + c.executescript(""" + CREATE INDEX IF NOT EXISTS referrer_url on referrer (url); + CREATE INDEX IF NOT EXISTS referrer_referrer on referrer (referrer); + """) + c.close() + print("Referrer map complete.") + +def backward_cdx(cdx_file, map_db, output_db, hit_mimetypes=FULLTEXT_MIMETYPES): + """ + TODO: Hrm, just realized we don't usually have CDX files on a per-machine + basis. Need to call this with a shard string that gets parsed from the + WARC field of the full CDX for the whole crawl? Oh well. Large file, but + can be shared, and filter should be fast (can optimize with grep as well). 
+ """ + print("Mapping backward from CDX 200/226 to initial urls") + counts = collections.Counter({'inserted': 0}) + m = map_db.cursor() + create_out_table(output_db) + c = output_db.cursor() + i = 0 + for raw_cdx in cdx_file: + if raw_cdx.startswith('CDX') or raw_cdx.startswith(' '): + counts['skip-cdx-raw'] += 1 + continue + cdx = parse_full_cdx_line(raw_cdx) + + if not ((cdx.status_code in ("200", "226") and cdx.mimetype in hit_mimetypes) + or (cdx.mimetype == "warc/revisit")): + counts['skip-cdx-scope'] += 1 + continue + + if cdx.mimetype == "application/octet-stream" and line.size_bytes and line.size_bytes != '-' and int(line.size_bytes) < 1000: + counts['skip-tiny-octetstream-'] += 1 + continue + + #print(time.time()) + final_row = lookup_referrer_row(m, cdx.url) + #print(time.time()) + if not final_row: + print("MISSING url: {}".format(raw_cdx.strip())) + counts['map-url-missing'] += 1 + continue + if not (final_row.status_code in ("200", "226") and final_row.mimetype in hit_mimetypes): + counts['skip-map-scope'] += 1 + continue + row = final_row + while row and row.referrer_url != None: + next_row = lookup_referrer_row(m, row.referrer_url) + if next_row: + row = next_row + else: + break + + initial_domain = urllib3.util.parse_url(row.url).host + final_domain = urllib3.util.parse_url(final_row.url).host + c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", + (row.url, None, initial_domain, final_row.breadcrumbs, final_row.url, final_domain, cdx.timestamp, final_row.status_code, cdx.sha1, final_row.mimetype, final_row.is_dedupe, True)) + #print(final_row.breadcrumbs) + i = i+1 + counts['inserted'] += 1 + if i % 2000 == 0: + print("... backward {}".format(i)) + output_db.commit() + + output_db.commit() + print("Building indices (this can be slow)...") + c.executescript(""" + CREATE INDEX IF NOT EXISTS result_initial_url on crawl_result (initial_url); + CREATE INDEX IF NOT EXISTS result_identifier on crawl_result (identifier); + """) + c.close() + m.close() + print("Backward map complete.") + print(counts) + return counts + +def backward(log_file, map_db, output_db, hit_mimetypes=FULLTEXT_MIMETYPES): + """ + This is a variant of backward_cdx that uses the log files, not CDX file + """ + print("Mapping backward from log file 200s to initial urls") + counts = collections.Counter({'inserted': 0}) + m = map_db.cursor() + create_out_table(output_db) + c = output_db.cursor() + i = 0 + for raw in log_file: + line = parse_crawl_line(raw) + if not line: + print("BAD LOG LINE: {}".format(raw.strip())) + continue + if line.url.startswith('dns:') or line.url.startswith('whois:'): + counts['skip-log-prereq'] += 1 + continue + is_dedupe = 'duplicate:digest' in line.annotations + + if not (line.status_code in ("200", "226") and line.mimetype in hit_mimetypes): + counts['skip-log-scope'] += 1 + continue + + if line.mimetype == "application/octet-stream" and int(line.size_bytes) < 1000: + counts['skip-tiny-octetstream-'] += 1 + continue + + #print(time.time()) + final_row = lookup_referrer_row(m, line.url) + #print(time.time()) + if not final_row: + print("MISSING url: {}".format(raw.strip())) + counts['map-url-missing'] += 1 + continue + if not (final_row.status_code in ("200", "226") and final_row.mimetype in hit_mimetypes): + counts['skip-map-scope'] += 1 + continue + row = final_row + while row and row.referrer_url != None: + next_row = lookup_referrer_row(m, row.referrer_url) + if next_row: + row = next_row + else: + break + + initial_domain = 
urllib3.util.parse_url(row.url).host + final_domain = urllib3.util.parse_url(final_row.url).host + # convert to IA CDX timestamp format + #final_timestamp = dateutil.parser.parse(line.timestamp).strftime("%Y%m%d%H%M%S") + final_timestamp = None + if len(line.timestamp) >= 12 and line.timestamp[4] != '-': + final_timestamp = line.timestamp[:12] + c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", + (row.url, None, initial_domain, final_row.breadcrumbs, final_row.url, final_domain, final_timestamp, final_row.status_code, line.sha1, final_row.mimetype, final_row.is_dedupe, True)) + #print(final_row.breadcrumbs) + i = i+1 + counts['inserted'] += 1 + if i % 2000 == 0: + print("... backward {}".format(i)) + output_db.commit() + + output_db.commit() + m.close() + print("Building indices (this can be slow)...") + c.executescript(""" + CREATE INDEX IF NOT EXISTS result_initial_url on crawl_result (initial_url); + CREATE INDEX IF NOT EXISTS result_identifier on crawl_result (identifier); + """) + c.close() + print("Backward map complete.") + print(counts) + return counts + +def forward(seed_id_file, map_db, output_db): + print("Mapping forwards from seedlist to terminal urls") + counts = collections.Counter({'inserted': 0}) + m = map_db.cursor() + create_out_table(output_db) + c = output_db.cursor() + + i = 0 + for raw_line in seed_id_file: + line = raw_line.split('\t') + if not line: + counts['skip-raw-line'] += 1 + continue + if len(line) == 1: + seed_url, identifier = line[0], None + elif len(line) == 2: + seed_url, identifier = line[0:2] + else: + print("WEIRD: {}".format(raw_line)) + assert len(line) <= 2 + raw_url = seed_url + seed_url = normalize_url(seed_url) + if not seed_url: + counts['skip-bad-seed-url'] += 1 + continue + if raw_url != seed_url: + counts['_normalized-seed-url'] += 1 + + # first check if entry already in output table; if so, only upsert with identifier + existing_row = list(c.execute('SELECT identifier, breadcrumbs from crawl_result WHERE initial_url=? LIMIT 1', [seed_url])) + if existing_row: + if not existing_row[0][0]: + # identifier hasn't been updated + c.execute('UPDATE crawl_result SET identifier=? WHERE initial_url=?', [identifier, seed_url]) + counts['existing-id-updated'] += 1 + continue + else: + counts['existing-complete'] += 1 + continue + + # if not, then do a "forward" lookup for the "best"/"final" terminal crawl line + # simple for redirect case (no branching); arbitrary for the fan-out case + first_row = lookup_referrer_row(m, seed_url) + if not first_row: + print("MISSING url: {}".format(raw_line.strip())) + # need to insert *something* in this case... + initial_domain = urllib3.util.parse_url(seed_url).host + c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", + (seed_url, identifier, initial_domain, None, None, None, None, None, None, None, None, False)) + counts['map-url-missing'] += 1 + continue + row = first_row + # recursively iterate down referal path + limit = 40 + while True: + limit = limit - 1 + if limit <= 0: + counts['_redirect-recursion-limit'] += 1 + break + next_rows = lookup_all_referred_rows(m, row.url) + if not next_rows: + # halt if we hit a dead end + break + + # there are going to be multiple referrer hits, need to chose among... based on status? 
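+            # heuristic: skip embed/speculative hops ('E', 'X', or 'I' in breadcrumbs) unless they yielded a PDF; otherwise keep following, ending on the last acceptable row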
+ updated = False + for potential in next_rows: + if ('E' in potential.breadcrumbs or 'X' in potential.breadcrumbs or 'I' in potential.breadcrumbs) and not 'pdf' in potential.mimetype: + # TODO: still PDF-specific + # don't consider simple embeds unless PDF + continue + row = potential + updated = True + if not updated: + break + + final_row = row + initial_domain = urllib3.util.parse_url(seed_url).host + final_domain = urllib3.util.parse_url(final_row.url).host + # TODO: would pass SHA1 here if we had it? but not stored in referrer table + # XXX: None => timestamp + c.execute("INSERT INTO crawl_result VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", + (seed_url, identifier, initial_domain, final_row.breadcrumbs, final_row.url, final_domain, None, final_row.status_code, None, final_row.mimetype, final_row.is_dedupe, False)) + #print(final_row.breadcrumbs) + i = i+1 + counts['inserted'] += 1 + if i % 2000 == 0: + print("... forward {}".format(i)) + output_db.commit() + + output_db.commit() + m.close() + print("Building indices (this can be slow)...") + c.executescript(""" + CREATE INDEX IF NOT EXISTS result_initial_url on crawl_result (initial_url); + CREATE INDEX IF NOT EXISTS result_identifier on crawl_result (identifier); + """) + c.close() + print("Forward map complete.") + print(counts) + return counts + +def everything(log_file, seed_id_file, map_db, output_db, hit_mimetypes=FULLTEXT_MIMETYPES): + referrer(open(log_file, 'r'), map_db) + bcounts = backward(open(log_file, 'r'), map_db, output_db, hit_mimetypes=hit_mimetypes) + fcounts = forward(seed_id_file, map_db, output_db) + print() + print("Everything complete!") + print(bcounts) + print(fcounts) + +def main(): + parser = argparse.ArgumentParser() + subparsers = parser.add_subparsers() + + sub_referrer = subparsers.add_parser('referrer') + sub_referrer.set_defaults(func=referrer) + sub_referrer.add_argument("log_file", + default=sys.stdin, type=argparse.FileType('rt')) + sub_referrer.add_argument("map_db_file", + type=str) + + sub_backward_cdx = subparsers.add_parser('backward_cdx') + sub_backward_cdx.set_defaults(func=backward_cdx) + sub_backward_cdx.add_argument("cdx_file", + default=sys.stdin, type=argparse.FileType('rt')) + sub_backward_cdx.add_argument("map_db_file", + type=str) + sub_backward_cdx.add_argument("output_db_file", + type=str) + + sub_backward = subparsers.add_parser('backward') + sub_backward.set_defaults(func=backward) + sub_backward.add_argument("log_file", + default=sys.stdin, type=argparse.FileType('rt')) + sub_backward.add_argument("map_db_file", + type=str) + sub_backward.add_argument("output_db_file", + type=str) + + sub_forward = subparsers.add_parser('forward') + sub_forward.set_defaults(func=forward) + sub_forward.add_argument("seed_id_file", + default=sys.stdin, type=argparse.FileType('rt')) + sub_forward.add_argument("map_db_file", + type=str) + sub_forward.add_argument("output_db_file", + type=str) + + sub_everything = subparsers.add_parser('everything') + sub_everything.set_defaults(func=everything) + sub_everything.add_argument("log_file", + type=str) + sub_everything.add_argument("seed_id_file", + default=sys.stdin, type=argparse.FileType('rt')) + sub_everything.add_argument("output_db_file", + type=str) + sub_everything.add_argument("--map_db_file", + default=":memory:", type=str) + + parser.add_argument("--html-hit", + action="store_true", + help="run in mode that considers only terminal HTML success") + + args = parser.parse_args() + if not args.__dict__.get("func"): + print("tell me what to do! 
(try --help)") + sys.exit(-1) + + if args.html_hit: + hit_mimetypes = ( + "text/html", + ) + else: + hit_mimetypes = FULLTEXT_MIMETYPES + + if args.func is referrer: + referrer(args.log_file, + sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE')) + elif args.func is backward_cdx: + backward_cdx(args.cdx_file, + sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE'), + sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE'), + hit_mimetypes=hit_mimetypes) + elif args.func is backward: + backward(args.log_file, + sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE'), + sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE'), + hit_mimetypes=hit_mimetypes) + elif args.func is forward: + forward(args.seed_id_file, + sqlite3.connect(args.map_db_file, isolation_level='EXCLUSIVE'), + sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE')) + elif args.func is everything: + everything(args.log_file, + args.seed_id_file, + sqlite3.connect(args.map_db_file), + sqlite3.connect(args.output_db_file, isolation_level='EXCLUSIVE'), + hit_mimetypes=hit_mimetypes) + else: + raise NotImplementedError + +if __name__ == '__main__': + main() + diff --git a/report_template.md b/report_template.md new file mode 100644 index 0000000..139598b --- /dev/null +++ b/report_template.md @@ -0,0 +1,108 @@ + +# Crawl QA Report + +This crawl report is auto-generated from a sqlite database file, which should be available/included. + +### Seedlist Stats + +```sql +SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT initial_domain) AS domains FROM crawl_result; +``` + +FTP seed URLs + +```sql +SELECT COUNT(*) as ftp_urls FROM crawl_result WHERE initial_url LIKE 'ftp://%'; +``` + +### Successful Hits + +```sql +SELECT COUNT(DISTINCT identifier) as identifiers, COUNT(DISTINCT initial_url) as uris, COUNT(DISTINCT final_sha1) as unique_sha1 FROM crawl_result WHERE hit=1; +``` + +De-duplication percentage (aka, fraction of hits where content had been crawled and identified previously): + +```sql +# AVG() hack! +SELECT 100. * AVG(final_was_dedupe) as percent FROM crawl_result WHERE hit=1; +``` + +Top mimetypes for successful hits (these are usually filtered to a fixed list in post-processing): + +```sql +SELECT final_mimetype, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_mimetype ORDER BY COUNT(*) DESC LIMIT 10; +``` + +Most popular breadcrumbs (a measure of how hard the crawler had to work): + +```sql +SELECT breadcrumbs, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY breadcrumbs ORDER BY COUNT(*) DESC LIMIT 10; +``` + +FTP vs. HTTP hits (200 is HTTP, 226 is FTP): + +```sql +SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=1 GROUP BY final_status_code LIMIT 10; +``` + +### Domain Summary + +Top *initial* domains: + +```sql +SELECT initial_domain, COUNT(*), 100. * COUNT(*) / (SELECT COUNT(*) FROM crawl_result) as percent FROM crawl_result GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20; +``` + +Top *successful, final* domains, where hits were found: + +```sql + +SELECT initial_domain, COUNT(*), 100. 
* COUNT(*) / (SELECT COUNT(*) FROM crawl_result WHERE hit=1) AS percent FROM crawl_result WHERE hit=1 GROUP BY initial_domain ORDER BY COUNT(*) DESC LIMIT 20; +``` + +Top *non-successful, final* domains where crawl paths terminated before a successful hit (but crawl did run): + +```sql +SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NOT NULL GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20; +``` + +Top *uncrawled, initial* domains, where the crawl didn't even attempt to run: + +```sql +SELECT initial_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code IS NULL GROUP BY initial_domain ORDER BY count(*) DESC LIMIT 20; +``` + +Top *blocked, final* domains: + +```sql +SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND (final_status_code='-61' OR final_status_code='-2') GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20; +``` + +Top *rate-limited, final* domains: + +```sql +SELECT final_domain, COUNT(*) FROM crawl_result WHERE hit=0 AND final_status_code='429' GROUP BY final_domain ORDER BY count(*) DESC LIMIT 20; +``` + +### Status Summary + +Top failure status codes: + +```sql + SELECT final_status_code, COUNT(*) FROM crawl_result WHERE hit=0 GROUP BY final_status_code ORDER BY count(*) DESC LIMIT 10; +``` + +### Example Results + +A handful of random success lines: + +```sql + SELECT identifier, initial_url, breadcrumbs, final_url, final_sha1, final_mimetype FROM crawl_result WHERE hit=1 ORDER BY random() LIMIT 10; +``` + +Handful of random non-success lines: + +```sql + SELECT identifier, initial_url, breadcrumbs, final_url, final_status_code, final_mimetype FROM crawl_result WHERE hit=0 ORDER BY random() LIMIT 25; +``` diff --git a/test.sqlite b/test.sqlite Binary files differnew file mode 100644 index 0000000..a0435e6 --- /dev/null +++ b/test.sqlite diff --git a/test.tsv b/test.tsv new file mode 100644 index 0000000..dbd9620 --- /dev/null +++ b/test.tsv @@ -0,0 +1,10 @@ +d6571a951347fb19c55a8d9d30b578e14b55be10 +f76971159f5c35b9c900eba23a757d47afd03fc9 +2fd5cc631fbac0cb6d737e357377ed482235487d +17a2bdb7ca5aff57b20bdb7b72e893fce00304a0 +d487d844f6b0403113f814cfd6669b5007a371a7 +516144dac67a47bf23c0c9fa8530e95e8093105d +35bb4705895240d3191e93601a16eb421bec850b +f16c0eb11deb87f6194df7930652432b483192bc +cfdeccc0a94df2e50316fbbd31b508eac14c9b15 +ba5d22f94bcf267a88d3f097d7b95f499c025c15 |
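As a quick sanity check, the queries in report_template.md can be run directly against the bundled test.sqlite fixture. A minimal sketch using Python's stdlib sqlite3, assuming test.sqlite contains a `crawl_result` table as created by arabesque.py:

```python
#!/usr/bin/env python3
"""Run a couple of report_template.md queries against the test fixture (sketch)."""

import sqlite3

db = sqlite3.connect("test.sqlite")

# Seedlist stats (from report_template.md)
identifiers, urls, domains = db.execute(
    "SELECT COUNT(DISTINCT identifier), COUNT(DISTINCT initial_url), "
    "COUNT(DISTINCT initial_domain) FROM crawl_result").fetchone()
print(f"identifiers={identifiers} urls={urls} domains={domains}")

# Successful hits, plus the AVG() de-duplication percentage hack from the template
hits = db.execute("SELECT COUNT(*) FROM crawl_result WHERE hit=1").fetchone()[0]
dedupe_pct = db.execute(
    "SELECT 100. * AVG(final_was_dedupe) FROM crawl_result WHERE hit=1").fetchone()[0]
print(f"hits={hits} dedupe_pct={dedupe_pct}")
```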