diff options
author | Bryan Newbold <bnewbold@archive.org> | 2020-03-10 22:40:00 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-03-10 23:01:20 -0700 |
commit | 8837977d2892beac6cf412f58dafcdbf06f323ac (patch) | |
tree | 40aef4358308348b4ef17d6913946711828b0eec /python/sandcrawler/__init__.py | |
parent | e7ba648fce4b8359358c6661b6ecb34576efc70d (diff) | |
download | sandcrawler-8837977d2892beac6cf412f58dafcdbf06f323ac.tar.gz sandcrawler-8837977d2892beac6cf412f58dafcdbf06f323ac.zip |
url cleaning (canonicalization) for ingest base_url
As mentioned in comment, this first version does not re-write the URL in
the `base_url` field. If we did so, then ingest_request rows would not
SQL JOIN to ingest_file_result rows, which we wouldn't want.
In the future, behaviour should maybe be to refuse to process URLs that
aren't clean (eg, if base_url != clean_url(base_url)) and return a
'bad-url' status or soemthing. Then we would only accept clean URLs in
both tables, and clear out all old/bad URLs with a cleanup script.
Diffstat (limited to 'python/sandcrawler/__init__.py')
-rw-r--r-- | python/sandcrawler/__init__.py | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/python/sandcrawler/__init__.py b/python/sandcrawler/__init__.py index 3d49096..492b558 100644 --- a/python/sandcrawler/__init__.py +++ b/python/sandcrawler/__init__.py @@ -1,7 +1,7 @@ from .grobid import GrobidClient, GrobidWorker, GrobidBlobWorker from .pdftrio import PdfTrioClient, PdfTrioWorker, PdfTrioBlobWorker -from .misc import gen_file_metadata, b32_hex, parse_cdx_line, parse_cdx_datetime +from .misc import gen_file_metadata, b32_hex, parse_cdx_line, parse_cdx_datetime, clean_url from .workers import KafkaSink, KafkaGrobidSink, JsonLinePusher, CdxLinePusher, CdxLinePusher, KafkaJsonPusher, BlackholeSink, ZipfilePusher, MultiprocessWrapper from .ia import WaybackClient, WaybackError, CdxApiClient, CdxApiError, SavePageNowClient, SavePageNowError, PetaboxError, ResourceResult, WarcResource, CdxPartial, CdxRow from .ingest import IngestFileWorker |