author Bryan Newbold <bnewbold@archive.org> 2020-03-10 22:40:00 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-03-10 23:01:20 -0700
commit 8837977d2892beac6cf412f58dafcdbf06f323ac
tree 40aef4358308348b4ef17d6913946711828b0eec /scalding/src/test/scala/example
parent e7ba648fce4b8359358c6661b6ecb34576efc70d
url cleaning (canonicalization) for ingest base_url
As mentioned in the code comment, this first version does not rewrite the URL in the `base_url` field. If it did, ingest_request rows would no longer SQL JOIN to ingest_file_result rows, which we don't want. In the future, the behaviour should maybe be to refuse to process URLs that aren't clean (e.g., if base_url != clean_url(base_url)) and return a 'bad-url' status or something similar. Then we would only accept clean URLs in both tables, and could clear out all old/bad URLs with a cleanup script.
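
A rough sketch of the possible future behaviour described above, not the code in this commit. The `clean_url` here is a hypothetical stand-in built on the standard library (the project's real helper may canonicalize differently), and `check_ingest_request` and the 'accepted' status are names invented for illustration; only `base_url` and the 'bad-url' status come from the message above.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Hypothetical canonicalizer: an assumption, not the project's helper.

    Strips surrounding whitespace, lowercases the scheme and network
    location, and turns an empty path into "/".
    """
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, parts.fragment)
    )


def check_ingest_request(request: dict) -> dict:
    """Refuse requests whose base_url is not already in clean form.

    The base_url is never rewritten, so rows keyed on it in two tables
    would still JOIN against each other.
    """
    base_url = request["base_url"]
    if base_url != clean_url(base_url):
        return {"status": "bad-url", "base_url": base_url}
    # 'accepted' is a placeholder status for this sketch only
    return {"status": "accepted", "base_url": base_url}
```

Rejecting rather than silently rewriting keeps the stored `base_url` identical in both tables, which is the JOIN concern raised in the message.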
Diffstat (limited to 'scalding/src/test/scala/example')
0 files changed, 0 insertions, 0 deletions