index: sandcrawler
path: root/python/sandcrawler/ia.py
Commit message | Author | Age | Files | Lines
ingest: treat CDX lookup error as a wayback-error | Bryan Newbold | 2020-02-24 | 1 | -1/+4
fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10
allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26
ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4
ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46
cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2
filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4
X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3
wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13
attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5
handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4
decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2
improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 1 | -5/+11
hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 1 | -2/+11
wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9
fix WaybackError exception formating | Bryan Newbold | 2020-01-28 | 1 | -1/+1
fix elif syntax error | Bryan Newbold | 2020-01-28 | 1 | -1/+1
clarify petabox fetch behavior | Bryan Newbold | 2020-01-28 | 1 | -3/+6
wayback: replay redirects have X-Archive-Redirect-Reason | Bryan Newbold | 2020-01-21 | 1 | -2/+4
handle UnicodeDecodeError in the other GET instance | Bryan Newbold | 2020-01-15 | 1 | -0/+2
increase SPNv2 polling timeout to 4 minutes | Bryan Newbold | 2020-01-15 | 1 | -1/+3
make failed replay fetch an error, not assert error | Bryan Newbold | 2020-01-15 | 1 | -1/+2
wayback replay: catch UnicodeDecodeError | Bryan Newbold | 2020-01-15 | 1 | -0/+2
pass through revisit_cdx | Bryan Newbold | 2020-01-15 | 1 | -5/+18
fix revisit resolution | Bryan Newbold | 2020-01-15 | 1 | -4/+12
SPNv2 doesn't support FTP; add a live test for non-revist FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+10
basic FTP ingest support; revist record resolution | Bryan Newbold | 2020-01-14 | 1 | -34/+77
better print() output | Bryan Newbold | 2020-01-10 | 1 | -3/+3
fix redirect replay fetch method | Bryan Newbold | 2020-01-10 | 1 | -1/+4
handle SPNv2-then-CDX lookup failures | Bryan Newbold | 2020-01-10 | 1 | -6/+23
SPNv2 hack specifically for elsevier lookups | Bryan Newbold | 2020-01-10 | 1 | -0/+15
add support for redirect lookups from replay | Bryan Newbold | 2020-01-10 | 1 | -9/+69
more general ingest teaks and affordances | Bryan Newbold | 2020-01-10 | 1 | -5/+18
add sleep-and-retry workaround for CDX after SPNv2 | Bryan Newbold | 2020-01-10 | 1 | -1/+9
more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+1
disable CDX best lookup 'collapse'; leave comment | Bryan Newbold | 2020-01-10 | 1 | -1/+3
hack: reverse sort of CDX exact seems broken with SPNv2 results | Bryan Newbold | 2020-01-10 | 1 | -1/+1
wayback: datetime mismatch as an error | Bryan Newbold | 2020-01-09 | 1 | -1/+2
lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -39/+123
location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -1/+1
fix http/https issue with GlobalWayback library | Bryan Newbold | 2020-01-09 | 1 | -1/+2
wayback fetch via replay; confirm hashes in crawl_resource() | Bryan Newbold | 2020-01-09 | 1 | -5/+40
wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -19/+23
more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 1 | -38/+152
refactor CdxApiClient, add tests | Bryan Newbold | 2020-01-08 | 1 | -40/+130
refactor SavePaperNowClient and add test | Bryan Newbold | 2020-01-07 | 1 | -28/+154
remove SPNv1 code paths | Bryan Newbold | 2020-01-07 | 1 | -35/+1
handle SPNv1 redirect loop | Bryan Newbold | 2019-11-14 | 1 | -0/+2
handle SPNv2 polling timeout | Bryan Newbold | 2019-11-14 | 1 | -6/+10
status_forcelist is on session, not request | Bryan Newbold | 2019-11-13 | 1 | -2/+2