index : sandcrawler
path: root/python/sandcrawler/__init__.py
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
| add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -2/+2 |
| initial work on PDF extraction worker | Bryan Newbold | 2020-06-16 | 1 | -1/+1 |
| rename KafkaGrobidSink -> KafkaCompressSink | Bryan Newbold | 2020-06-16 | 1 | -1/+1 |
| url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+1 |
| persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -1/+1 |
| pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -1/+2 |
| small fixups to SandcrawlerPostgrestClient | Bryan Newbold | 2020-01-14 | 1 | -0/+1 |
| more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 1 | -1/+1 |
| fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -0/+1 |
| have SPN client differentiate between SPN and remote errors | Bryan Newbold | 2019-11-13 | 1 | -1/+1 |
| rename FileIngestWorker | Bryan Newbold | 2019-11-13 | 1 | -0/+1 |
| lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -1/+4 |
| re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+1 |
| start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+3 |