index : sandcrawler
path: root/python/sandcrawler/__init__.py
Commit message                                               Author         Date        Files  Lines
ingest spn2: fix tests                                       Bryan Newbold  2022-05-05  1      -0/+1
make fmt (black 21.9b0)                                      Bryan Newbold  2021-10-27  1      -10/+42
make fmt                                                     Bryan Newbold  2021-10-26  1      -8/+10
python: isort all imports                                    Bryan Newbold  2021-10-26  1      -8/+11
local-file version of gen_file_metadata                      Bryan Newbold  2021-10-15  1      -1/+1
refactoring; progress on filesets                            Bryan Newbold  2021-10-15  1      -1/+2
differential wayback-error from wayback-content-error        Bryan Newbold  2020-10-21  1      -1/+1
lint fixes                                                   Bryan Newbold  2020-06-17  1      -1/+1
add new pdf workers/persisters                               Bryan Newbold  2020-06-17  1      -2/+2
initial work on PDF extraction worker                        Bryan Newbold  2020-06-16  1      -1/+1
rename KafkaGrobidSink -> KafkaCompressSink                  Bryan Newbold  2020-06-16  1      -1/+1
url cleaning (canonicalization) for ingest base_url          Bryan Newbold  2020-03-10  1      -1/+1
persist: ingest_request tool (with no ingest_file_result)    Bryan Newbold  2020-03-05  1      -1/+1
pdftrio basic python code                                    Bryan Newbold  2020-02-12  1      -1/+2
small fixups to SandcrawlerPostgrestClient                   Bryan Newbold  2020-01-14  1      -0/+1
more wayback and SPN tests and fixes                         Bryan Newbold  2020-01-09  1      -1/+1
fix sandcrawler persist workers                              Bryan Newbold  2020-01-02  1      -0/+1
have SPN client differentiate between SPN and remote errors  Bryan Newbold  2019-11-13  1      -1/+1
rename FileIngestWorker                                      Bryan Newbold  2019-11-13  1      -0/+1
lots of grobid tool implementation (still WIP)               Bryan Newbold  2019-09-26  1      -1/+4
re-write parse_cdx_line for sandcrawler lib                  Bryan Newbold  2019-09-25  1      -1/+1
start refactoring sandcrawler python common code             Bryan Newbold  2019-09-23  1      -0/+3