index : sandcrawler
path: root/python/sandcrawler
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57 |
| more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 8 | -49/+80 |
| ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12 |
| ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| set CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6 |
| bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65 |
| start adding python type annotations to db and persist code | Bryan Newbold | 2021-10-26 | 2 | -97/+124 |
| flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 7 | -24/+22 |
| start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 15 | -97/+57 |
| make fmt | Bryan Newbold | 2021-10-26 | 19 | -571/+741 |
| ingest_html: update trafilatura TEI-XML output kwarg | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| python: isort all imports | Bryan Newbold | 2021-10-26 | 18 | -99/+108 |
| more small fileset ingest tweaks | Bryan Newbold | 2021-10-26 | 2 | -6/+21 |
| persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 2 | -2/+129 |
| improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 3 | -5/+24 |
| more fileset iteration | Bryan Newbold | 2021-10-15 | 4 | -45/+80 |
| move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 3 | -52/+31 |
| filesets: iteration of implementation and docs | Bryan Newbold | 2021-10-15 | 4 | -82/+148 |
| fileset ingest: improve platform parsing | Bryan Newbold | 2021-10-15 | 1 | -12/+196 |
| fileset ingest: improve error handling | Bryan Newbold | 2021-10-15 | 4 | -48/+106 |
| initial implementation of zenodo platform import | Bryan Newbold | 2021-10-15 | 1 | -0/+100 |
| initial figshare platform helper | Bryan Newbold | 2021-10-15 | 1 | -0/+95 |
| improvements to platform helpers | Bryan Newbold | 2021-10-15 | 3 | -34/+44 |
| component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 2 | -13/+31 |
| progress on web ingest strategy | Bryan Newbold | 2021-10-15 | 3 | -12/+121 |
| fileset ingest progress for dataverse | Bryan Newbold | 2021-10-15 | 4 | -23/+291 |
| local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 2 | -2/+43 |
| progress on dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -122/+333 |
| wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 3 | -5/+3 |
| progress on fileset/dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -0/+403 |
| refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 2 | -1/+7 |
| rename some python files for clarity | Bryan Newbold | 2021-10-15 | 2 | -0/+0 |
| pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8 |
| spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2 |
| ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 2 | -20/+84 |
| html ingest: report dt with broken CDX records | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| allow through unknown-scope HTML ingests, for possible SPN import | Bryan Newbold | 2021-10-01 | 1 | -11/+5 |
| html: fix logging of broken CDX URL | Bryan Newbold | 2021-10-01 | 1 | -1/+1 |
| ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| fix typo with spn_cdx_retry_sec arg | Bryan Newbold | 2021-09-30 | 1 | -1/+1 |
| tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 2 | -3/+5 |
| yet another bad PDF sha1 | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9 |
| ingest: more block patterns, for huge databases | Bryan Newbold | 2021-09-08 | 1 | -1/+4 |
| yet more PDF sha1 to skip | Bryan Newbold | 2021-09-03 | 1 | -0/+5 |
| yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48 |
| ingest: check URL blocklist again after redirects | Bryan Newbold | 2021-09-03 | 1 | -0/+7 |
| refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 1 | -6/+25 |
| HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87 |