index : sandcrawler
path: root/python/sandcrawler/html_ingest.py
Commit message | Author | Age | Files | Lines
DOAJ and HTML ingest tweaks from QA run | Bryan Newbold | 2020-11-10 | 1 | -2/+2
html: handle more traf error cases | Bryan Newbold | 2020-11-08 | 1 | -2/+2
html: most small platform tweaks | Bryan Newbold | 2020-11-08 | 1 | -5/+4
move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -19/+1
html: more robust ingest; better platform and scope detection | Bryan Newbold | 2020-11-08 | 1 | -32/+96
html: small ingest improvements | Bryan Newbold | 2020-11-08 | 1 | -0/+4
html: start improving scope detection | Bryan Newbold | 2020-11-08 | 1 | -4/+48
gen_file_metadata: allow empty/null bodies (if flag set) | Bryan Newbold | 2020-11-08 | 1 | -1/+1
    This is for HTML sub-resources, which can validly be empty (I think)
html: missing fetch is wayback-content-error, not wayback-error | Bryan Newbold | 2020-11-08 | 1 | -2/+2
html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -8/+5
html: refactors/tweaks from testing | Bryan Newbold | 2020-11-06 | 1 | -12/+18
initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -7/+22
html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -13/+16
move transfer encoding helper to sandcrawler/ia.py | Bryan Newbold | 2020-11-03 | 1 | -22/+16
html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -3/+3
html: work around firstmonday DOCTYPE issue | Bryan Newbold | 2020-10-30 | 1 | -0/+3
html: more ingest improvements | Bryan Newbold | 2020-10-30 | 1 | -18/+118
html ingest: improve data flow | Bryan Newbold | 2020-10-29 | 1 | -18/+41
better default CLI output (show usage) | Bryan Newbold | 2020-10-29 | 1 | -1/+1
html: initial ingest implementation | Bryan Newbold | 2020-10-29 | 1 | -0/+193