index : sandcrawler
path: root/proposals
    Age         Author         Files  Lines      Commit message
*   2021-11-04  Bryan Newbold  1      -0/+63     initial proposal for GROBID refs table and pipeline
*   2021-11-01  Bryan Newbold  1      -6/+6      sql: fixes to ingest_fileset_platform schema (from table creation)
*   2021-10-15  Bryan Newbold  1      -0/+14     commit SPN account changes
*   2021-10-15  Bryan Newbold  1      -2/+2      persist support for ingest platform table, using existing persist worker
*   2021-10-15  Bryan Newbold  1      -0/+1      document passing back platform_base_url
*   2021-10-15  Bryan Newbold  1      -14/+19    filesets: iteration of implementation and docs
*   2021-10-15  Bryan Newbold  2      -239/+337  updates to fileset ingest proposal
*   2021-10-15  Bryan Newbold  1      -3/+23     fileset ingest notes
*   2021-10-15  Bryan Newbold  1      -0/+34     dataset ingest: start enumerating examples
*   2021-10-15  Bryan Newbold  1      -0/+185    initial dataset/fileset ingest proposal
*   2021-10-04  Bryan Newbold  2      -0/+167    ingest: basic 'component' and 'src' support
*   2021-06-02  Bryan Newbold  1      -0/+86     crossref DB proposal, and include in SQL schema
*   2020-12-23  Bryan Newbold  1      -1/+3      update HTML ingest proposal
*   2020-11-06  Bryan Newbold  1      -19/+49    html: update proposal (docs)
*   2020-11-03  Bryan Newbold  1      -1/+18     xml: re-encode XML docs into UTF-8 for persisting
*   2020-11-03  Bryan Newbold  1      -0/+64     XML ingest proposal
*   2020-11-03  Bryan Newbold  1      -0/+97     commit WIP HTML ingest proposal
*   2020-10-12  Bryan Newbold  1      -0/+36     store no-capture URLs in terminal_url
*   2020-07-01  Martin Czygan  1      -9/+11     seaweedfs proposal: fix typos and wording
*   2020-06-17  Bryan Newbold  1      -5/+5      tweak pdf_meta SQL schema
*   2020-06-17  Bryan Newbold  1      -3/+4      tweak kafka topic names and seaweedfs layout
*   2020-06-17  Bryan Newbold  1      -0/+327    pdf thumbnail+text+meta proposal
*   2020-05-26  bnewbold       1      -0/+424    Merge branch 'martin-seaweed-s3' into 'master'
|\
| * 2020-05-20  Martin Czygan  1      -0/+424    notes on seaweedfs (s3 backend)
* | 2020-04-28  Bryan Newbold  1      -0/+79     NSQ for job task manager/scheduler
|/
*   2020-03-02  Bryan Newbold  1      -0/+1      ingest: add force_recrawl flag to skip historical wayback lookup
*   2020-02-18  Bryan Newbold  1      -2/+1      move edit_extra path to top-level
*   2020-02-18  Bryan Newbold  1      -0/+4      include rel and oa_status in ingest request 'extra'
*   2020-02-13  Bryan Newbold  1      -15/+18    move pdf_trio results back under key in JSON/Kafka
*   2020-02-12  Bryan Newbold  1      -16/+16    pdftrio JSON object as top-level in Kafka results
*   2020-02-12  Bryan Newbold  1      -2/+2      pdftrio basic python code
*   2020-02-12  Bryan Newbold  1      -0/+101    pdftrio proposal and start on schema+kafka
*   2020-01-29  Bryan Newbold  1      -0/+272    2020q1 fulltext ingest plans
*   2020-01-15  Bryan Newbold  1      -23/+34    clarify ingest result schema and semantics
*   2020-01-14  Bryan Newbold  1      -3/+3      clarify pmc/pmcid pairing
*   2020-01-02  Bryan Newbold  1      -3/+2      yet more tweaks to ingest proposal
*   2019-12-13  Bryan Newbold  1      -16/+26    update ingest proposal source/link naming
*   2019-12-11  Bryan Newbold  1      -0/+40     sql schema change proposals
*   2019-12-11  Bryan Newbold  1      -0/+123    pdftotext proposal
*   2019-12-11  Bryan Newbold  1      -11/+145   update ingest proposal
*   2019-11-13  Bryan Newbold  1      -0/+129    add structure of ingest proposal