path: root/proposals
Commit message  [Author, Age, Files, Lines]
* dataset ingest: start enumerating examples  [Bryan Newbold, 2021-10-15, 1, -0/+34]
|
* initial dataset/fileset ingest proposal  [Bryan Newbold, 2021-10-15, 1, -0/+185]
|
* ingest: basic 'component' and 'src' support  [Bryan Newbold, 2021-10-04, 2, -0/+167]
|
* crossref DB proposal, and include in SQL schema  [Bryan Newbold, 2021-06-02, 1, -0/+86]
|
* update HTML ingest proposal  [Bryan Newbold, 2020-12-23, 1, -1/+3]
|
* html: update proposal (docs)  [Bryan Newbold, 2020-11-06, 1, -19/+49]
|
* xml: re-encode XML docs into UTF-8 for persisting  [Bryan Newbold, 2020-11-03, 1, -1/+18]
|
* XML ingest proposal  [Bryan Newbold, 2020-11-03, 1, -0/+64]
|
* commit WIP HTML ingest proposal  [Bryan Newbold, 2020-11-03, 1, -0/+97]
|
* store no-capture URLs in terminal_url  [Bryan Newbold, 2020-10-12, 1, -0/+36]
|
* seaweedfs proposal: fix typos and wording  [Martin Czygan, 2020-07-01, 1, -9/+11]
|
* tweak pdf_meta SQL schema  [Bryan Newbold, 2020-06-17, 1, -5/+5]
|
* tweak kafka topic names and seaweedfs layout  [Bryan Newbold, 2020-06-17, 1, -3/+4]
|
* pdf thumbnail+text+meta proposal  [Bryan Newbold, 2020-06-17, 1, -0/+327]
|
* Merge branch 'martin-seaweed-s3' into 'master'  [bnewbold, 2020-05-26, 1, -0/+424]
|\
| |     notes on seaweedfs (s3 backend)
| |
| |     See merge request webgroup/sandcrawler!28
| |
| * notes on seaweedfs (s3 backend)  [Martin Czygan, 2020-05-20, 1, -0/+424]
| |
| |     Notes gathered during seaweedfs setup and test runs.
| |
* | NSQ for job task manager/scheduler  [Bryan Newbold, 2020-04-28, 1, -0/+79]
|/
* ingest: add force_recrawl flag to skip historical wayback lookup  [Bryan Newbold, 2020-03-02, 1, -0/+1]
|
* move edit_extra path to top-level  [Bryan Newbold, 2020-02-18, 1, -2/+1]
|
* include rel and oa_status in ingest request 'extra'  [Bryan Newbold, 2020-02-18, 1, -0/+4]
|
* move pdf_trio results back under key in JSON/Kafka  [Bryan Newbold, 2020-02-13, 1, -15/+18]
|
* pdftrio JSON object as top-level in Kafka results  [Bryan Newbold, 2020-02-12, 1, -16/+16]
|
|     To be same as GROBID results
|
* pdftrio basic python code  [Bryan Newbold, 2020-02-12, 1, -2/+2]
|
|     This is basically just a copy/paste of GROBID code, only simpler!
|
* pdftrio proposal and start on schema+kafka  [Bryan Newbold, 2020-02-12, 1, -0/+101]
|
* 2020q1 fulltext ingest plans  [Bryan Newbold, 2020-01-29, 1, -0/+272]
|
* clarify ingest result schema and semantics  [Bryan Newbold, 2020-01-15, 1, -23/+34]
|
* clarify pmc/pmcid pairing  [Bryan Newbold, 2020-01-14, 1, -3/+3]
|
* yet more tweaks to ingest proposal  [Bryan Newbold, 2020-01-02, 1, -3/+2]
|
* update ingest proposal source/link naming  [Bryan Newbold, 2019-12-13, 1, -16/+26]
|
* sql schema change proposals  [Bryan Newbold, 2019-12-11, 1, -0/+40]
|
* pdftotext proposal  [Bryan Newbold, 2019-12-11, 1, -0/+123]
|
* update ingest proposal  [Bryan Newbold, 2019-12-11, 1, -11/+145]
|
* add structure of ingest proposal  [Bryan Newbold, 2019-11-13, 1, -0/+129]

      Still needs some details flushed out