path: root/proposals
Commit message [Author, Age, Files, Lines]
* proposals: update status; include some brainstorm-only docs [Bryan Newbold, 2023-01-02, 10, -25/+62]
|
* move top-level RFC to proposals dir [Bryan Newbold, 2022-12-23, 1, -0/+180]
|
* 'trawling' proposal (in progress) [Bryan Newbold, 2022-01-27, 1, -0/+177]
|
* codespell fixes in proposals [Bryan Newbold, 2021-11-24, 8, -16/+16]
|
* sql: grobid_refs table JSON as 'JSON' not 'JSONB' [Bryan Newbold, 2021-11-04, 1, -2/+2]
|     I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is smaller than 'JSONB' in postgresql at all it is worth it.
* update grobid refs proposal [Bryan Newbold, 2021-11-04, 1, -10/+72]
|
* initial proposal for GROBID refs table and pipeline [Bryan Newbold, 2021-11-04, 1, -0/+63]
|
* sql: fixes to ingest_fileset_platform schema (from table creation) [Bryan Newbold, 2021-11-01, 1, -6/+6]
|
* commit SPN account changes [Bryan Newbold, 2021-10-15, 1, -0/+14]
|
* persist support for ingest platform table, using existing persist worker [Bryan Newbold, 2021-10-15, 1, -2/+2]
|
* document passing back platform_base_url [Bryan Newbold, 2021-10-15, 1, -0/+1]
|
* filesets: iteration of implementation and docs [Bryan Newbold, 2021-10-15, 1, -14/+19]
|
* updates to fileset ingest proposal [Bryan Newbold, 2021-10-15, 2, -239/+337]
|
* fileset ingest notes [Bryan Newbold, 2021-10-15, 1, -3/+23]
|
* dataset ingest: start enumerating examples [Bryan Newbold, 2021-10-15, 1, -0/+34]
|
* initial dataset/fileset ingest proposal [Bryan Newbold, 2021-10-15, 1, -0/+185]
|
* ingest: basic 'component' and 'src' support [Bryan Newbold, 2021-10-04, 2, -0/+167]
|
* crossref DB proposal, and include in SQL schema [Bryan Newbold, 2021-06-02, 1, -0/+86]
|
* update HTML ingest proposal [Bryan Newbold, 2020-12-23, 1, -1/+3]
|
* html: update proposal (docs) [Bryan Newbold, 2020-11-06, 1, -19/+49]
|
* xml: re-encode XML docs into UTF-8 for persisting [Bryan Newbold, 2020-11-03, 1, -1/+18]
|
* XML ingest proposal [Bryan Newbold, 2020-11-03, 1, -0/+64]
|
* commit WIP HTML ingest proposal [Bryan Newbold, 2020-11-03, 1, -0/+97]
|
* store no-capture URLs in terminal_url [Bryan Newbold, 2020-10-12, 1, -0/+36]
|
* seaweedfs proposal: fix typos and wording [Martin Czygan, 2020-07-01, 1, -9/+11]
|
* tweak pdf_meta SQL schema [Bryan Newbold, 2020-06-17, 1, -5/+5]
|
* tweak kafka topic names and seaweedfs layout [Bryan Newbold, 2020-06-17, 1, -3/+4]
|
* pdf thumbnail+text+meta proposal [Bryan Newbold, 2020-06-17, 1, -0/+327]
|
* Merge branch 'martin-seaweed-s3' into 'master' [bnewbold, 2020-05-26, 1, -0/+424]
|\
| |     notes on seaweedfs (s3 backend)
| |     See merge request webgroup/sandcrawler!28
| * notes on seaweedfs (s3 backend) [Martin Czygan, 2020-05-20, 1, -0/+424]
| |     Notes gathered during seaweedfs setup and test runs.
* | NSQ for job task manager/scheduler [Bryan Newbold, 2020-04-28, 1, -0/+79]
|/
* ingest: add force_recrawl flag to skip historical wayback lookup [Bryan Newbold, 2020-03-02, 1, -0/+1]
|
* move edit_extra path to top-level [Bryan Newbold, 2020-02-18, 1, -2/+1]
|
* include rel and oa_status in ingest request 'extra' [Bryan Newbold, 2020-02-18, 1, -0/+4]
|
* move pdf_trio results back under key in JSON/Kafka [Bryan Newbold, 2020-02-13, 1, -15/+18]
|
* pdftrio JSON object as top-level in Kafka results [Bryan Newbold, 2020-02-12, 1, -16/+16]
|     To be same as GROBID results
* pdftrio basic python code [Bryan Newbold, 2020-02-12, 1, -2/+2]
|     This is basically just a copy/paste of GROBID code, only simpler!
* pdftrio proposal and start on schema+kafka [Bryan Newbold, 2020-02-12, 1, -0/+101]
|
* 2020q1 fulltext ingest plans [Bryan Newbold, 2020-01-29, 1, -0/+272]
|
* clarify ingest result schema and semantics [Bryan Newbold, 2020-01-15, 1, -23/+34]
|
* clarify pmc/pmcid pairing [Bryan Newbold, 2020-01-14, 1, -3/+3]
|
* yet more tweaks to ingest proposal [Bryan Newbold, 2020-01-02, 1, -3/+2]
|
* update ingest proposal source/link naming [Bryan Newbold, 2019-12-13, 1, -16/+26]
|
* sql schema change proposals [Bryan Newbold, 2019-12-11, 1, -0/+40]
|
* pdftotext proposal [Bryan Newbold, 2019-12-11, 1, -0/+123]
|
* update ingest proposal [Bryan Newbold, 2019-12-11, 1, -11/+145]
|
* add structure of ingest proposal [Bryan Newbold, 2019-11-13, 1, -0/+129]
      Still needs some details flushed out