Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | 'trawling' proposal (in progress) | Bryan Newbold | 2022-01-27 | 1 | -0/+177 |
| | |||||
* | codespell fixes in proposals | Bryan Newbold | 2021-11-24 | 8 | -16/+16 |
| | |||||
* | sql: grobid_refs table JSON as 'JSON' not 'JSONB' | Bryan Newbold | 2021-11-04 | 1 | -2/+2 |
| | I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is smaller than 'JSONB' in postgresql at all it is worth it. | ||||
* | update grobid refs proposal | Bryan Newbold | 2021-11-04 | 1 | -10/+72 |
| | |||||
* | initial proposal for GROBID refs table and pipeline | Bryan Newbold | 2021-11-04 | 1 | -0/+63 |
| | |||||
* | sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 1 | -6/+6 |
| | |||||
* | commit SPN account changes | Bryan Newbold | 2021-10-15 | 1 | -0/+14 |
| | |||||
* | persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 1 | -2/+2 |
| | |||||
* | document passing back platform_base_url | Bryan Newbold | 2021-10-15 | 1 | -0/+1 |
| | |||||
* | filesets: iteration of implementation and docs | Bryan Newbold | 2021-10-15 | 1 | -14/+19 |
| | |||||
* | updates to fileset ingest proposal | Bryan Newbold | 2021-10-15 | 2 | -239/+337 |
| | |||||
* | fileset ingest notes | Bryan Newbold | 2021-10-15 | 1 | -3/+23 |
| | |||||
* | dataset ingest: start enumerating examples | Bryan Newbold | 2021-10-15 | 1 | -0/+34 |
| | |||||
* | initial dataset/fileset ingest proposal | Bryan Newbold | 2021-10-15 | 1 | -0/+185 |
| | |||||
* | ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 2 | -0/+167 |
| | |||||
* | crossref DB proposal, and include in SQL schema | Bryan Newbold | 2021-06-02 | 1 | -0/+86 |
| | |||||
* | update HTML ingest proposal | Bryan Newbold | 2020-12-23 | 1 | -1/+3 |
| | |||||
* | html: update proposal (docs) | Bryan Newbold | 2020-11-06 | 1 | -19/+49 |
| | |||||
* | xml: re-encode XML docs into UTF-8 for persisting | Bryan Newbold | 2020-11-03 | 1 | -1/+18 |
| | |||||
* | XML ingest proposal | Bryan Newbold | 2020-11-03 | 1 | -0/+64 |
| | |||||
* | commit WIP HTML ingest proposal | Bryan Newbold | 2020-11-03 | 1 | -0/+97 |
| | |||||
* | store no-capture URLs in terminal_url | Bryan Newbold | 2020-10-12 | 1 | -0/+36 |
| | |||||
* | seaweedfs proposal: fix typos and wording | Martin Czygan | 2020-07-01 | 1 | -9/+11 |
| | |||||
* | tweak pdf_meta SQL schema | Bryan Newbold | 2020-06-17 | 1 | -5/+5 |
| | |||||
* | tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -3/+4 |
| | |||||
* | pdf thumbnail+text+meta proposal | Bryan Newbold | 2020-06-17 | 1 | -0/+327 |
| | |||||
* | Merge branch 'martin-seaweed-s3' into 'master' | bnewbold | 2020-05-26 | 1 | -0/+424 |
|\ | notes on seaweedfs (s3 backend). See merge request webgroup/sandcrawler!28 | ||||
| * | notes on seaweedfs (s3 backend) | Martin Czygan | 2020-05-20 | 1 | -0/+424 |
| | | Notes gathered during seaweedfs setup and test runs. | ||||
* | | NSQ for job task manager/scheduler | Bryan Newbold | 2020-04-28 | 1 | -0/+79 |
|/ | |||||
* | ingest: add force_recrawl flag to skip historical wayback lookup | Bryan Newbold | 2020-03-02 | 1 | -0/+1 |
| | |||||
* | move edit_extra path to top-level | Bryan Newbold | 2020-02-18 | 1 | -2/+1 |
| | |||||
* | include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 1 | -0/+4 |
| | |||||
* | move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 1 | -15/+18 |
| | |||||
* | pdftrio JSON object as top-level in Kafka results | Bryan Newbold | 2020-02-12 | 1 | -16/+16 |
| | To be same as GROBID results | ||||
* | pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -2/+2 |
| | This is basically just a copy/paste of GROBID code, only simpler! | ||||
* | pdftrio proposal and start on schema+kafka | Bryan Newbold | 2020-02-12 | 1 | -0/+101 |
| | |||||
* | 2020q1 fulltext ingest plans | Bryan Newbold | 2020-01-29 | 1 | -0/+272 |
| | |||||
* | clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 1 | -23/+34 |
| | |||||
* | clarify pmc/pmcid pairing | Bryan Newbold | 2020-01-14 | 1 | -3/+3 |
| | |||||
* | yet more tweaks to ingest proposal | Bryan Newbold | 2020-01-02 | 1 | -3/+2 |
| | |||||
* | update ingest proposal source/link naming | Bryan Newbold | 2019-12-13 | 1 | -16/+26 |
| | |||||
* | sql schema change proposals | Bryan Newbold | 2019-12-11 | 1 | -0/+40 |
| | |||||
* | pdftotext proposal | Bryan Newbold | 2019-12-11 | 1 | -0/+123 |
| | |||||
* | update ingest proposal | Bryan Newbold | 2019-12-11 | 1 | -11/+145 |
| | |||||
* | add structure of ingest proposal | Bryan Newbold | 2019-11-13 | 1 | -0/+129 |
| | Still needs some details fleshed out | ||||