...
* enqueue PLATFORM PDFs for crawl (Bryan Newbold, 2022-01-07; 1 file, +23/-0)
* document progress on re-GROBID-ing (Bryan Newbold, 2022-01-05; 1 file, +89/-0)
* filesets: handle weird figshare link-only case better (Bryan Newbold, 2021-12-16; 1 file, +4/-1)
* lint ('not in') (Bryan Newbold, 2021-12-15; 1 file, +2/-2)
* lint: ignore unused 'sentry_client' (Bryan Newbold, 2021-12-15; 1 file, +1/-1)
* fix type with --enable-sentry (Bryan Newbold, 2021-12-15; 1 file, +1/-1)
* ingest tool: allow enabling sentry (for exception debugging) (Bryan Newbold, 2021-12-15; 1 file, +13/-0)
* more fileset ingest tweaks (Bryan Newbold, 2021-12-15; 2 files, +7/-0)
* fileset ingest: more requests timeouts, sessions (Bryan Newbold, 2021-12-15; 3 files, +68/-37)
* fileset ingest: create tmp subdirectories if needed (Bryan Newbold, 2021-12-15; 1 file, +5/-0)
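Creating a tmp subdirectory "if needed" is typically just `os.makedirs` with `exist_ok=True`, which is idempotent across worker restarts. A minimal sketch; the helper name and directory names are illustrative, not the actual sandcrawler code:

```python
import os
import tempfile

def ensure_tmp_subdir(base_dir: str, subdir: str) -> str:
    """Create a working subdirectory if it does not already exist."""
    path = os.path.join(base_dir, subdir)
    # exist_ok=True: no error if another worker already created it
    os.makedirs(path, exist_ok=True)
    return path

# second call is a harmless no-op rather than a FileExistsError
work_dir = ensure_tmp_subdir(tempfile.gettempdir(), "fileset-ingest")
```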
* fileset ingest: configure IA session from env (Bryan Newbold, 2021-12-15; 1 file, +6/-1)
  Note that this doesn't currently work for `upload()`, and as a work-around I created `~/.config/ia.ini` manually on the worker VM.
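Configuring an `internetarchive` session from the environment amounts to assembling the same keys that `ia.ini` holds (its `[s3]` section) into a config dict. A sketch under assumptions: the environment variable names here are hypothetical, and the resulting dict would be passed to the `internetarchive` library's session setup; the commit notes the env route did not cover `upload()`, hence the manual `~/.config/ia.ini` fallback:

```python
import os

def ia_config_from_env() -> dict:
    """Build an internetarchive-style config dict from env vars.

    IA_ACCESS_KEY / IA_SECRET_KEY are hypothetical variable names,
    not necessarily what sandcrawler uses.
    """
    access = os.environ.get("IA_ACCESS_KEY")
    secret = os.environ.get("IA_SECRET_KEY")
    if not (access and secret):
        return {}  # fall back to ~/.config/ia.ini on disk
    # mirrors the [s3] section layout of an ia.ini file
    return {"s3": {"access": access, "secret": secret}}
```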
* pipenv: add pymupdf; update trafilatura (Bryan Newbold, 2021-12-15; 2 files, +644/-420)
* fileset ingest: actually use spn2 CLI flag (Bryan Newbold, 2021-12-11; 2 files, +4/-3)
* notes on re-GROBID-ing (and re-extracting) some files (Bryan Newbold, 2021-12-09; 1 file, +289/-0)
* grobid: set a maximum file size (256 MByte) (Bryan Newbold, 2021-12-07; 1 file, +8/-0)
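A maximum file size for GROBID submission is just a guard clause before the HTTP call: oversized blobs get a recordable status instead of being posted. A minimal sketch, assuming the 256 MByte figure from the commit message; the function and status names are illustrative:

```python
MAX_GROBID_BLOB_SIZE = 256 * 1024 * 1024  # 256 MByte cap, per the commit message

def blob_size_status(size_bytes: int) -> str:
    """Map a blob size to an ingest status (status string is hypothetical)."""
    if size_bytes > MAX_GROBID_BLOB_SIZE:
        # skip GROBID entirely; record why in sandcrawler-db instead
        return "blob-too-large"
    return "ok"
```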
* worker: add kafka_group_suffix option (Bryan Newbold, 2021-12-07; 1 file, +19/-3)
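A consumer group suffix lets a second fleet of workers (e.g. a backfill run) consume the same Kafka topic independently, because committed offsets are tracked per group id. The mechanism itself is trivial string composition; the group names below are hypothetical examples, not sandcrawler's actual group ids:

```python
def kafka_consumer_group(base: str, suffix: str = "") -> str:
    """Compose a Kafka consumer group id with an optional suffix.

    A distinct group id means distinct committed offsets, so a suffixed
    backfill fleet does not steal partitions from the main workers.
    """
    return base + suffix
```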
* ingest tool: allow configuration of GROBID endpoint (Bryan Newbold, 2021-12-07; 1 file, +7/-0)
* 2021-12-02 database table size stats (Bryan Newbold, 2021-12-07; 1 file, +22/-0)
* sandcrawler SQL dump and upload updates (Bryan Newbold, 2021-12-07; 1 file, +12/-4)
* update fatcat_file SQL table schema, and add backfill notes (Bryan Newbold, 2021-12-07; 1 file, +3/-1)
* update fatcat_file SQL table schema, and add backfill notes (Bryan Newbold, 2021-12-01; 1 file, +13/-0)
* commit old patch crawl notes (Bryan Newbold, 2021-12-01; 1 file, +488/-0)
* Revert "pipenv: update deps" (Bryan Newbold, 2021-12-01; 2 files, +382/-574)
  This reverts commit 7a5b203dbb37958a452eb1be3bd1bf8ed94cbbce. There is a problem with `internetarchive` 2.2.0, so reverting for now.
* pipenv: update deps (Bryan Newbold, 2021-12-01; 2 files, +574/-382)
* add CDX sha1hex lookup/fetch helper script (Bryan Newbold, 2021-11-30; 1 file, +170/-0)
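"sha1hex" in sandcrawler-adjacent tooling refers to the lower-case hex SHA-1 of the file's bytes, the identifier used for file lookups. The digest itself is one stdlib call; the helper name is illustrative:

```python
import hashlib

def sha1hex(blob: bytes) -> str:
    """Lower-case hex SHA-1 of a blob, as used as a file-lookup key."""
    return hashlib.sha1(blob).hexdigest()

# the well-known SHA-1 of the empty byte string:
# da39a3ee5e6b4b0d3255bfef95601890afd80709
```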
* sandcrawler SQL stats (Bryan Newbold, 2021-11-27; 2 files, +425/-12)
* codespell typos in README and original RFC (Bryan Newbold, 2021-11-24; 2 files, +2/-2)
* codespell typos in python (comments) (Bryan Newbold, 2021-11-24; 5 files, +5/-5)
* html_meta: actual typo in code (CSS selector) caught by codespell (Bryan Newbold, 2021-11-24; 1 file, +1/-1)
* codespell fixes in proposals (Bryan Newbold, 2021-11-24; 8 files, +16/-16)
* ingest tool: new backfill mode (Bryan Newbold, 2021-11-16; 1 file, +76/-0)
* make fmt (Bryan Newbold, 2021-11-16; 1 file, +1/-1)
* SPNv2: make 'resources' optional (Bryan Newbold, 2021-11-16; 1 file, +1/-1)
  This was always present previously. A change was made to the SPNv2 API recently that borked it a bit; in theory it should still be present on new captures, but I'm not seeing it for some, so pushing this workaround. It seems like we don't actually use this field anyway, at least for the ingest pipeline.
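Making a previously-required API field optional usually means swapping direct key access for a defensive `.get()` with a default. A minimal sketch, assuming the SPNv2 status response is a parsed JSON dict; the helper name is illustrative:

```python
def spn2_resources(api_response: dict) -> list:
    """Return the SPNv2 'resources' list, tolerating its absence.

    `or []` also normalizes an explicit JSON null to an empty list.
    """
    return api_response.get("resources") or []
```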
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db (Bryan Newbold, 2021-11-12; 1 file, +5/-1)
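Recording XML parse errors rather than crashing on them comes down to catching the parser's exception and turning it into a status row. A sketch using the stdlib parser; sandcrawler's actual parsing code and status names may differ:

```python
import xml.etree.ElementTree as ET

def parse_tei(tei_xml: str) -> dict:
    """Parse GROBID TEI-XML, downgrading parse failures to a recordable status.

    The "bad-grobid-xml" status string is hypothetical.
    """
    try:
        root = ET.fromstring(tei_xml)
    except ET.ParseError as e:
        # persist the failure to sandcrawler-db instead of raising
        return {"status": "bad-grobid-xml", "error_msg": str(e)[:500]}
    return {"status": "success", "root_tag": root.tag}
```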
* ingest_file: more efficient GROBID metadata copy (Bryan Newbold, 2021-11-12; 1 file, +3/-3)
* wrap up crossref refs backfill notes (Bryan Newbold, 2021-11-10; 1 file, +47/-0)
* grobid_tool: helper to process a single file (Bryan Newbold, 2021-11-10; 1 file, +15/-0)
* ingest: start re-processing GROBID with newer version (Bryan Newbold, 2021-11-10; 1 file, +6/-2)
* simple persist worker/tool to backfill grobid_refs (Bryan Newbold, 2021-11-10; 2 files, +62/-0)
* grobid: extract more metadata in document TEI-XML (Bryan Newbold, 2021-11-10; 1 file, +5/-0)
* grobid: update 'TODO' comment based on review (Bryan Newbold, 2021-11-04; 1 file, +0/-3)
* update crossref/grobid refs generation notes (Bryan Newbold, 2021-11-04; 1 file, +96/-4)
* crossref grobid refs: another error case (ReadTimeout) (Bryan Newbold, 2021-11-04; 2 files, +11/-5)
  With this last exception handled, I was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML).
* db (postgrest): actually use an HTTP session (Bryan Newbold, 2021-11-04; 1 file, +24/-12)
  Not as important with GET as POST, I think, but still best practice.
* grobid: use requests session (Bryan Newbold, 2021-11-04; 1 file, +4/-3)
  This should fix an embarrassing bug with exhausting local ports:

      requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
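The "[Errno 99] Cannot assign requested address" failure happens because each bare `requests.post()` opens a fresh TCP connection, and under high call volume the closed connections pile up in TIME_WAIT until the ephemeral port range is exhausted. A shared `requests.Session` pools and reuses connections instead. A minimal sketch of the pattern (the endpoint path is taken from the logged error above; the class and method names are illustrative, not sandcrawler's actual client):

```python
import requests

class GrobidClient:
    """Sketch: one long-lived Session per client instead of per-call connects."""

    def __init__(self, host_url: str):
        self.host_url = host_url
        # the Session keeps connections alive and reuses them,
        # so repeated calls don't burn a new ephemeral port each time
        self.session = requests.Session()

    def process_citation_list(self, citations: list) -> requests.Response:
        return self.session.post(
            self.host_url + "/api/processCitationList",
            data={"citations": citations},
            timeout=(5.0, 60.0),  # (connect, read) timeouts
        )
```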
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors (Bryan Newbold, 2021-11-04; 2 files, +33/-5)
* grobid: handle weird whitespace unstructured from crossref (Bryan Newbold, 2021-11-04; 1 file, +10/-1)
  See also: https://github.com/kermitt2/grobid/issues/849
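Crossref "unstructured" reference strings can arrive with embedded newlines and runs of spaces that confuse GROBID's citation parser (the linked issue). Normalizing them is a one-liner: `str.split()` with no argument splits on any whitespace run, so rejoining with single spaces collapses everything. A sketch; the helper name is illustrative:

```python
def clean_unstructured(raw: str) -> str:
    """Collapse newlines, tabs, and repeated spaces in a crossref
    'unstructured' reference string before sending it to GROBID."""
    return " ".join(raw.split())
```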
* crossref persist: batch size depends on whether parsing refs (Bryan Newbold, 2021-11-04; 2 files, +8/-2)
* sql: grobid_refs table JSON as 'JSON' not 'JSONB' (Bryan Newbold, 2021-11-04; 2 files, +3/-3)
  I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is at all smaller than 'JSONB' in PostgreSQL, it is worth it.
* grobid refs backfill progress (Bryan Newbold, 2021-11-04; 1 file, +43/-1)