Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-01 | 1 | -0/+13 | |
| | ||||||
* | commit old patch crawl notes | Bryan Newbold | 2021-12-01 | 1 | -0/+488 | |
| | ||||||
* | Revert "pipenv: update deps" | Bryan Newbold | 2021-12-01 | 2 | -574/+382 | |
| | | | | | | This reverts commit 7a5b203dbb37958a452eb1be3bd1bf8ed94cbbce. There is a problem with `internetarchive` 2.2.0, so reverting for now. | |||||
* | pipenv: update deps | Bryan Newbold | 2021-12-01 | 2 | -382/+574 | |
| | ||||||
* | add CDX sha1hex lookup/fetch helper script | Bryan Newbold | 2021-11-30 | 1 | -0/+170 | |
| | ||||||
* | sandcrawler SQL stats | Bryan Newbold | 2021-11-27 | 2 | -12/+425 | |
| | ||||||
* | codespell typos in README and original RFC | Bryan Newbold | 2021-11-24 | 2 | -2/+2 | |
| | ||||||
* | codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 5 | -5/+5 | |
| | ||||||
* | html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1 | |
| | ||||||
* | codespell fixes in proposals | Bryan Newbold | 2021-11-24 | 8 | -16/+16 | |
| | ||||||
* | ingest tool: new backfill mode | Bryan Newbold | 2021-11-16 | 1 | -0/+76 | |
| | ||||||
* | make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1 | |
| | ||||||
* | SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1 | |
| | | | | | | | | This was always present previously. A recent change to the SPNv2 API borked it a bit, though in theory the field should be present on new captures. I'm not seeing it for some captures, so pushing this workaround. It seems like we don't actually use this field anyway, at least in the ingest pipeline. | |||||
* | grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5 | |
| | ||||||
* | ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3 | |
| | ||||||
* | wrap up crossref refs backfill notes | Bryan Newbold | 2021-11-10 | 1 | -0/+47 | |
| | ||||||
* | grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15 | |
| | ||||||
* | ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6 | |
| | ||||||
* | simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62 | |
| | ||||||
* | grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5 | |
| | ||||||
* | grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0 | |
| | ||||||
* | update crossref/grobid refs generation notes | Bryan Newbold | 2021-11-04 | 1 | -4/+96 | |
| | ||||||
* | crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11 | |
| | | | | | With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML). | |||||
* | db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24 | |
| | | | | Not as important with GET as POST, I think, but still best practice. | |||||
* | grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4 | |
| | | | | | | This should fix an embarrassing bug with exhausting local ports: `requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))` | |||||
* | grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33 | |
| | ||||||
* | grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10 | |
| | | | | See also: https://github.com/kermitt2/grobid/issues/849 | |||||
* | crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 2 | -2/+8 | |
| | ||||||
* | sql: grobid_refs table JSON as 'JSON' not 'JSONB' | Bryan Newbold | 2021-11-04 | 2 | -3/+3 | |
| | | | | | I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is at all smaller than 'JSONB' in PostgreSQL, it is worth it. | |||||
* | grobid refs backfill progress | Bryan Newbold | 2021-11-04 | 1 | -1/+43 | |
| | ||||||
* | record SQL table sizes at start of crossref re-ingest | Bryan Newbold | 2021-11-04 | 1 | -0/+19 | |
| | ||||||
* | start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54 | |
| | ||||||
* | crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33 | |
| | ||||||
* | add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema | Bryan Newbold | 2021-11-04 | 1 | -0/+21 | |
| | ||||||
* | glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212 | |
| | ||||||
* | update grobid refs proposal | Bryan Newbold | 2021-11-04 | 1 | -10/+72 | |
| | ||||||
* | iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45 | |
| | | | | Switched to using just 'key'/'id' for downstream matching. | |||||
* | grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34 | |
| | ||||||
* | initial proposal for GROBID refs table and pipeline | Bryan Newbold | 2021-11-04 | 1 | -0/+63 | |
| | ||||||
* | initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839 | |
| | ||||||
* | pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11 | |
| | ||||||
* | pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1 | |
| | ||||||
* | workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3 | |
| | ||||||
* | IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3 | |
| | | | | | | | | I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput. | |||||
* | SPN reingest: 6 hour minimum, 6 month max | Bryan Newbold | 2021-11-03 | 1 | -2/+2 | |
| | ||||||
* | sql: fix typo in quarterly (not weekly) script | Bryan Newbold | 2021-11-03 | 1 | -1/+1 | |
| | ||||||
* | sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 2 | -12/+12 | |
| | ||||||
* | updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1 | |
| | ||||||
* | remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22 | |
| | ||||||
* | small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14 | |
| |