author Bryan Newbold <bnewbold@archive.org> 2020-01-15 13:54:02 -0800
committer Bryan Newbold <bnewbold@archive.org> 2020-01-15 13:54:02 -0800
commit d06fd45e3c86cb080ad7724f3fc7575750a9cd69
tree 12862548aa14870af3c710076a6df8441f2ddb4e /proposals/2019_ingest.md
parent 4d0224f3e73315ef4db39643e6d4851e4a466658
clarify ingest result schema and semantics
Diffstat (limited to 'proposals/2019_ingest.md')
-rw-r--r-- proposals/2019_ingest.md 57
1 files changed, 34 insertions, 23 deletions
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index 751532a..0b569b0 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -112,29 +112,40 @@ HTML? Or both? Let's just recrawl.
- ...
*FileIngestResult*
- - request (object): the full IngestRequest, copied
- - terminal
- - url
- - status_code
- - wayback (XXX: ?)
- - datetime
- - archive_url
- - file_meta (same schema as sandcrawler-db table)
- - size_bytes
- - md5
- - sha1
- - sha256
- - mimetype
- - cdx (same schema as sandcrawler-db table)
- - grobid (same schema as sandcrawler-db table)
- - status
- - grobid_version
- - status_code
- - xml_url
- - fatcat_release (via biblio-glutton match)
- - metadata (JSON)
- - status (slug): 'success', 'error', etc
- - hit (boolean): whether we got something that looks like what was requested
+ - `request` (object): the full IngestRequest, copied
+ - `status` (slug): 'success', 'error', etc
+ - `hit` (boolean): whether we got something that looks like what was requested
+ - `terminal` (object): last crawled resource (if any)
+ - `terminal_url` (string; formerly `url`)
+ - `terminal_dt` (string): wayback capture datetime
+ - `terminal_status_code`
+ - `terminal_sha1hex`: should match true `file_meta` SHA1 (not necessarily CDX SHA1)
+ (in case of transport encoding difference)
+ - `file_meta` (object): info about the terminal file
+ - same schema as sandcrawler-db table
+ - `size_bytes`
+ - `md5hex`
+ - `sha1hex`
+ - `sha256hex`
+ - `mimetype`: if not known, `application/octet-stream`
+ - `cdx`: CDX record matching terminal resource. *MAY* be a revisit or partial
+ record (eg, if via SPNv2)
+ - same schema as sandcrawler-db table
+ - `revisit_cdx` (optional): if `cdx` is a revisit record, this will be the
+ best "original" location for retrieval of the body (matching `file_meta`)
+ - same schema as sandcrawler-db table
+ - `grobid`
+ - same schema as sandcrawler-db table
+ - `status` (string)
+ - `status_code` (int)
+ - `grobid_version` (string, from metadata)
+ - `fatcat_release` (string, from metadata)
+ - `metadata` (JSON) (with `grobid_version` and `fatcat_release` removed)
+ - NOT `tei_xml` (strip from reply)
+ - NOT `file_meta` (strip from reply)
+
+In general, it is the `terminal_dt` and `terminal_url` that should be used to
+construct wayback links (eg, for insertion to fatcat), not the fields of the `cdx` record.
## New SQL Tables
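
The revised *FileIngestResult* in the hunk above makes two points that a consumer could get wrong: wayback links come from `terminal_dt` and `terminal_url`, and the `grobid` reply lifts `grobid_version` and `fatcat_release` out of `metadata` while dropping `tei_xml` and `file_meta`. The sketch below is only an illustration of that reading, not code from sandcrawler. The function names, the nesting of the `terminal_*` fields under `terminal`, and the `https://web.archive.org/web/<dt>/<url>` link form are assumptions.

```python
from typing import Any, Dict, Optional

# Assumption: links point at the public Wayback Machine replay endpoint.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_link(result: Dict[str, Any]) -> Optional[str]:
    """Build a wayback link from the `terminal` object, ignoring `cdx`.

    Hypothetical helper: assumes `terminal_dt` and `terminal_url` are
    nested under `terminal`, as the schema list suggests.
    """
    terminal = result.get("terminal") or {}
    dt = terminal.get("terminal_dt")
    url = terminal.get("terminal_url")
    if not dt or not url:
        # No terminal resource was recorded, so there is nothing to link to.
        return None
    return f"{WAYBACK_PREFIX}/{dt}/{url}"


def grobid_reply(row: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a grobid row into the reply form described in the schema.

    Hypothetical helper: `row` stands in for a sandcrawler-db grobid record
    whose `metadata` JSON carries `grobid_version` and `fatcat_release`.
    """
    # Copy so the caller's row is not modified.
    metadata = dict(row.get("metadata") or {})
    return {
        "status": row.get("status"),
        "status_code": row.get("status_code"),
        # Lifted out of `metadata` and removed from it.
        "grobid_version": metadata.pop("grobid_version", None),
        "fatcat_release": metadata.pop("fatcat_release", None),
        "metadata": metadata,
        # `tei_xml` and `file_meta` are deliberately not copied into the reply.
    }
```

Returning `None` from `wayback_link` when either terminal field is missing keeps callers from falling back to `cdx`, which may be a revisit or partial record and so is not a reliable source for the link.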