status: implemented Fileset Ingest Pipeline (for Datasets) ====================================== Sandcrawler currently has ingest support for individual files saved as `file` entities in fatcat (xml and pdf ingest types) and HTML files with sub-components saved as `webcapture` entities in fatcat (html ingest type). This document describes extensions to this ingest system to flexibly support groups of files, which may be represented in fatcat as `fileset` entities. The main new ingest type is `dataset`. Compared to the existing ingest process, there are two major complications with datasets: - the ingest process often requires more than parsing HTML files, and will be specific to individual platforms and host software packages - the storage backend and fatcat entity type is flexible: a dataset might be represented by a single file, multiple files combined in to a single .zip file, or mulitple separate files; the data may get archived in wayback or in an archive.org item The new concepts of "strategy" and "platform" are introduced to accomodate these complications. ## Ingest Strategies The ingest strategy describes the fatcat entity type that will be output; the storage backend used; and whether an enclosing file format is used. The strategy to use can not be determined until the number and size of files is known. It is a function of file count, total file size, and publication platform. Strategy names are compact strings with the format `{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset` entity type indicates that metadata about multiple files is retained, but that in the storage backend only a single enclosing file (eg, `.zip`) will be stored. The supported strategies are: - `web-file`: single file of any type, stored in wayback, represented as fatcat `file` - `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset` - `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset` - `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file` - `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset` - `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset` "Bundle" or "enclosing" files are things like .zip or .tar.gz. Not all .zip files are handled as bundles! Only when the transfer from the hosting platform is via a "download all as .zip" (or similar) do we consider a zipfile a "bundle" and index the interior files as a fileset. The term "bundle file" is used over "archive file" or "container file" to prevent confusion with the other use of those terms in the context of fatcat (container entities; archive; Internet Archive as an organiztion). The motivation for supporting both `web` and `archiveorg` is that `web` is somewhat simpler for small files, but `archiveorg` is better for larger groups of files (say more than 20) and larger total size (say more than 1 GByte total, or 128 MByte for any one file). The motivation for supporting "bundled" filesets is that there is only a single file to archive. ## Ingest Pseudocode 1. Determine `platform`, which may involve resolving redirects and crawling a landing page. a. currently we always crawl the ingest `base_url`, capturing a platform landing page b. we don't currently handle the case of `base_url` leading to a non-HTML terminal resource. the `component` ingest type does handle this 2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`. a. depending on platform, may include access URLs for multiple strategies (eg, URL for each file and a bundle URL), metadata about the item for, eg, archive.org item upload, etc 3. Use strategy-specific methods to archive all files in platform manifest, and verify manifest metadata. 4. Summarize status and return structured result metadata. a. if the strategy was `web-file` or `archiveorg-file`, potentially submit an `ingest_file_result` object down the file ingest pipeline (Kafka topic and later persist and fatcat import workers), with `dataset-file` ingest type (or `{ingest_type}-file` more generally). New python types: FilesetManifestFile path: str size: Optional[int] md5: Optional[str] sha1: Optional[str] sha256: Optional[str] mimetype: Optional[str] extra: Optional[Dict[str, Any]] status: Optional[str] platform_url: Optional[str] terminal_url: Optional[str] terminal_dt: Optional[str] FilesetPlatformItem platform_name: str platform_status: str platform_domain: Optional[str] platform_id: Optional[str] manifest: Optional[List[FilesetManifestFile]] archiveorg_item_name: Optional[str] archiveorg_item_meta web_base_url web_bundle_url ArchiveStrategyResult ingest_strategy: str status: str manifest: List[FilesetManifestFile] file_file_meta: Optional[dict] file_terminal: Optional[dict] file_cdx: Optional[dict] bundle_file_meta: Optional[dict] bundle_terminal: Optional[dict] bundle_cdx: Optional[dict] bundle_archiveorg_path: Optional[dict] New python APIs/classes: FilesetPlatformHelper match_request(request, resource, html_biblio) -> bool does the request and landing page metadata indicate a match for this platform? process_request(request, resource, html_biblio) -> FilesetPlatformItem do API requests, parsing, etc to fetch metadata and access URLs for this fileset/dataset. platform-specific chose_strategy(item: FilesetPlatformItem) -> IngestStrategy select an archive strategy for the given fileset/dataset FilesetIngestStrategy check_existing(item: FilesetPlatformItem) -> Optional[ArchiveStrategyResult] check the given backend for an existing capture/archive; if found, return result process(item: FilesetPlatformItem) -> ArchiveStrategyResult perform an actual archival capture ## Limits and Failure Modes - `too-large-size`: total size of the fileset is too large for archiving. initial limit is 64 GBytes, controlled by `max_total_size` parameter. - `too-many-files`: number of files (and thus file-level metadata) is too large. initial limit is 200, controlled by `max_file_count` parameter. - `platform-scope / FilesetPlatformScopeError`: for when `base_url` leads to a valid platform, which could be found via API or parsing, but has the wrong scope. Eg, tried to fetch a dataset, but got a DOI which represents all versions of the dataset, not a specific version. - `platform-restricted`/`PlatformRestrictedError`: for, eg, embargos - `platform-404`: got to a landing page, and seemed like in-scope, but no platform record found anyways ## New Sandcrawler Code and Worker sandcrawler-ingest-fileset-worker@{1..6} (or up to 1..12 later) Worker consumes from ingest request topic, produces to fileset ingest results, and optionally produces to file ingest results. sandcrawler-persist-ingest-fileset-worker@1 Simply writes fileset ingest rows to SQL. ## New Fatcat Worker and Code Changes fatcat-import-ingest-fileset-worker This importer is modeled on file and web worker. Filters for `success` with strategy of `*-fileset*`. Existing `fatcat-import-ingest-file-worker` should be updated to allow `dataset` single-file imports, with largely same behavior and semantics as current importer (`component` mode). Existing fatcat transforms, and possibly even elasticsearch schemas, should be updated to include fileset status and `in_ia` flag for dataset type releases. Existing entity updates worker submits `dataset` type ingests to ingest request topic. ## Ingest Result Schema Common with file results, and mostly relating to landing page HTML: hit: bool status: str success success-existing success-file (for `web-file` or `archiveorg-file` only) request: object terminal: object file_meta: object cdx: object revisit_cdx: object html_biblio: object Additional fileset-specific fields: manifest: list of objects platform_name: str platform_domain: str platform_id: str ingest_strategy: str archiveorg_item_name: str (optional, only for `archiveorg-*` strategies) file_count: int total_size: int fileset_bundle (optional, only for `*-fileset-bundle` strategy) file_meta cdx revisit_cdx terminal archiveorg_bundle_path fileset_file (optional, only for `*-file` strategy) file_meta terminal cdx revisit_cdx If the strategy was `web-file` or `archiveorg-file` and the status is `success-file`, then an ingest file result will also be published to `sandcrawler-ENV.ingest-file-results`, using the same ingest type and fields as regular ingest. All fileset ingest results get published to ingest-fileset-result. Existing sandcrawler persist workers also subscribe to this topic and persist status and landing page terminal info to tables just like with file ingest. GROBID, HTML, and other metadata is not persisted in this path. If the ingest strategy was a single file (`*-file`), then an ingest file is also published to the ingest-file-result topic, with the `fileset_file` metadata, and ingest type `dataset-file`. This should only happen on success condition. ## New SQL Tables Note that this table *complements* `ingest_file_result`, doesn't replace it. `ingest_file_result` could more accurately be called `ingest_result`. CREATE TABLE IF NOT EXISTS ingest_fileset_platform ( ingest_type TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1), base_url TEXT NOT NULL CHECK (octet_length(base_url) >= 1), updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL, hit BOOLEAN NOT NULL, status TEXT CHECK (octet_length(status) >= 1), platform_name TEXT NOT NULL CHECK (octet_length(platform) >= 1), platform_domain TEXT NOT NULL CHECK (octet_length(platform_domain) >= 1), platform_id TEXT NOT NULL CHECK (octet_length(platform_id) >= 1), ingest_strategy TEXT CHECK (octet_length(ingest_strategy) >= 1), total_size BIGINT, file_count INT, archiveorg_item_name TEXT CHECK (octet_length(item_name) >= 1), archiveorg_item_bundle_path TEXT CHECK (octet_length(item_path_bundle) >= 1), web_bundle_url TEXT CHECK (octet_length(terminal_url) >= 1), web_bundle_dt TEXT CHECK (octet_length(terminal_dt) = 14), manifest JSONB, -- list, similar to fatcat fileset manifest, plus extra: -- status (str) -- path (str) -- size (int) -- md5 (str) -- sha1 (str) -- sha256 (str) -- mimetype (str) -- extra (dict) -- platform_url (str) -- terminal_url (str) -- terminal_dt (str) PRIMARY KEY (ingest_type, base_url) ); CREATE INDEX ingest_fileset_platform_name_domain_id_idx ON ingest_fileset_platform(platform_name, platform_domain, platform_id); Persist worker should only insert in to this table if `platform_name`, `platform_domain`, and `platform_id` are extracted successfully. ## New Kafka Topic sandcrawler-ENV.ingest-fileset-results 6x, no retention limit ## Implementation Plan First implement ingest worker, including platform and strategy helpers, and test those as simple stdin/stdout CLI tools in sandcrawler repo to validate this proposal. Second implement fatcat importer and test locally and/or in QA. Lastly implement infrastructure, automation, and other "glue": - SQL schema - persist worker ## Design Note: Single-File Datasets Should datasets and other groups of files which only contain a single file get imported as a fatcat `file` or `fileset`? This can be broken down further as documents (single PDF) vs other individual files. Advantages of `file`: - handles case of article PDFs being marked as dataset accidentally - `file` entities get de-duplicated with simple lookup (eg, on `sha1`) - conceptually simpler if individual files are `file` entity - easier to download individual files Advantages of `fileset`: - conceptually simpler if all `dataset` entities have `fileset` form factor - code path is simpler: one fewer strategy, and less complexity of sending files down separate import path - metadata about platform is retained - would require no modification of existing fatcat file importer - fatcat import of archive.org of `file` is not actually implemented yet? Decision is to do individual files. Fatcat fileset import worker should reject single-file (and empty) manifest filesets. Fatcat file import worker should accept all mimetypes for `dataset-file` (similar to `component`). ## Example Entities See `notes/dataset_examples.txt`