Dataset Ingest Pipeline
=======================

Sandcrawler currently has ingest support for individual files saved as `file`
entities in fatcat (xml and pdf ingest types) and for HTML files with
sub-components saved as `webcapture` entities in fatcat (html ingest type).

This document describes extensions to this ingest system to flexibly support
groups of files, which may be represented in fatcat as `fileset` entities. The
new ingest type is `dataset`.

Compared to the existing ingest process, there are two major complications with
datasets:

- the ingest process often requires more than parsing HTML files, and will be
  specific to individual platforms and host software packages
- the storage backend and fatcat entity type are flexible: a dataset might be
  represented by a single file, multiple files combined into a single .zip
  file, or multiple separate files; the data may get archived in wayback or in
  an archive.org item

The new concepts of "strategy" and "platform" are introduced to accommodate
these complications.

## Ingest Strategies

The ingest strategy describes the fatcat entity type that will be output; the
storage backend used; and whether an enclosing file format is used. The
strategy to use cannot be determined until the number and size of files are
known: it is a function of file count, total file size, and platform.

Strategy names are compact strings with the format
`{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset`
entity type indicates that metadata about multiple files is retained, but that
only a single enclosing file (eg, `.zip`) will be stored in the storage
backend.

The supported strategies are:

- `web-file`: single file of any type, stored in wayback, represented as fatcat `file`
- `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset`
- `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset`
- `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file`
- `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset`
- `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset`
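
As a rough sketch (the class name and module layout are illustrative, not a
reference to existing code), these strategy names could be encoded in Python
as an enum:

    from enum import Enum

    class IngestStrategy(str, Enum):
        # single file, stored in wayback, fatcat `file` entity
        WebFile = "web-file"
        # multiple files, stored in wayback, fatcat `fileset` entity
        WebFileset = "web-fileset"
        # single bundle file, stored in wayback, fatcat `fileset` entity
        WebFilesetBundled = "web-fileset-bundled"
        # single file, stored in an archive.org item, fatcat `file` entity
        ArchiveorgFile = "archiveorg-file"
        # multiple files, stored in an archive.org item, fatcat `fileset` entity
        ArchiveorgFileset = "archiveorg-fileset"
        # single bundle file, stored in an archive.org item, fatcat `fileset` entity
        ArchiveorgFilesetBundled = "archiveorg-fileset-bundled"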

"Bundle" files are things like .zip or .tar.gz. Not all .zip files are handled
as bundles! Only when the transfer from the hosting platform is via a "download
all as .zip" (or similar) do we consider a zipfile a "bundle" and index the
interior files as a fileset.

The term "bundle file" is used instead of "archive file" or "container file" to
prevent confusion with the other uses of those terms in the context of fatcat
(container entities; archives; the Internet Archive as an organization).

The motivation for supporting both `web` and `archiveorg` is that `web` is
somewhat simpler for small files, but `archiveorg` is better for larger groups
of files (say more than 20) and larger total size (say more than 1 GByte total,
or 128 MByte for any one file).

The motivation for supporting "bundled" filesets is that there is only a single
file to archive.
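
A minimal sketch of the strategy-selection logic implied above, using the
`IngestStrategy` sketch from earlier; the exact thresholds and the `is_bundle`
flag are illustrative assumptions, not settled design:

    def choose_ingest_strategy(file_count: int, total_size: int,
                               largest_file: int, is_bundle: bool) -> IngestStrategy:
        """Pick a strategy from file count and sizes; thresholds are illustrative."""
        GB = 1024 ** 3
        MB = 1024 ** 2
        # large or numerous files go to archive.org items instead of wayback
        use_archiveorg = (
            file_count > 20
            or total_size > 1 * GB
            or largest_file > 128 * MB
        )
        if file_count == 1 and not is_bundle:
            return IngestStrategy.ArchiveorgFile if use_archiveorg else IngestStrategy.WebFile
        if is_bundle:
            return IngestStrategy.ArchiveorgFilesetBundled if use_archiveorg else IngestStrategy.WebFilesetBundled
        return IngestStrategy.ArchiveorgFileset if use_archiveorg else IngestStrategy.WebFileset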


## Ingest Pseudocode

1. Determine `platform`, which may involve resolving redirects and crawling a landing page.

   a. TODO: do we always try crawling `base_url`? would simplify code flow, but results in extra SPN calls (slow). start with yes, always
   b. TODO: what if we trivially crawl directly to a non-HTML file? Bypass most of the below? `direct-file` strategy?
   c. `infer_platform(request, terminal_url, html_biblio)`

2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`.

3. Use strategy-specific methods to archive all files in the platform manifest, and verify manifest metadata.

4. Summarize status and return structured result metadata.

Python APIs, as abstract classes (TODO):

    PlatformDatasetContext
        platform_name
        platform_domain
        platform_id
        manifest
        archiveorg_metadata
        web_base_url
    DatasetPlatformHelper
        match_request(request: Request, resource: Resource, html_biblio: Optional[BiblioMetadata]) -> bool
        process_request(?) -> ?
    StrategyArchiver
        process(manifest, archiveorg_metadata, web_metadata) -> ?
        check_existing(?) -> ?
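
A minimal sketch of what these abstract classes might look like; the field
types, defaults, and return types are illustrative assumptions (the existing
`Request`, `Resource`, and `BiblioMetadata` types are left untyped here to
keep the sketch self-contained):

    from dataclasses import dataclass
    from typing import Any, Dict, List, Optional

    @dataclass
    class PlatformDatasetContext:
        platform_name: str
        platform_domain: str
        platform_id: str
        manifest: List[Dict[str, Any]]            # per-file metadata entries
        archiveorg_metadata: Optional[Dict[str, Any]] = None
        web_base_url: Optional[str] = None

    class DatasetPlatformHelper:
        def match_request(self, request, resource, html_biblio) -> bool:
            """Does this platform helper apply to the given request / landing page?"""
            raise NotImplementedError

        def process_request(self, request, resource, html_biblio) -> PlatformDatasetContext:
            """Fetch platform-specific manifest metadata for the dataset."""
            raise NotImplementedError

    class StrategyArchiver:
        def process(self, manifest, archiveorg_metadata, web_metadata) -> Dict[str, Any]:
            """Archive all files in the manifest; return structured result metadata."""
            raise NotImplementedError

        def check_existing(self, manifest) -> bool:
            """Check whether this content has already been archived."""
            raise NotImplementedError

Platform detection (step 1c above) would then be a first-match dispatch over
registered `DatasetPlatformHelper` instances.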


## New Sandcrawler Code and Worker

    sandcrawler-ingest-fileset-worker@{1..12}

This worker consumes from the ingest request topic, produces to the fileset
ingest results topic, and optionally produces to the file ingest results topic.

    sandcrawler-persist-ingest-fileset-worker@1

This worker simply writes fileset ingest result rows into SQL.

## New Fatcat Worker and Code Changes

    fatcat-import-ingest-fileset-worker

This importer should be modeled on the existing file and web import workers.
It filters for `success` results with an ingest strategy matching `*-fileset*`.
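
A hedged sketch of that filter; field names follow the ingest result schema in
this proposal, and the surrounding importer structure is omitted:

    import fnmatch

    def want_fileset_row(row: dict) -> bool:
        """Keep only successful fileset-style ingest results for import."""
        if row.get("status") != "success":
            return False
        strategy = row.get("ingest_strategy") or ""
        return fnmatch.fnmatch(strategy, "*-fileset*")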

The existing `fatcat-import-ingest-file-worker` should be updated to allow
`dataset` single-file imports, with largely the same behavior and semantics as
the current importer.

TODO: Existing fatcat transforms, and possibly even elasticsearch schemas,
should be updated to include fileset status and an `in_ia` flag for
dataset-type releases.

TODO: The existing entity-updates worker should submit `dataset`-type ingest
requests to the ingest request topic.


## New SQL Tables

    CREATE TABLE IF NOT EXISTS ingest_fileset_result (
        ingest_type             TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
        base_url                TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
        updated                 TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        hit                     BOOLEAN NOT NULL,
        status                  TEXT CHECK (octet_length(status) >= 1),

        terminal_url            TEXT CHECK (octet_length(terminal_url) >= 1),
        terminal_dt             TEXT CHECK (octet_length(terminal_dt) = 14),
        terminal_status_code    INT,
        terminal_sha1hex        TEXT CHECK (octet_length(terminal_sha1hex) = 40),

        platform                TEXT CHECK (octet_length(platform) >= 1),
        platform_domain         TEXT CHECK (octet_length(platform_domain) >= 1),
        platform_id             TEXT CHECK (octet_length(platform_id) >= 1),
        ingest_strategy         TEXT CHECK (octet_length(ingest_strategy) >= 1),
        total_size              BIGINT,
        file_count              INT,
        item_name               TEXT CHECK (octet_length(item_name) >= 1),
        item_bundle_path        TEXT CHECK (octet_length(item_bundle_path) >= 1),

        manifest                JSONB,
        -- list, similar to fatcat fileset manifest, plus extra per-file fields:
        --   status (str)
        --   path (str)
        --   size (int)
        --   md5 (str)
        --   sha1 (str)
        --   sha256 (str)
        --   mimetype (str)
        --   platform_url (str)
        --   terminal_url (str)
        --   terminal_dt (str)
        --   extra (dict) (?)

        PRIMARY KEY (ingest_type, base_url)
    );
    CREATE INDEX ingest_fileset_result_terminal_url_idx ON ingest_fileset_result(terminal_url);
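
For the persist worker, a minimal sketch of the corresponding upsert; the
column subset, conflict handling, and DSN are illustrative, not a final
implementation:

    import json
    import psycopg2

    UPSERT_SQL = """
        INSERT INTO ingest_fileset_result
            (ingest_type, base_url, hit, status, platform, platform_domain,
             platform_id, ingest_strategy, total_size, file_count, manifest)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        ON CONFLICT (ingest_type, base_url) DO UPDATE
            SET updated = now(),
                hit = EXCLUDED.hit,
                status = EXCLUDED.status,
                ingest_strategy = EXCLUDED.ingest_strategy,
                manifest = EXCLUDED.manifest
    """

    def persist_fileset_result(conn, result: dict) -> None:
        """Write one fileset ingest result row into Postgres."""
        manifest = result.get("manifest")
        with conn.cursor() as cur:
            cur.execute(UPSERT_SQL, (
                result["ingest_type"],
                result["base_url"],
                result["hit"],
                result.get("status"),
                result.get("platform"),
                result.get("platform_domain"),
                result.get("platform_id"),
                result.get("ingest_strategy"),
                result.get("total_size"),
                result.get("file_count"),
                json.dumps(manifest) if manifest is not None else None,
            ))
        conn.commit()

    # usage (DSN is a placeholder):
    #   conn = psycopg2.connect("dbname=sandcrawler")
    #   persist_fileset_result(conn, result)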


## New Kafka Topic and JSON Schema

    sandcrawler-ENV.ingest-fileset-results    6x partitions, no retention limit
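
The JSON schema is not pinned down here yet. As an illustration only, a result
message could mirror the SQL columns above (all values below are placeholders):

    # illustrative only; field names mirror the ingest_fileset_result SQL columns
    example_fileset_result = {
        "ingest_type": "dataset",
        "base_url": "https://dataverse.example.edu/dataset.xhtml?persistentId=doi:10.1234/EXAMPLE",
        "hit": True,
        "status": "success",
        "platform": "dataverse",
        "platform_domain": "dataverse.example.edu",
        "platform_id": "doi:10.1234/EXAMPLE",
        "ingest_strategy": "archiveorg-fileset",
        "file_count": 3,
        "total_size": 123456789,
        "manifest": [
            {"path": "data.csv", "size": 123, "md5": "...", "sha1": "...",
             "mimetype": "text/csv", "status": "success"},
        ],
    }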


## Implementation Plan

First, implement the ingest worker, including platform and strategy helpers,
and test them as simple stdin/stdout CLI tools in the sandcrawler repo to
validate this proposal.

Second, implement the fatcat importer and test it locally and/or in QA.

Lastly, implement infrastructure, automation, and other "glue".


## Example Entities

### ArchiveOrg: CAT dataset

<https://archive.org/details/CAT_DATASET>

`release_36vy7s5gtba67fmyxlmijpsaui`

### ArchiveOrg: academictorrents item

<https://archive.org/details/academictorrents_70e0794e2292fc051a13f05ea6f5b6c16f3d3635>

doi:10.1371/journal.pone.0120448

Single .rar file

### Dataverse

<https://dataverse.rsu.lv/dataset.xhtml?persistentId=doi:10.48510/FK2/IJO02B>

Single Excel file

### Dataverse

<https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CLSFKX&version=1.1>

doi:10.7910/DVN/CLSFKX

Multiple files; multiple versions?

API fetch: <https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/CLSFKX&version=1.1>

    .data.id
    .data.latestVersion.datasetPersistentId
    .data.latestVersion.versionNumber, .versionMinorNumber
    .data.latestVersion.files[]
        .dataFile
            .contentType (mimetype)
            .filename
            .filesize (int, bytes)
            .md5
            .persistentId
            .description
        .label (filename?)
        .version

Single file inside: <https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/CLSFKX/XWEHBB>

Download single file: <https://dataverse.harvard.edu/api/access/datafile/:persistentId/?persistentId=doi:10.7910/DVN/CLSFKX/XWEHBB> (redirects to AWS S3)
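
A minimal sketch of a Dataverse platform helper fetching this metadata and
flattening it into a file manifest; the manifest field names follow the SQL
schema above, and hash/size verification is omitted:

    import requests

    def fetch_dataverse_manifest(base_url: str, persistent_id: str, version: str) -> list:
        """Fetch dataset metadata from a Dataverse instance and build a simple file manifest."""
        resp = requests.get(
            f"{base_url}/api/datasets/:persistentId/",
            params={"persistentId": persistent_id, "version": version},
            timeout=60,
        )
        resp.raise_for_status()
        dataset = resp.json()["data"]
        manifest = []
        for entry in dataset["latestVersion"]["files"]:
            datafile = entry["dataFile"]
            # file-level persistentIds are optional, per-instance (see refs below)
            file_pid = datafile.get("persistentId")
            manifest.append({
                "path": entry.get("label") or datafile.get("filename"),
                "size": datafile.get("filesize"),
                "md5": datafile.get("md5"),
                "mimetype": datafile.get("contentType"),
                # per-file download endpoint (redirects to storage, eg AWS S3)
                "platform_url": (
                    f"{base_url}/api/access/datafile/:persistentId/?persistentId={file_pid}"
                    if file_pid else None
                ),
            })
        return manifest

    # eg: fetch_dataverse_manifest("https://dataverse.harvard.edu", "doi:10.7910/DVN/CLSFKX", "1.1")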

Dataverse refs:

- 'doi' and 'hdl' are the two persistentId styles
- file-level persistentIds are optional, on a per-instance basis:
  <https://guides.dataverse.org/en/latest/installation/config.html#filepidsenabled>