From d93d542adf9d26633b0f3cfa361277ca677c46f3 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 24 Nov 2021 16:01:47 -0800
Subject: codespell fixes in proposals

---
 proposals/2019_ingest.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'proposals/2019_ingest.md')

diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c649809..c05c9df 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl.
 *IngestRequest*
   - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
     backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
-    `xml` return file ingest respose; `html` and `dataset` not implemented but
+    `xml` return file ingest response; `html` and `dataset` not implemented but
     would be webcapture (wayback) and fileset (archive.org item or wayback?).
     In the future: `epub`, `video`, `git`, etc.
   - `base_url`: required, where to start crawl process
@@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes
 [unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
 efforts, zotero's bibliography extractor, etc. The "memento tracer" work is
 also similar. Many of these are even in python! It would be great to reduce
-duplicated work and maintenance. An analagous system in the wild is youtube-dl
+duplicated work and maintenance. An analogous system in the wild is youtube-dl
 for downloading video from many sources.
 
 [unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
-- 
cgit v1.2.3
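
For readers unfamiliar with the proposal being patched, the `ingest_type` rules quoted in the first hunk can be illustrated with a minimal sketch. It assumes a request arrives as a plain dict; the `IngestRequest` dataclass shape, the `parse_ingest_request` helper, and the choice of exceptions are hypothetical and are not taken from the project's code. Only the field names, the allowed values, and the `file` to `pdf` mapping come from the proposal text.

```python
from dataclasses import dataclass

# Values named in the proposal text; everything else in this sketch is assumed.
KNOWN_INGEST_TYPES = {"pdf", "xml", "html", "dataset"}
IMPLEMENTED_INGEST_TYPES = {"pdf", "xml"}  # html/dataset: "not implemented"


@dataclass
class IngestRequest:
    ingest_type: str
    base_url: str


def parse_ingest_request(raw: dict) -> IngestRequest:
    """Hypothetical helper: validate and normalize a raw request dict."""
    # Both fields are marked "required" in the proposal.
    for name in ("ingest_type", "base_url"):
        if not raw.get(name):
            raise ValueError(f"missing required field: {name}")

    ingest_type = raw["ingest_type"]
    # Backwards compatibility: `file` should be interpreted as `pdf`.
    if ingest_type == "file":
        ingest_type = "pdf"

    if ingest_type not in KNOWN_INGEST_TYPES:
        raise ValueError(f"unknown ingest_type: {ingest_type}")
    if ingest_type not in IMPLEMENTED_INGEST_TYPES:
        raise NotImplementedError(f"ingest_type not implemented: {ingest_type}")

    return IngestRequest(ingest_type=ingest_type, base_url=raw["base_url"])
```

Under these assumptions, a legacy request with `ingest_type` set to `file` is normalized to `pdf`, while `html` and `dataset` are recognized but rejected as not yet implemented.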