Diffstat (limited to 'proposals/2019_ingest.md')
-rw-r--r-- | proposals/2019_ingest.md | 6 |
1 file changed, 3 insertions, 3 deletions
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c649809..768784f 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -1,5 +1,5 @@
-status: work-in-progress
+status: deployed
 
 This document proposes structure and systems for ingesting (crawling) paper
 PDFs and other content as part of sandcrawler.
 
@@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl.
 *IngestRequest*
 - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
   backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
-  `xml` return file ingest respose; `html` and `dataset` not implemented but
+  `xml` return file ingest response; `html` and `dataset` not implemented but
   would be webcapture (wayback) and fileset (archive.org item or wayback?).
   In the future: `epub`, `video`, `git`, etc.
 - `base_url`: required, where to start crawl process
@@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes
 [unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
 efforts, zotero's bibliography extractor, etc. The "memento tracer" work is
 also similar. Many of these are even in python! It would be great to reduce
-duplicated work and maintenance. An analagous system in the wild is youtube-dl
+duplicated work and maintenance. An analogous system in the wild is youtube-dl
 for downloading video from many sources.
 
 [unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
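
The `IngestRequest` rules touched in the second hunk (valid `ingest_type` values, the `file` → `pdf` backwards-compatibility mapping, and `base_url` being required) could be enforced with a small validator. This is an illustrative sketch only, assuming a plain-dict request shape; the function name and error handling are invented here and are not sandcrawler's actual API.

```python
# Sketch of IngestRequest validation, based only on the fields described
# in this diff. `normalize_ingest_request` is a hypothetical helper.
VALID_INGEST_TYPES = {"pdf", "xml", "html", "dataset"}

def normalize_ingest_request(request: dict) -> dict:
    ingest_type = request.get("ingest_type")
    # Backwards compatibility: interpret legacy `file` as `pdf`
    if ingest_type == "file":
        ingest_type = "pdf"
    if ingest_type not in VALID_INGEST_TYPES:
        raise ValueError(f"unsupported ingest_type: {ingest_type!r}")
    if not request.get("base_url"):
        raise ValueError("base_url is required")
    return {"ingest_type": ingest_type, "base_url": request["base_url"]}
```

Per the diff, `html` and `dataset` pass type validation here but would still return "not implemented" further down the pipeline until webcapture and fileset ingest land.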