author    | Bryan Newbold <bnewbold@archive.org> | 2021-11-24 16:01:47 -0800
committer | Bryan Newbold <bnewbold@archive.org> | 2021-11-24 16:01:51 -0800
commit    | d93d542adf9d26633b0f3cfa361277ca677c46f3 (patch)
tree      | c133d3030746afe25300a2e12a7645407a89b623 /proposals
parent    | b4ca684c83d77a9fc6e7844ea8c45dfcb72aacb4 (diff)
codespell fixes in proposals
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2019_ingest.md               | 4
-rw-r--r-- | proposals/20200129_pdf_ingest.md       | 8
-rw-r--r-- | proposals/20201012_no_capture.md       | 2
-rw-r--r-- | proposals/20201103_xml_ingest.md       | 2
-rw-r--r-- | proposals/2020_pdf_meta_thumbnails.md  | 2
-rw-r--r-- | proposals/2020_seaweed_s3.md           | 2
-rw-r--r-- | proposals/2021-09-09_fileset_ingest.md | 8
-rw-r--r-- | proposals/2021-10-28_grobid_refs.md    | 4
8 files changed, 16 insertions, 16 deletions
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c649809..c05c9df 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl.
 *IngestRequest*
   - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
     backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
-    `xml` return file ingest respose; `html` and `dataset` not implemented but
+    `xml` return file ingest response; `html` and `dataset` not implemented but
     would be webcapture (wayback) and fileset (archive.org item or wayback?).
     In the future: `epub`, `video`, `git`, etc.
   - `base_url`: required, where to start crawl process
@@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes
 [unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
 efforts, zotero's bibliography extractor, etc. The "memento tracer" work is
 also similar. Many of these are even in python! It would be great to reduce
-duplicated work and maintenance. An analagous system in the wild is youtube-dl
+duplicated work and maintenance. An analogous system in the wild is youtube-dl
 for downloading video from many sources.

 [unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 9469217..620ed09 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -27,7 +27,7 @@ There are a few million papers in fatacat which:
 2. are known OA, usually because publication is Gold OA
 3. don't have any fulltext PDF in fatcat

-As a detail, some of these "known OA" journals actually have embargos (aka,
+As a detail, some of these "known OA" journals actually have embargoes (aka,
 they aren't true Gold OA). In particular, those marked via EZB OA "color", and
 recent pubmed central ids.

@@ -104,7 +104,7 @@ Actions:
   update ingest result table with status.
 - fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
   into ingest request table. use SQL to dump only the *new* URLs (not seen in
-  previous dumps) using the created timestamp, outputing new bulk ingest
+  previous dumps) using the created timestamp, outputting new bulk ingest
   request lists. if possible, de-dupe between these two. then start bulk
   heritrix crawls over these two long lists. Probably sharded over several
   machines. Could also run serially (first one, then the other, with
@@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able
 to do a SQL query to select PDFs that:

 - have at least one known CDX row
-- GROBID processed successfuly and glutton matched to a fatcat release
+- GROBID processed successfully and glutton matched to a fatcat release
 - do not have an existing fatcat file (based on sha1hex)
 - output GROBID metadata, `file_meta`, and one or more CDX rows

@@ -161,7 +161,7 @@ Coding Tasks:
 Actions:

 - update `fatcat_file` sandcrawler table
-- check how many PDFs this might ammount to. both by uniq SHA1 and uniq
+- check how many PDFs this might amount to. both by uniq SHA1 and uniq
   `fatcat_release` matches
 - do some manual random QA verification to check that this method results in
   quality content in fatcat
diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md
index bb47ea2..27c14d1 100644
--- a/proposals/20201012_no_capture.md
+++ b/proposals/20201012_no_capture.md
@@ -29,7 +29,7 @@ The current status quo is to store the missing URL as the last element in the
 pipeline that would read from the Kafka feed and extract them, but this would
 be messy. Eg, re-ingesting would not update the old kafka messages, so we could
 need some accounting of consumer group offsets after which missing URLs are
-truely missing.
+truly missing.

 We could add a new `missing_url` database column and field to the JSON schema,
 for this specific use case. This seems like unnecessary extra work.
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 181cc11..25ec973 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -37,7 +37,7 @@ document. For recording in fatcat, the file metadata will be passed through.
 For storing in Kafka and blob store (for downstream analysis), we will parse
 the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
 encoding. The hash of the *original* XML file will be used as the key for
-refering to this document. This is unintuitive, but similar to what we are
+referring to this document. This is unintuitive, but similar to what we are
 doing with PDF and HTML documents (extracting in a useful format, but keeping
 the original document's hash as a key).

diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
index 793d6b5..f231a7f 100644
--- a/proposals/2020_pdf_meta_thumbnails.md
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -133,7 +133,7 @@ Deployment will involve:
 Plan for processing/catchup is:

 - test with COVID-19 PDF corpus
-- run extraction on all current fatcat files avaiable via IA
+- run extraction on all current fatcat files available via IA
 - integrate with ingest pipeline for all new files
 - run a batch catchup job over all GROBID-parsed files with no pdf meta
   extracted, on basis of SQL table query
diff --git a/proposals/2020_seaweed_s3.md b/proposals/2020_seaweed_s3.md
index 5f4ff0b..677393b 100644
--- a/proposals/2020_seaweed_s3.md
+++ b/proposals/2020_seaweed_s3.md
@@ -316,7 +316,7 @@ grows very much with the number of volumes. Therefore, keep default volume size
 and do not limit number of volumes `-volume.max 0` and do not use in-memory
 index (rather leveldb)

-Status: done, 200M object upload via Python script sucessfully in about 6 days,
+Status: done, 200M object upload via Python script successfully in about 6 days,
 memory usage was at a moderate 400M (~10% of RAM). Relatively constant
 performance at about 400 `PutObject` requests/s (over 5 threads, each thread
 was around 80 requests/s; then testing with 4 threads, each thread got to
diff --git a/proposals/2021-09-09_fileset_ingest.md b/proposals/2021-09-09_fileset_ingest.md
index b0d273e..65c9ccf 100644
--- a/proposals/2021-09-09_fileset_ingest.md
+++ b/proposals/2021-09-09_fileset_ingest.md
@@ -19,10 +19,10 @@ datasets:
   specific to individual platforms and host software packages
 - the storage backend and fatcat entity type is flexible: a dataset might be
   represented by a single file, multiple files combined in to a single .zip
-  file, or mulitple separate files; the data may get archived in wayback or in
+  file, or multiple separate files; the data may get archived in wayback or in
   an archive.org item

-The new concepts of "strategy" and "platform" are introduced to accomodate
+The new concepts of "strategy" and "platform" are introduced to accommodate
 these complications.

@@ -56,7 +56,7 @@ is via a "download all as .zip" (or similar) do we consider a zipfile a

 The term "bundle file" is used over "archive file" or "container file" to
 prevent confusion with the other use of those terms in the context of fatcat
-(container entities; archive; Internet Archive as an organiztion).
+(container entities; archive; Internet Archive as an organization).

 The motivation for supporting both `web` and `archiveorg` is that `web` is
 somewhat simpler for small files, but `archiveorg` is better for larger groups
@@ -155,7 +155,7 @@ New python APIs/classes:
   valid platform, which could be found via API or parsing, but has the wrong
   scope. Eg, tried to fetch a dataset, but got a DOI which represents all
   versions of the dataset, not a specific version.
-- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargos
+- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargoes
 - `platform-404`: got to a landing page, and seemed like in-scope, but no
   platform record found anyways
diff --git a/proposals/2021-10-28_grobid_refs.md b/proposals/2021-10-28_grobid_refs.md
index 3f87968..1fc79b6 100644
--- a/proposals/2021-10-28_grobid_refs.md
+++ b/proposals/2021-10-28_grobid_refs.md
@@ -59,11 +59,11 @@ attached reference-level key or id.

 We may want to do re-parsing of references from sources other than `crossref`,
 so there is a generic `grobid_refs` table. But it is also common to fetch both
-the crossref metadata and any re-parsed references together, so as a convience
+the crossref metadata and any re-parsed references together, so as a convenience
 there is a PostgreSQL view (virtual table) that includes both a crossref
 metadata record and parsed citations, if available. If downstream code cares a
 lot about having the refs and record be in sync, the `source_ts` field on
-`grobid_refs` can be matched againt the `indexed` column of `crossref` (or the
+`grobid_refs` can be matched against the `indexed` column of `crossref` (or the
 `.indexed.date-time` JSON field in the record itself).

 Remember that DOIs should always be lower-cased before querying, inserting,
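For context, spelling fixes like the ones in this commit are typically produced with the codespell tool named in the commit message. Below is a minimal sketch of how such a pass could be reproduced or re-checked over the proposals directory; the exact invocation used for this commit is not recorded here, so the directory argument and flags are assumptions.

    # Sketch only: assumes the `codespell` CLI is installed and on PATH, and is
    # run from the repository root. Not necessarily the command used here.
    import subprocess

    # Dry run: report suspected misspellings under proposals/ without editing
    # files. codespell exits non-zero when it finds issues, so don't treat that
    # as a failure.
    subprocess.run(["codespell", "proposals/"], check=False)

    # Apply fixes in place (writes corrections back to the markdown files).
    subprocess.run(["codespell", "--write-changes", "proposals/"], check=False)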