author     Bryan Newbold <bnewbold@archive.org>  2021-11-24 16:01:47 -0800
committer  Bryan Newbold <bnewbold@archive.org>  2021-11-24 16:01:51 -0800
commit     d93d542adf9d26633b0f3cfa361277ca677c46f3 (patch)
tree       c133d3030746afe25300a2e12a7645407a89b623
parent     b4ca684c83d77a9fc6e7844ea8c45dfcb72aacb4 (diff)
download   sandcrawler-d93d542adf9d26633b0f3cfa361277ca677c46f3.tar.gz
           sandcrawler-d93d542adf9d26633b0f3cfa361277ca677c46f3.zip
codespell fixes in proposals
-rw-r--r--  proposals/2019_ingest.md                4
-rw-r--r--  proposals/20200129_pdf_ingest.md        8
-rw-r--r--  proposals/20201012_no_capture.md        2
-rw-r--r--  proposals/20201103_xml_ingest.md        2
-rw-r--r--  proposals/2020_pdf_meta_thumbnails.md   2
-rw-r--r--  proposals/2020_seaweed_s3.md            2
-rw-r--r--  proposals/2021-09-09_fileset_ingest.md  8
-rw-r--r--  proposals/2021-10-28_grobid_refs.md     4
8 files changed, 16 insertions, 16 deletions
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c649809..c05c9df 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl.
*IngestRequest*
- `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
- `xml` return file ingest respose; `html` and `dataset` not implemented but
+ `xml` return file ingest response; `html` and `dataset` not implemented but
would be webcapture (wayback) and fileset (archive.org item or wayback?).
In the future: `epub`, `video`, `git`, etc.
- `base_url`: required, where to start crawl process
@@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes
[unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
efforts, zotero's bibliography extractor, etc. The "memento tracer" work is
also similar. Many of these are even in python! It would be great to reduce
-duplicated work and maintenance. An analagous system in the wild is youtube-dl
+duplicated work and maintenance. An analogous system in the wild is youtube-dl
for downloading video from many sources.
[unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
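
The first hunk above touches the `IngestRequest` schema description. As a point of reference, a minimal sketch of such a request, assuming JSON serialization and covering only the two fields named in the hunk (any other fields a real request carries are omitted here):

```python
import json

# Minimal ingest request, limited to the two fields named in the proposal
# hunk above; real sandcrawler requests may carry additional fields.
request = {
    "ingest_type": "pdf",  # one of: pdf, xml, html, dataset ("file" is treated as "pdf")
    "base_url": "https://example.com/article/123/fulltext.pdf",  # where to start the crawl
}

# Requests are easy to pass around as one JSON object per line.
print(json.dumps(request, sort_keys=True))
```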
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 9469217..620ed09 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -27,7 +27,7 @@ There are a few million papers in fatacat which:
2. are known OA, usually because publication is Gold OA
3. don't have any fulltext PDF in fatcat
-As a detail, some of these "known OA" journals actually have embargos (aka,
+As a detail, some of these "known OA" journals actually have embargoes (aka,
they aren't true Gold OA). In particular, those marked via EZB OA "color", and
recent pubmed central ids.
@@ -104,7 +104,7 @@ Actions:
update ingest result table with status.
- fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
into ingest request table. use SQL to dump only the *new* URLs (not seen in
- previous dumps) using the created timestamp, outputing new bulk ingest
+ previous dumps) using the created timestamp, outputting new bulk ingest
request lists. if possible, de-dupe between these two. then start bulk
heritrix crawls over these two long lists. Probably sharded over several
machines. Could also run serially (first one, then the other, with
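
The hunk above describes dumping only the *new* URLs from the ingest request table by `created` timestamp. A rough sketch of that dump, assuming a Postgres `ingest_request` table with `base_url` and `created` columns (names inferred from the proposal text, not verified against the actual schema):

```python
import psycopg2

# Placeholder cutoff; in practice this would be the timestamp of the
# previous seedlist import.
PREVIOUS_DUMP_CUTOFF = "2020-01-01"

conn = psycopg2.connect("dbname=sandcrawler")
with conn, conn.cursor() as cur:
    # DISTINCT also handles de-duplication between the MAG and unpaywall
    # seedlists, if both have been persisted into the same table.
    cur.execute(
        """
        SELECT DISTINCT base_url
        FROM ingest_request
        WHERE created > %s
          AND ingest_type = 'pdf'
        """,
        (PREVIOUS_DUMP_CUTOFF,),
    )
    with open("new_bulk_ingest_urls.txt", "w") as out:
        for (base_url,) in cur:
            out.write(base_url + "\n")
```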
@@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able
to do a SQL query to select PDFs that:
- have at least one known CDX row
-- GROBID processed successfuly and glutton matched to a fatcat release
+- GROBID processed successfully and glutton matched to a fatcat release
- do not have an existing fatcat file (based on sha1hex)
- output GROBID metadata, `file_meta`, and one or more CDX rows
@@ -161,7 +161,7 @@ Coding Tasks:
Actions:
- update `fatcat_file` sandcrawler table
-- check how many PDFs this might ammount to. both by uniq SHA1 and uniq
+- check how many PDFs this might amount to. both by uniq SHA1 and uniq
`fatcat_release` matches
- do some manual random QA verification to check that this method results in
quality content in fatcat
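
The selection criteria earlier in this file (at least one CDX row, successful GROBID run with a glutton match, no existing fatcat file) could be expressed roughly as below. This is a sketch only: the `grobid`, `cdx`, and `fatcat_file` table names follow the proposal text, while the column names (`sha1hex`, `status_code`, `fatcat_release`) are assumptions.

```python
import psycopg2

SELECT_QUERY = """
    SELECT DISTINCT grobid.sha1hex, grobid.fatcat_release
    FROM grobid
    JOIN cdx ON cdx.sha1hex = grobid.sha1hex              -- at least one known CDX row
    LEFT JOIN fatcat_file ON fatcat_file.sha1hex = grobid.sha1hex
    WHERE grobid.status_code = 200                        -- GROBID processed successfully
      AND grobid.fatcat_release IS NOT NULL               -- glutton matched a release
      AND fatcat_file.sha1hex IS NULL                     -- no existing fatcat file
"""

conn = psycopg2.connect("dbname=sandcrawler")
with conn, conn.cursor() as cur:
    cur.execute(SELECT_QUERY)
    for sha1hex, fatcat_release in cur:
        print(sha1hex, fatcat_release)
```

A real dump would additionally pull the full GROBID metadata, `file_meta`, and the CDX rows themselves; the sketch only selects identifiers.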
diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md
index bb47ea2..27c14d1 100644
--- a/proposals/20201012_no_capture.md
+++ b/proposals/20201012_no_capture.md
@@ -29,7 +29,7 @@ The current status quo is to store the missing URL as the last element in the
pipeline that would read from the Kafka feed and extract them, but this would
be messy. Eg, re-ingesting would not update the old kafka messages, so we could
need some accounting of consumer group offsets after which missing URLs are
-truely missing.
+truly missing.
We could add a new `missing_url` database column and field to the JSON schema,
for this specific use case. This seems like unnecessary extra work.
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 181cc11..25ec973 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -37,7 +37,7 @@ document. For recording in fatcat, the file metadata will be passed through.
For storing in Kafka and blob store (for downstream analysis), we will parse
the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
encoding. The hash of the *original* XML file will be used as the key for
-refering to this document. This is unintuitive, but similar to what we are
+referring to this document. This is unintuitive, but similar to what we are
doing with PDF and HTML documents (extracting in a useful format, but keeping
the original document's hash as a key).
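
A small sketch of the re-encoding step described in the hunk above: parse the raw bytes, re-serialize as UTF-8, but key the result by the hash of the *original* file. The use of lxml and SHA-1 here is an assumption for illustration, not necessarily what sandcrawler itself uses.

```python
import hashlib
from lxml import etree

def reencode_xml(raw: bytes) -> tuple[str, bytes]:
    # Key is the hash of the original (pre-re-encoding) XML bytes.
    key = hashlib.sha1(raw).hexdigest()
    # Parse whatever encoding the source declared, then re-output as UTF-8.
    root = etree.fromstring(raw)
    utf8_body = etree.tostring(root, encoding="UTF-8", xml_declaration=True)
    return key, utf8_body
```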
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
index 793d6b5..f231a7f 100644
--- a/proposals/2020_pdf_meta_thumbnails.md
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -133,7 +133,7 @@ Deployment will involve:
Plan for processing/catchup is:
- test with COVID-19 PDF corpus
-- run extraction on all current fatcat files avaiable via IA
+- run extraction on all current fatcat files available via IA
- integrate with ingest pipeline for all new files
- run a batch catchup job over all GROBID-parsed files with no pdf meta
extracted, on basis of SQL table query
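
The catchup step above ("all GROBID-parsed files with no pdf meta extracted, on basis of SQL table query") might look something like the query below. Table names (`grobid`, `pdf_meta`) and columns (`sha1hex`, `status_code`) are assumptions drawn from the other proposals, not a verified schema.

```python
# Rows returned here would be fed back into the pdf metadata/thumbnail worker.
CATCHUP_QUERY = """
    SELECT grobid.sha1hex
    FROM grobid
    LEFT JOIN pdf_meta ON pdf_meta.sha1hex = grobid.sha1hex
    WHERE grobid.status_code = 200
      AND pdf_meta.sha1hex IS NULL
"""
```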
diff --git a/proposals/2020_seaweed_s3.md b/proposals/2020_seaweed_s3.md
index 5f4ff0b..677393b 100644
--- a/proposals/2020_seaweed_s3.md
+++ b/proposals/2020_seaweed_s3.md
@@ -316,7 +316,7 @@ grows very much with the number of volumes. Therefore, keep default volume size
and do not limit number of volumes `-volume.max 0` and do not use in-memory
index (rather leveldb)
-Status: done, 200M object upload via Python script sucessfully in about 6 days,
+Status: done, 200M object upload via Python script successfully in about 6 days,
memory usage was at a moderate 400M (~10% of RAM). Relatively constant
performance at about 400 `PutObject` requests/s (over 5 threads, each thread
was around 80 requests/s; then testing with 4 threads, each thread got to
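
For context on the benchmark numbers above, a threaded `PutObject` load test along these lines can be sketched with boto3 against a seaweedfs S3 endpoint. The endpoint URL, credentials, bucket name, object size, and object count below are all placeholders, not the values used in the actual test.

```python
import concurrent.futures

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8333",        # placeholder seaweedfs S3 gateway
    aws_access_key_id="ACCESS_KEY_PLACEHOLDER",
    aws_secret_access_key="SECRET_KEY_PLACEHOLDER",
)

def put_one(i: int) -> None:
    # Small synthetic objects; the real test uploaded roughly 200M objects.
    s3.put_object(Bucket="test-bucket", Key=f"obj/{i:09d}", Body=b"x" * 1024)

# Five threads, matching the thread count mentioned in the status note above.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(put_one, range(100_000)))
```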
diff --git a/proposals/2021-09-09_fileset_ingest.md b/proposals/2021-09-09_fileset_ingest.md
index b0d273e..65c9ccf 100644
--- a/proposals/2021-09-09_fileset_ingest.md
+++ b/proposals/2021-09-09_fileset_ingest.md
@@ -19,10 +19,10 @@ datasets:
specific to individual platforms and host software packages
- the storage backend and fatcat entity type is flexible: a dataset might be
represented by a single file, multiple files combined in to a single .zip
- file, or mulitple separate files; the data may get archived in wayback or in
+ file, or multiple separate files; the data may get archived in wayback or in
an archive.org item
-The new concepts of "strategy" and "platform" are introduced to accomodate
+The new concepts of "strategy" and "platform" are introduced to accommodate
these complications.
@@ -56,7 +56,7 @@ is via a "download all as .zip" (or similar) do we consider a zipfile a
The term "bundle file" is used over "archive file" or "container file" to
prevent confusion with the other use of those terms in the context of fatcat
-(container entities; archive; Internet Archive as an organiztion).
+(container entities; archive; Internet Archive as an organization).
The motivation for supporting both `web` and `archiveorg` is that `web` is
somewhat simpler for small files, but `archiveorg` is better for larger groups
@@ -155,7 +155,7 @@ New python APIs/classes:
valid platform, which could be found via API or parsing, but has the wrong
scope. Eg, tried to fetch a dataset, but got a DOI which represents all
versions of the dataset, not a specific version.
-- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargos
+- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargoes
- `platform-404`: got to a landing page, and seemed like in-scope, but no
platform record found anyways
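
One way the status strings above could map onto the new Python error classes is sketched below. The `PlatformScopeError` and `PlatformRestrictedError` names come from the proposal; the shared base class and the attribute-based mapping are assumptions for illustration.

```python
class FilesetPlatformError(Exception):
    """Base class; subclasses carry the ingest status string they map to."""
    status = "platform-error"

class PlatformScopeError(FilesetPlatformError):
    # Valid platform, but wrong scope: eg, a DOI covering all versions of a
    # dataset rather than one specific version.
    status = "platform-scope"

class PlatformRestrictedError(FilesetPlatformError):
    # Platform record found, but access is restricted (eg, embargoes).
    status = "platform-restricted"

def status_for(exc: Exception) -> str:
    """Translate a raised platform error into an ingest result status."""
    if isinstance(exc, FilesetPlatformError):
        return exc.status
    return "platform-error"
```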
diff --git a/proposals/2021-10-28_grobid_refs.md b/proposals/2021-10-28_grobid_refs.md
index 3f87968..1fc79b6 100644
--- a/proposals/2021-10-28_grobid_refs.md
+++ b/proposals/2021-10-28_grobid_refs.md
@@ -59,11 +59,11 @@ attached reference-level key or id.
We may want to do re-parsing of references from sources other than `crossref`,
so there is a generic `grobid_refs` table. But it is also common to fetch both
-the crossref metadata and any re-parsed references together, so as a convience
+the crossref metadata and any re-parsed references together, so as a convenience
there is a PostgreSQL view (virtual table) that includes both a crossref
metadata record and parsed citations, if available. If downstream code cares a
lot about having the refs and record be in sync, the `source_ts` field on
-`grobid_refs` can be matched againt the `indexed` column of `crossref` (or the
+`grobid_refs` can be matched against the `indexed` column of `crossref` (or the
`.indexed.date-time` JSON field in the record itself).
Remember that DOIs should always be lower-cased before querying, inserting,
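
The convenience view described above might be sketched as follows. The `crossref` and `grobid_refs` table names come from the proposal; the view name and column names (`doi`, `source_id`, `indexed`, `record`, `source_ts`, `refs_json`) are assumptions. Note the lower-casing on both sides of the join, per the reminder about DOI normalization.

```python
# Illustrative only; would be executed against the sandcrawler Postgres database.
CREATE_VIEW_SQL = """
    CREATE OR REPLACE VIEW crossref_with_refs AS
    SELECT
        crossref.doi,
        crossref.indexed,
        crossref.record,
        grobid_refs.source_ts,
        grobid_refs.refs_json
    FROM crossref
    LEFT JOIN grobid_refs
        ON lower(grobid_refs.source_id) = lower(crossref.doi)
"""
```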