author Bryan Newbold <bnewbold@archive.org> 2021-11-24 16:01:47 -0800
committer Bryan Newbold <bnewbold@archive.org> 2021-11-24 16:01:51 -0800
commit d93d542adf9d26633b0f3cfa361277ca677c46f3 (patch)
tree c133d3030746afe25300a2e12a7645407a89b623 /proposals/20200129_pdf_ingest.md
parent b4ca684c83d77a9fc6e7844ea8c45dfcb72aacb4 (diff)
codespell fixes in proposals
Diffstat (limited to 'proposals/20200129_pdf_ingest.md')
 proposals/20200129_pdf_ingest.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 9469217..620ed09 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -27,7 +27,7 @@ There are a few million papers in fatacat which:
2. are known OA, usually because publication is Gold OA
3. don't have any fulltext PDF in fatcat
-As a detail, some of these "known OA" journals actually have embargos (aka,
+As a detail, some of these "known OA" journals actually have embargoes (aka,
they aren't true Gold OA). In particular, those marked via EZB OA "color", and
recent pubmed central ids.
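
As a rough illustration of the selection these criteria imply, a minimal SQL sketch follows. Note that `release`, `is_oa`, and `file_release` here are hypothetical simplified names for illustration only, not fatcat's actual schema:

```sql
-- Illustrative sketch only: 'release', 'is_oa', and 'file_release' are
-- hypothetical simplified names, not fatcat's actual schema.
SELECT release.ident
FROM release
LEFT JOIN file_release
       ON file_release.release_ident = release.ident
WHERE release.is_oa = true                    -- known OA (e.g. Gold OA)
  AND file_release.release_ident IS NULL;     -- no fulltext PDF attached
```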
@@ -104,7 +104,7 @@ Actions:
update ingest result table with status.
- fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
into ingest request table. use SQL to dump only the *new* URLs (not seen in
- previous dumps) using the created timestamp, outputing new bulk ingest
+ previous dumps) using the created timestamp, outputting new bulk ingest
request lists. if possible, de-dupe between these two. then start bulk
heritrix crawls over these two long lists. Probably sharded over several
machines. Could also run serially (first one, then the other, with
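
A minimal sketch of the "dump only the *new* URLs" step, assuming an `ingest_request` table with `base_url`, `link_source`, and `created` columns (names inferred from the proposal's wording, not a confirmed schema); the cutoff would be the timestamp of the previous dump:

```sql
-- Assumed schema: ingest_request(base_url, link_source, created).
-- Dump URLs added since the previous seedlist dump, de-duped across
-- the MAG and unpaywall sources.
SELECT DISTINCT base_url
FROM ingest_request
WHERE link_source IN ('mag', 'unpaywall')     -- assumed source labels
  AND created > '2020-01-01'::timestamp;      -- date of previous dump
```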
@@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able
to do a SQL query to select PDFs that:
- have at least one known CDX row
-- GROBID processed successfuly and glutton matched to a fatcat release
+- GROBID processed successfully and glutton matched to a fatcat release
- do not have an existing fatcat file (based on sha1hex)
- output GROBID metadata, `file_meta`, and one or more CDX rows
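
A sketch of that query, assuming the sandcrawler tables `grobid`, `cdx`, `file_meta`, and `fatcat_file` are all keyed by `sha1hex`, and that `grobid.status` and `grobid.fatcat_release` carry the processing outcome (column names are assumptions based on the proposal's wording):

```sql
-- Anti-join sketch: select GROBID-matched PDFs not yet in fatcat.
-- Table and column names are assumptions, not a confirmed schema.
SELECT grobid.sha1hex, grobid.fatcat_release, file_meta.*, cdx.*
FROM grobid
JOIN file_meta ON file_meta.sha1hex = grobid.sha1hex
JOIN cdx       ON cdx.sha1hex       = grobid.sha1hex  -- at least one CDX row
LEFT JOIN fatcat_file ON fatcat_file.sha1hex = grobid.sha1hex
WHERE grobid.status = 'success'               -- GROBID processed successfully
  AND grobid.fatcat_release IS NOT NULL       -- glutton matched a release
  AND fatcat_file.sha1hex IS NULL;            -- no existing fatcat file
```

The `JOIN` against `cdx` intentionally yields one row per capture, matching the "one or more CDX rows" output described above.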
@@ -161,7 +161,7 @@ Coding Tasks:
Actions:
- update `fatcat_file` sandcrawler table
-- check how many PDFs this might ammount to. both by uniq SHA1 and uniq
+- check how many PDFs this might amount to. both by uniq SHA1 and uniq
`fatcat_release` matches
- do some manual random QA verification to check that this method results in
quality content in fatcat
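
The "how many PDFs this might amount to" check could look like the following, reusing the same assumed table and column names as the selection sketch above:

```sql
-- Count candidate PDFs by unique file (SHA1) and by unique matched release.
-- Same assumed schema as the selection sketch above.
SELECT
    COUNT(DISTINCT grobid.sha1hex)        AS uniq_sha1,
    COUNT(DISTINCT grobid.fatcat_release) AS uniq_fatcat_release
FROM grobid
LEFT JOIN fatcat_file ON fatcat_file.sha1hex = grobid.sha1hex
WHERE grobid.status = 'success'
  AND grobid.fatcat_release IS NOT NULL
  AND fatcat_file.sha1hex IS NULL;
```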