author     Bryan Newbold <bnewbold@archive.org>  2021-11-24 16:01:47 -0800
committer  Bryan Newbold <bnewbold@archive.org>  2021-11-24 16:01:51 -0800
commit     d93d542adf9d26633b0f3cfa361277ca677c46f3
tree       c133d3030746afe25300a2e12a7645407a89b623 /proposals/20200129_pdf_ingest.md
parent     b4ca684c83d77a9fc6e7844ea8c45dfcb72aacb4
download   sandcrawler-d93d542adf9d26633b0f3cfa361277ca677c46f3.tar.gz, sandcrawler-d93d542adf9d26633b0f3cfa361277ca677c46f3.zip
codespell fixes in proposals
Diffstat (limited to 'proposals/20200129_pdf_ingest.md')

-rw-r--r--  proposals/20200129_pdf_ingest.md | 8

1 file changed, 4 insertions, 4 deletions
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 9469217..620ed09 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -27,7 +27,7 @@ There are a few million papers in fatacat which:
 2. are known OA, usually because publication is Gold OA
 3. don't have any fulltext PDF in fatcat
 
-As a detail, some of these "known OA" journals actually have embargos (aka,
+As a detail, some of these "known OA" journals actually have embargoes (aka,
 they aren't true Gold OA). In particular, those marked via EZB OA "color", and
 recent pubmed central ids.
@@ -104,7 +104,7 @@ Actions:
   update ingest result table with status.
 - fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
   into ingest request table. use SQL to dump only the *new* URLs (not seen in
-  previous dumps) using the created timestamp, outputing new bulk ingest
+  previous dumps) using the created timestamp, outputting new bulk ingest
   request lists. if possible, de-dupe between these two. then start bulk
   heritrix crawls over these two long lists. Probably sharded over several
   machines. Could also run serially (first one, then the other, with
@@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able
 to do a SQL query to select PDFs that:
 
 - have at least one known CDX row
-- GROBID processed successfuly and glutton matched to a fatcat release
+- GROBID processed successfully and glutton matched to a fatcat release
 - do not have an existing fatcat file (based on sha1hex)
 - output GROBID metadata, `file_meta`, and one or more CDX rows
@@ -161,7 +161,7 @@ Coding Tasks:
 Actions:
 
 - update `fatcat_file` sandcrawler table
-- check how many PDFs this might ammount to. both by uniq SHA1 and uniq
+- check how many PDFs this might amount to. both by uniq SHA1 and uniq
   `fatcat_release` matches
 - do some manual random QA verification to check that this method results in
   quality content in fatcat
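
The proposal text touched by the third hunk describes a SQL selection: PDFs with at least one known CDX row, a successful GROBID run that glutton matched to a fatcat release, and no existing fatcat file for the same sha1hex. A minimal runnable sketch of that query shape, using an in-memory SQLite database; all table and column names here (`cdx`, `grobid`, `fatcat_file`, `status`, `fatcat_release`) are illustrative assumptions, not sandcrawler's actual schema:

```python
import sqlite3

# Hypothetical mini-schema standing in for the sandcrawler tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cdx (sha1hex TEXT, url TEXT);
CREATE TABLE grobid (sha1hex TEXT, status TEXT, fatcat_release TEXT);
CREATE TABLE fatcat_file (sha1hex TEXT);

INSERT INTO cdx VALUES ('aaa', 'http://example.com/a.pdf'),
                       ('bbb', 'http://example.com/b.pdf');
INSERT INTO grobid VALUES ('aaa', 'success', 'release-1'),
                          ('bbb', 'success', 'release-2');
-- 'bbb' already has a fatcat file, so it should be excluded.
INSERT INTO fatcat_file VALUES ('bbb');
""")

# Select sha1hex values with a CDX row, a successful GROBID+glutton
# match, and no existing fatcat file (anti-join via LEFT JOIN ... IS NULL).
rows = conn.execute("""
    SELECT DISTINCT g.sha1hex, g.fatcat_release
    FROM grobid g
    JOIN cdx c ON c.sha1hex = g.sha1hex
    LEFT JOIN fatcat_file f ON f.sha1hex = g.sha1hex
    WHERE g.status = 'success'
      AND g.fatcat_release IS NOT NULL
      AND f.sha1hex IS NULL
""").fetchall()
print(rows)  # only 'aaa' qualifies
```

The `LEFT JOIN ... IS NULL` anti-join is one common way to express the "do not have an existing fatcat file" condition; `NOT EXISTS` would work equally well.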