50 files changed, 86 insertions, 80 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md index 057f1afe..32257c1f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,6 +15,12 @@ See also: - [Semantic Versioning](https://semver.org/spec/v2.0.0.html) +## UNRELEASED + +### Fixed + +- various typos and spelling errors corrected (using `codespell`) + ## [0.5.0] - 2021-11-22 Small change to the API schema (and SQL schema), adding the `content_scope` @@ -29,7 +35,7 @@ file, may be reversed in API responses compared to what was returned previously. They should not match what was original supplied when the entity was created. -In particular, this may cause broad discrepencies compared to historical bulk +In particular, this may cause broad discrepancies compared to historical bulk metadata exports. New bulk exports will be generated with the new ordering. A number of content cleanups and changes are also taking place to the primary @@ -92,7 +92,7 @@ Want to minimize edit counts, so will bundle a bunch of changes - maybe better 'success' return message? eg, "success: true" flag - idea: allow users to generate their own editgroup UUIDs, to reduce a round trips and "hanging" editgroups (created but never edited) -- refactor API schema for some entity-generic methos (eg, history, edit +- refactor API schema for some entity-generic methods (eg, history, edit operations) to take entity type as a URL path param. 
greatly reduce macro foolery and method count/complexity, and ease creation of new entities => /{entity}/edit/{edit_id} @@ -161,7 +161,7 @@ new importers: convert JATS if necessary - switch from slog to simple pretty_env_log - format returned datetimes with only second precision, not millisecond (RFC mode) - => burried in model serialization internals + => buried in model serialization internals - refactor openapi schema to use shared response types - consider using "HTTP 202: Accepted" for entity-mutating calls - basic python hbase/elastic matcher diff --git a/extra/elasticsearch/README.md b/extra/elasticsearch/README.md index edb4f1f6..90019147 100644 --- a/extra/elasticsearch/README.md +++ b/extra/elasticsearch/README.md @@ -83,7 +83,7 @@ a new index and then cut over with no downtime. http put :9200/fatcat_release_v03 < release_schema.json -To replace a "real" index with an alias pointer, do two actions (not truely +To replace a "real" index with an alias pointer, do two actions (not truly zero-downtime, but pretty fast): http delete :9200/fatcat_release diff --git a/extra/journal_metadata/README.md b/extra/journal_metadata/README.md index dec32624..cae52de3 100644 --- a/extra/journal_metadata/README.md +++ b/extra/journal_metadata/README.md @@ -2,7 +2,7 @@ This folder contains scripts to merge journal metadat from multiple sources and provide a snapshot for bulk importing into fatcat. -Specific bots will probably be needed to do continous updates; that's out of +Specific bots will probably be needed to do continuous updates; that's out of scope for this first import. 
diff --git a/extra/sitemap/README.md b/extra/sitemap/README.md index 581ee9f3..9f0dd4b0 100644 --- a/extra/sitemap/README.md +++ b/extra/sitemap/README.md @@ -37,8 +37,8 @@ In tree form: Workflow: -- run bash script over container dump, outputing compressed, sharded container sitemaps -- run bash script over release work-grouped, outputing compressed, sharded release sitemaps +- run bash script over container dump, outputting compressed, sharded container sitemaps +- run bash script over release work-grouped, outputting compressed, sharded release sitemaps - run python script to output top-level `sitemap.xml` - `scp` all of this into place diff --git a/fatcat-openapi2.yml b/fatcat-openapi2.yml index 7fafdb89..bebaee1a 100644 --- a/fatcat-openapi2.yml +++ b/fatcat-openapi2.yml @@ -35,7 +35,7 @@ info: Fatcat is made available as a gratis (no cost) and libre (freedom preserving) service to the public, with limited funding and resources. We - welcome new and unforseen uses and contributions, but may need to impose + welcome new and unforeseen uses and contributions, but may need to impose restrictions (like rate-limits) to keep the service functional for other users, and in extreme cases reserve the option to block accounts and IP ranges if necessary to keep the service operational. @@ -167,7 +167,7 @@ tags: # TAGLINE description: | # TAGLINE **Fileset** entities represent sets of digital files, as well as locations # TAGLINE where they can be found on the public web. Filesets most commonly # TAGLINE - represent datasets consisting of serveral data and metadata files. # TAGLINE + represent datasets consisting of several data and metadata files. # TAGLINE See the "Catalog Style Guide" section of the guide for details and # TAGLINE semantics of what should be included in specific entity fields. # TAGLINE @@ -731,7 +731,7 @@ definitions: $ref: "#/definitions/container_entity" description: | Complete container entity identified by `container_id` field. 
Only - included in GET reponses when `container` included in `expand` + included in GET responses when `container` included in `expand` parameter; ignored in PUT or POST requests. files: type: array @@ -793,7 +793,7 @@ definitions: type: string example: "retracted" description: | - Type of withdrawl or retraction of this release, if applicable. If + Type of withdrawal or retraction of this release, if applicable. If release has not been withdrawn, should be `null` (aka, not set, not the string "null" or an empty string). withdrawn_date: @@ -825,7 +825,7 @@ definitions: example: "12" description: | Issue number of volume/container that this release was published in. - Sometimes coresponds to a month number in the year, but can be any + Sometimes corresponds to a month number in the year, but can be any string. See guide. pages: type: string @@ -1034,7 +1034,7 @@ definitions: description: | Username/handle (short slug-like string) to identify this editor. May be changed at any time by the editor; use the `editor_id` as a - persistend identifer. + persistend identifier. is_admin: type: boolean example: false @@ -1064,7 +1064,7 @@ definitions: <<: *FATCATIDENT <<: *FATCATIDENTEXAMPLE description: | - Fatcat identifer of editor that created this editgroup. + Fatcat identifier of editor that created this editgroup. editor: $ref: "#/definitions/editor" description: | @@ -1076,7 +1076,7 @@ definitions: format: int64 description: | For accepted/merged editgroups, the changelog index that the accept - occured at. WARNING: not populated in all contexts that an editgroup + occurred at. WARNING: not populated in all contexts that an editgroup could be included in a response. created: type: string @@ -2891,7 +2891,7 @@ paths: description: | Updates an existing entity as part of a specific (existing) editgroup. 
The editgroup must be open for updates (aka, not accepted/merged), and - the editor making the requiest must have permissions (aka, must have + the editor making the request must have permissions (aka, must have created the editgroup or have `admin` role). This method can also be used to update an existing entity edit as part diff --git a/fatcat-rfc.md b/fatcat-rfc.md index 13466df2..8b966bdf 100644 --- a/fatcat-rfc.md +++ b/fatcat-rfc.md @@ -63,7 +63,7 @@ to a rigid third-party ontology or schema. Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could -ingest or synchronize the databse in those formats. +ingest or synchronize the database in those formats. ## Licensing @@ -109,7 +109,7 @@ push through edits more rapidly (eg, importing new works from a publisher API). Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and -reverts managable. +reverts manageable. Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing @@ -153,7 +153,7 @@ In comparison, 96-bit identifiers would have 20 characters and look like: work_rzga5b9cd7efgh04iljk https://fatcat.wiki/work/rzga5b9cd7efgh04iljk -A 64-bit namespace would probably be large enought, and would work with +A 64-bit namespace would probably be large enough, and would work with database Integer columns: work_rzga5b9cd7efg @@ -170,7 +170,7 @@ entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems, this is the git model, not the mercurial model. 
-The entity revisions are immutable once accepted; the editting process involves +The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by *identifier* not *revision number*. Identifier pointers also support @@ -327,7 +327,7 @@ Some special namespace tables and enums would probably be helpful; these could live in the database (not requiring a database migration to update), but should have more controlled editing workflow... perhaps versioned in the codebase: -- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers +- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves) - subject categorization - license and open access status diff --git a/guide/src/code_of_conduct.md b/guide/src/code_of_conduct.md index 803d68c7..216cc70c 100644 --- a/guide/src/code_of_conduct.md +++ b/guide/src/code_of_conduct.md @@ -33,7 +33,7 @@ unacceptable behavior, you can email <ethics@archive.org>. comment threads), as well as any physical spaces such as conference gatherings or meetups. -- All participants are expected to respect this code, irregardless of their +- All participants are expected to respect this code, regardless of their position or record of contributions to the project. ## Unacceptable behavior diff --git a/guide/src/editing_quickstart.md b/guide/src/editing_quickstart.md index 56fb2357..df413fcc 100644 --- a/guide/src/editing_quickstart.md +++ b/guide/src/editing_quickstart.md @@ -20,7 +20,7 @@ confirm your email, and then log-in to Fatcat using that. You should see your username in the upper right-hand corner of every page when you are successfully logged in. -Next find the release's fatcat identifer for the paper we want to add a file +Next find the release's fatcat identifier for the paper we want to add a file to. 
You can [search](https://fatcat.wiki/release/search) by title, or [lookup](https://fatcat.wiki/release/lookup) a paper by an identifier (such as a DOI or arXiv ID). If the release you are looking for doesn't exist yet, @@ -70,7 +70,7 @@ view should have a link to the release entity; follow that link, then click the This time, the most recent editgroup should already be selected, so you don't need to enter a description at the top. If there are any problems with basic metadata, go ahead and fix them, but otherwise skip down to the "Container" -section and update the fatcat identifer ("FCID") to point to the correct +section and update the fatcat identifier ("FCID") to point to the correct journal. You can [lookup journals](https://fatcat.wiki/container/lookup) by ISSN-L, or [search](https://fatcat.wiki/container/search) by title. Add a short description of your change ("Updated journal to XYZ") and then submit. @@ -81,7 +81,7 @@ editgroups from the drop-down link in the upper right-hand corner of every page (your username, then "Edit History"). The editgroup page shows all the entities created, updated, or deleted, and allows you to make tweaks (re-edit) or remove changes. If the release/container update you made was bogus (just as -a learning exersize), you could remove it here. It's a good practice to group +a learning exercize), you could remove it here. It's a good practice to group related edits into the same editgroup, but only up to 50 or so edits at a time (more than that becomes difficult hard to review). diff --git a/guide/src/entity_container.md b/guide/src/entity_container.md index 94201d90..dde7751b 100644 --- a/guide/src/entity_container.md +++ b/guide/src/entity_container.md @@ -108,7 +108,7 @@ preserved). 
- `suspended`: publication has stopped, but may continue in the future - `discontinued`: publication has permanently ceased - `vanished`: publication has stopped, and public traces have vanished (eg, - publisher website has disapeared with no notice) + publisher website has disappeared with no notice) - `never`: no works were ever published under this container - `one-time`: releases were all published as a one-time even. for example, a single instance of a conference, or a fixed-size book series diff --git a/guide/src/entity_release.md b/guide/src/entity_release.md index dd09b30b..ea67c5b5 100644 --- a/guide/src/entity_release.md +++ b/guide/src/entity_release.md @@ -147,7 +147,7 @@ complete or correct in more obscure cases. - `arxiv` (string): external identifier to a (version-specific) [arxiv.org][] work. For releases, must always include the `vN` suffix (eg, `v3`). - `jstor` (string): external identifier for works in JSTOR. -- `ark` (string): ARK identifer +- `ark` (string): ARK identifier - `mag` (deprecated; string): Microsoft Academic Graph identifier. Never used, may be deleted in the future - `doaj` (string): [DOAJ](https://doaj.org) article-level identifier @@ -323,7 +323,7 @@ print journal publication. Any value at all indicates that the release should be considered "no longer published by the publisher or primary host", which could mean different things in different contexts. 
As some concrete examples, works are often accidentally -generated a duplicate DOI; physics papers have been taken down in reponse to +generated a duplicate DOI; physics papers have been taken down in response to government order under national security justifications; papers have been withdrawn for public health reasons (above and beyond any academic-style retraction); entire journals may be found to be predatory and pulled from diff --git a/guide/src/reference_graph.md b/guide/src/reference_graph.md index 73dc7efe..0f3606e7 100644 --- a/guide/src/reference_graph.md +++ b/guide/src/reference_graph.md @@ -85,7 +85,7 @@ Specific sources: * `fatcat-datacite`: same as `crossref`, but for the Datacite DOI registrar. * `fatcat-pubmed`: references, linked or not linked, from Pubmed/MEDLINE metadata -* `fatcat`: references in fatcat where the original provenance can't be infered +* `fatcat`: references in fatcat where the original provenance can't be inferred (but could be manually found by inspecting the release edit history) * `grobid`: references parsed out of full-text PDFs using [GROBID](https://github.com/kermitt2/grobid) diff --git a/notes/UNSORTED.txt b/notes/UNSORTED.txt index 3960f5eb..850b54d0 100644 --- a/notes/UNSORTED.txt +++ b/notes/UNSORTED.txt @@ -3,7 +3,7 @@ Not allowed to PUT edits to the same entity in the same editgroup. If you want to update an edit, need to delete the old one first. The state depends only on the current entity state, not any redirect. This -means that if the target of a redirect is delted, the redirecting entity is +means that if the target of a redirect is deleted, the redirecting entity is still "redirect", not "deleted". Redirects-to-redirects are not allowed; this is enforced when the editgroup is @@ -31,7 +31,7 @@ redirects after some delay period. => it would not be too hard to update get_release_files to check for such redirects; could be handled by request flag? -`prev_rev` is naively set to the most-recent previous state. 
If the curent +`prev_rev` is naively set to the most-recent previous state. If the current state was deleted or a redirect, it is set to null. This parameter is not checked/enforced at edit accept time (but could be, and diff --git a/notes/bulk_edits/2019-10-08_file_cleanups.md b/notes/bulk_edits/2019-10-08_file_cleanups.md index b61b37f0..2eebb363 100644 --- a/notes/bulk_edits/2019-10-08_file_cleanups.md +++ b/notes/bulk_edits/2019-10-08_file_cleanups.md @@ -5,7 +5,7 @@ web.archive.org). These URLs were created accidentally during fatcat boostrapping; there are about 300k such file enties to fix. Will also update archive.org link reltype to 'archive' (instead of -'repository'), which is the new prefered style. +'repository'), which is the new preferred style. Generated the set of files to update like: diff --git a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md index b2fd29d5..56e88880 100644 --- a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md +++ b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md @@ -1,7 +1,7 @@ On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed metadata started. In the case of pubmed, updates are enabled, so that recently -created DOI releases get updated with PMID and extra metdata. +created DOI releases get updated with PMID and extra metadata. We also want to do last backfills of metadata since the last import up through the first day updated by the continuous harvester. 
diff --git a/notes/bulk_edits/2020-09-02_file_meta.md b/notes/bulk_edits/2020-09-02_file_meta.md index 35c4d87f..b0606f2d 100644 --- a/notes/bulk_edits/2020-09-02_file_meta.md +++ b/notes/bulk_edits/2020-09-02_file_meta.md @@ -25,7 +25,7 @@ Partial wayback URL timestamps, for cases where we have the full timestamped URL https://qa.fatcat.wiki/file/k73il3k5hzemtnkqa5qyorg6ci https://qa.fatcat.wiki/file/7hstlrabfjb6vgyph7ntqtpkne -Live-web URLs identical except for http/https flip or other trival things (much less frequent case): +Live-web URLs identical except for http/https flip or other trivial things (much less frequent case): http://eo1.gsfc.nasa.gov/new/validationReport/Technology/JoeCD/asner_etal_PNAS_20041.pdf https://eo1.gsfc.nasa.gov/new/validationReport/Technology/JoeCD/asner_etal_PNAS_20041.pdf diff --git a/notes/bulk_edits/2020-12-23_dblp.md b/notes/bulk_edits/2020-12-23_dblp.md index c3ad0587..a33411cb 100644 --- a/notes/bulk_edits/2020-12-23_dblp.md +++ b/notes/bulk_edits/2020-12-23_dblp.md @@ -52,4 +52,4 @@ Run import: => Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 3097418, 'skip-key-type': 2640968, 'skip-update': 2480449, 'exists': 943800, 'update': 889700, 'insert': 338842, 'skip-arxiv-corr': 312872, 'exists-fuzzy': 203103, 'skip-dblp-container-missing': 143578, 'skip-arxiv': 53, 'skip-title': 1}) Starting database size (roughly): Size: 684.08G -Ending databse size: Size: 690.22G +Ending database size: Size: 690.22G diff --git a/notes/bulk_edits/2020_datacite.md b/notes/bulk_edits/2020_datacite.md index 005841ae..05d09517 100644 --- a/notes/bulk_edits/2020_datacite.md +++ b/notes/bulk_edits/2020_datacite.md @@ -54,7 +54,7 @@ Compare with `--lang-detect`: user 3m5.620s sys 0m13.344s -Not noticable? +Not noticeable? 
Whole run: diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md index e3ea942d..9db77058 100644 --- a/notes/cleanups/wayback_timestamps.md +++ b/notes/cleanups/wayback_timestamps.md @@ -1,6 +1,6 @@ -At some point, using the arabesque importer (from targetted crawling), we -accidentially imported a bunch of files with wayback URLs that have 12-digit +At some point, using the arabesque importer (from targeted crawling), we +accidentally imported a bunch of files with wayback URLs that have 12-digit timestamps, instead of the full canonical 14-digit timestamps. diff --git a/notes/data_model.md b/notes/data_model.md index 2d2825ae..f13e33cc 100644 --- a/notes/data_model.md +++ b/notes/data_model.md @@ -87,12 +87,12 @@ Each entity type has tables: core representation of a version of the entity _ident - persistant, external identifier + persistent, external identifier allows merging, unmerging, stable cross-entity references _edit represents change metadata for a single change to one ident - needed because an edit alwasy changes ident, but might not change rev + needed because an edit always changes ident, but might not change rev Could someday also have: diff --git a/notes/performance/postgres_performance.txt b/notes/performance/postgres_performance.txt index cd2a5162..ff8fcb3b 100644 --- a/notes/performance/postgres_performance.txt +++ b/notes/performance/postgres_performance.txt @@ -189,7 +189,7 @@ max_wal_size wasn't getting set correctly. The statements taking the most time are the complex inserts (multi-table inserts); they take a fraction of a second though (mean less than a -milisecond). +millisecond). Manifest import runs really slow if release import is concurrent; much faster to wait until release import is done first (like a factor of 10x or more). 
diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md index 8953448c..b0a484ad 100644 --- a/proposals/20190510_release_ext_ids.md +++ b/proposals/20190510_release_ext_ids.md @@ -23,7 +23,7 @@ sure this is worth it though. ## New API -All identifers as text +All identifiers as text release_entity ext_ids (required) diff --git a/proposals/202008_bulk_citation_graph.md b/proposals/202008_bulk_citation_graph.md index f8868e45..65db0d94 100644 --- a/proposals/202008_bulk_citation_graph.md +++ b/proposals/202008_bulk_citation_graph.md @@ -43,7 +43,7 @@ The high-level prosposal is: types - sort the "source" references into an index and run a merge-sort on bucket keys against the "target" index to generate candidate match buckets -- run python fuzzy match code against the candidate buckets, outputing a status +- run python fuzzy match code against the candidate buckets, outputting a status for each reference input and a list of all strong matches - resort successful matches and index by both source and target identifiers as output citation graph diff --git a/proposals/2020_client_cli.md b/proposals/2020_client_cli.md index 2a0c8fa1..01d190a8 100644 --- a/proposals/2020_client_cli.md +++ b/proposals/2020_client_cli.md @@ -69,7 +69,7 @@ Argument conventions: ':' Lookup specifier for entity (eg, external identifier like `doi:10.123/abc`) '=' Assign field to value in create or update contexts. Non-string - values often can be infered by field type + values often can be inferred by field type ':=' Assign field to non-string value in create or update contexts @@ -92,7 +92,7 @@ Small details (mostly TODO): '@' Form field Output goes to stdout (pretty-printed), unless specified to `--download / -d`), -in which case output file is infered, or `--output` sets it explicitly. +in which case output file is inferred, or `--output` sets it explicitly. 
### Internet Archive `ia` Tool diff --git a/proposals/2020_fuzzy_matching.md b/proposals/2020_fuzzy_matching.md index 30c321e3..e84c2bd2 100644 --- a/proposals/2020_fuzzy_matching.md +++ b/proposals/2020_fuzzy_matching.md @@ -244,7 +244,7 @@ use-cases: Optionally, we could also architect/design this tool to replace biblio-glutton for ingest-time "reference consolidation", by exposing a biblio-glutton compatible API. If this isn't possible or hard it could become a later tool -instead. Eg, shouldn't sacrafice batch performance for this. In particular, for +instead. Eg, shouldn't sacrifice batch performance for this. In particular, for ingest-time reference matching we'd want the backing corpus to be updated continuously, which might be tricky or in conflict with batch-mode design. @@ -289,7 +289,7 @@ reading the Scala and Python source ## Longtail OA Import Filtering -Not direcly related to matching, but filtering mixed-quality metadata. +Not directly related to matching, but filtering mixed-quality metadata. As part of Longtail OA preservation work, we ran a crawl of small OA journal websites, and then ran GROBID over the resulting PDFs to extract metadata. We @@ -383,7 +383,7 @@ indices. It is also possible to iterate over both indices by bucket and doing further processing between all the papers, then combined the matches/groups from both iterations. The reason for using two indices is to be robust against mangled metadata where there is added junk or missing words at either the -begining or end of the title. +beginning or end of the title. To verify candidate pairs, the Jaccard similarity is calculated between the full original title strings. This flexibly allows for character typos (human or diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md index cf6b08e5..b95f6579 100644 --- a/proposals/2020_metadata_cleanups.md +++ b/proposals/2020_metadata_cleanups.md @@ -88,7 +88,7 @@ At some point, had many "NULL" publishers. 
"Type" coverage should be improved. -"Publisher type" (infered in various ways in chocula tool) could be included in +"Publisher type" (inferred in various ways in chocula tool) could be included in `extra` and end up in search faceting. Overall OA status should probably be more sophisticated: gold, green, etc. diff --git a/proposals/2021-01-29_citation_api.md b/proposals/2021-01-29_citation_api.md index 3805dcac..6379da09 100644 --- a/proposals/2021-01-29_citation_api.md +++ b/proposals/2021-01-29_citation_api.md @@ -212,7 +212,7 @@ would make "outbound" queries a trivial key lookup, instead of a query by rows would be returned, with unwanted metadata. Another alternative design would be storing more metadata about source and -target in each row. This would remove the ned to do separate +target in each row. This would remove the need to do separate "hydration"/"enrich" fetches. This would probably blow up in the index size though, and would require more aggressive re-indexing (in a live-updated scenario). Eg, when a new fulltext file is updated (access option), would need diff --git a/proposals/README.md b/proposals/README.md index 5e6747b1..31184fe3 100644 --- a/proposals/README.md +++ b/proposals/README.md @@ -6,6 +6,6 @@ is large enough to require planning and documentation. Each should be tagged with a date first drafted, and labeled with a status: - brainstorm: just putting ideas down; might not even happen -- planned: commited to happening, but not started yet +- planned: committed to happening, but not started yet - work-in-progress: currently being worked on - implemented: completed, merged to master/production/live diff --git a/python/README_import.md b/python/README_import.md index 74e75e14..1d54f9d7 100644 --- a/python/README_import.md +++ b/python/README_import.md @@ -140,7 +140,7 @@ Takes a few hours. ## dblp See `extra/dblp/README.md` for notes about first importing container metadata -and getting a TSV mapping flie to help with import. 
This is needed because +and getting a TSV mapping file to help with import. This is needed because there is not (yet) a lookup mechanism for `dblp_prefix` as an identifier of container entities. diff --git a/python/fatcat_tools/harvest/pubmed.py b/python/fatcat_tools/harvest/pubmed.py index 560427fb..78b1755b 100644 --- a/python/fatcat_tools/harvest/pubmed.py +++ b/python/fatcat_tools/harvest/pubmed.py @@ -279,7 +279,7 @@ def ftpretr( "ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed20n1016.xml.gz") to a local temporary file. Returns the name of the local, closed temporary file. - It is the reponsibility of the caller to cleanup the temporary file. + It is the responsibility of the caller to cleanup the temporary file. Implements a basic retry mechanism, e.g. that became an issue in 08/2021, when we encountered EOFError while talking to the FTP server. Retry delay in seconds. diff --git a/python/fatcat_tools/importers/common.py b/python/fatcat_tools/importers/common.py index e2157ee5..cd51a24c 100644 --- a/python/fatcat_tools/importers/common.py +++ b/python/fatcat_tools/importers/common.py @@ -432,7 +432,7 @@ class EntityImporter: - WEAK - AMBIGUOUS - Eg, if there is any EXACT match that is always returned; an AMBIGIOUS + Eg, if there is any EXACT match that is always returned; an AMBIGUOUS result is only returned if all the candidate matches were ambiguous. """ @@ -725,7 +725,7 @@ class KafkaBs4XmlPusher(RecordPusher): while True: # Note: this is batch-oriented, because underlying importer is # often batch-oriented, but this doesn't confirm that entire batch - # has been pushed to fatcat before commiting offset. Eg, consider + # has been pushed to fatcat before committing offset. Eg, consider # case where there there is one update and thousands of creates; # update would be lingering in importer, and if importer crashed # never created. 
diff --git a/python/fatcat_web/entity_helpers.py b/python/fatcat_web/entity_helpers.py index 2e3b83c5..285513a8 100644 --- a/python/fatcat_web/entity_helpers.py +++ b/python/fatcat_web/entity_helpers.py @@ -92,7 +92,7 @@ def enrich_release_entity(entity: ReleaseEntity) -> ReleaseEntity: # November 1. if ref.extra and ref.extra.get("unstructured"): ref.extra["unstructured"] = strip_extlink_xml(ref.extra["unstructured"]) - # for backwards compatability, copy extra['subtitle'] to subtitle + # for backwards compatibility, copy extra['subtitle'] to subtitle if not entity.subtitle and entity.extra and entity.extra.get("subtitle"): if isinstance(entity.extra["subtitle"], str): entity.subtitle = entity.extra["subtitle"] diff --git a/python/fatcat_web/search.py b/python/fatcat_web/search.py index fdfc4d80..b9994f28 100644 --- a/python/fatcat_web/search.py +++ b/python/fatcat_web/search.py @@ -161,8 +161,8 @@ def agg_to_dict(agg: Any) -> Dict[str, Any]: """ Takes a simple term aggregation result (with buckets) and returns a simple dict with keys as terms and counts as values. Includes an extra value - '_other', and by convention aggregations should be writen to have "missing" - vaules as '_unknown'. + '_other', and by convention aggregations should be written to have "missing" + values as '_unknown'. """ result = dict() for bucket in agg.buckets: diff --git a/python/fatcat_web/templates/container_create.html b/python/fatcat_web/templates/container_create.html index be8c5671..2a705ffd 100644 --- a/python/fatcat_web/templates/container_create.html +++ b/python/fatcat_web/templates/container_create.html @@ -18,7 +18,7 @@ book series, or a blog. Not all publications are in a container. 
<input class="ui primary submit button" type="submit" value="Create Container!"> <p> <i>New container entity will be part of the current editgroup, which needs to be - submited and approved before the entity will formally be included in the + submitted and approved before the entity will formally be included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/container_edit.html b/python/fatcat_web/templates/container_edit.html index 1885197c..1c6f32e4 100644 --- a/python/fatcat_web/templates/container_edit.html +++ b/python/fatcat_web/templates/container_edit.html @@ -70,7 +70,7 @@ <br><br> <input class="ui primary submit button" type="submit" value="Update Container!"> <p> - <i>Edit will be part of the current editgroup, which needs to be submited and + <i>Edit will be part of the current editgroup, which needs to be submitted and approved before the change is included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/entity_create_toml.html b/python/fatcat_web/templates/entity_create_toml.html index ec5bc4a2..2fd9a2bb 100644 --- a/python/fatcat_web/templates/entity_create_toml.html +++ b/python/fatcat_web/templates/entity_create_toml.html @@ -12,7 +12,7 @@ <input class="ui primary submit button" type="submit" value="Create {{ entity_type }}!"> <p> <i>New {{ entity_type }} entity will be part of the current editgroup, which - needs to be submited and approved before the entity will formally be included + needs to be submitted and approved before the entity will formally be included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/entity_delete.html b/python/fatcat_web/templates/entity_delete.html index 85742bb3..98b6b8e6 100644 --- a/python/fatcat_web/templates/entity_delete.html +++ b/python/fatcat_web/templates/entity_delete.html @@ -31,7 +31,7 @@ <br><br> <input class="ui primary submit button" type="submit" value="Update Release!"> <p> - <i>Deletion will be part of the current 
editgroup, which needs to be submited and + <i>Deletion will be part of the current editgroup, which needs to be submitted and approved before the change is included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/entity_edit_toml.html b/python/fatcat_web/templates/entity_edit_toml.html index b0252c82..6e99c402 100644 --- a/python/fatcat_web/templates/entity_edit_toml.html +++ b/python/fatcat_web/templates/entity_edit_toml.html @@ -33,7 +33,7 @@ <br><br> <input class="ui primary submit button" type="submit" value="Update Release!"> <p> - <i>Edit will be part of the current editgroup, which needs to be submited and + <i>Edit will be part of the current editgroup, which needs to be submitted and approved before the change is included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/file_create.html b/python/fatcat_web/templates/file_create.html index affcfb6e..29612d0e 100644 --- a/python/fatcat_web/templates/file_create.html +++ b/python/fatcat_web/templates/file_create.html @@ -14,7 +14,7 @@ <input class="ui primary submit button" type="submit" value="Create File!"> <p> <i>New file entity will be part of the current editgroup, which needs to be - submited and approved before the entity will formally be included in the + submitted and approved before the entity will formally be included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/file_edit.html b/python/fatcat_web/templates/file_edit.html index de16e59e..eeb25a9d 100644 --- a/python/fatcat_web/templates/file_edit.html +++ b/python/fatcat_web/templates/file_edit.html @@ -100,7 +100,7 @@ <br><br> <input class="ui primary submit button" type="submit" value="Update File!"> <p> - <i>Edit will be part of the current editgroup, which needs to be submited and + <i>Edit will be part of the current editgroup, which needs to be submitted and approved before the change is included in the catalog.</i> </form> </div> diff --git 
a/python/fatcat_web/templates/home.html b/python/fatcat_web/templates/home.html index acb943d9..5c8c33ba 100644 --- a/python/fatcat_web/templates/home.html +++ b/python/fatcat_web/templates/home.html @@ -240,7 +240,7 @@ <br><a href="/file/lookup">Other Hashes</a> </form> <tr><td><b>File Set</b> - <br>datasets, suplementary materials + <br>datasets, supplementary materials <td><a href="/fileset/create">Create</a> {% if config.FATCAT_DOMAIN == 'fatcat.wiki' %} <td><a href="/fileset/ho376wmdanckpp66iwfs7g22ne">Dataset</a> diff --git a/python/fatcat_web/templates/openlibrary_view_fuzzy_refs.html b/python/fatcat_web/templates/openlibrary_view_fuzzy_refs.html index 21bf76f2..e9444b75 100644 --- a/python/fatcat_web/templates/openlibrary_view_fuzzy_refs.html +++ b/python/fatcat_web/templates/openlibrary_view_fuzzy_refs.html @@ -16,7 +16,7 @@ <p>This page lists references to this book from other works (eg, journal articles). {% elif direction == "out" %} <h3>References</h3> - <i>Refernces from this book to other entities.</i> + <i>References from this book to other entities.</i> {% endif %} {{ refs_macros.refs_table(hits, direction) }} diff --git a/python/fatcat_web/templates/release_create.html b/python/fatcat_web/templates/release_create.html index 4f5dabd7..09191111 100644 --- a/python/fatcat_web/templates/release_create.html +++ b/python/fatcat_web/templates/release_create.html @@ -14,7 +14,7 @@ <input class="ui primary submit button" type="submit" value="Create Release!"> <p> <i>New release entity will be part of the current editgroup, which needs to be - submited and approved before the entity will formally be included in the + submitted and approved before the entity will formally be included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/release_edit.html b/python/fatcat_web/templates/release_edit.html index 0ac94be9..3f5c10f6 100644 --- a/python/fatcat_web/templates/release_edit.html +++ b/python/fatcat_web/templates/release_edit.html 
@@ -105,7 +105,7 @@ <br> <br> - <h3 class="ui dividing header">Identifers</h3> + <h3 class="ui dividing header">Identifiers</h3> <br> {{ edit_macros.form_field_inline(form.doi) }} {{ edit_macros.form_field_inline(form.wikidata_qid) }} @@ -148,7 +148,7 @@ <br><br> <input class="ui primary submit button" type="submit" value="Update Release!"> <p> - <i>Edit will be part of the current editgroup, which needs to be submited and + <i>Edit will be part of the current editgroup, which needs to be submitted and approved before the change is included in the catalog.</i> </form> </div> diff --git a/python/fatcat_web/templates/release_lookup.html b/python/fatcat_web/templates/release_lookup.html index a0ef3bb3..20821a10 100644 --- a/python/fatcat_web/templates/release_lookup.html +++ b/python/fatcat_web/templates/release_lookup.html @@ -49,7 +49,7 @@ you don't know the version, you can append "v1" to get the first version. <h2>DOI</h2> <p><a href="https://en.wikipedia.org/wiki/Digital_object_identifier"> -Digital object identifer</a>: "it's not an identifier for a digital object, +Digital object identifier</a>: "it's not an identifier for a digital object, it's a digital identifier for an object". Except they are pretty much all digital objects. Fatcat doesn't include all DOIs (eg, for granular components or TV shows), but it should for all complete research publications. DOIs are diff --git a/python/fatcat_web/templates/rfc.html b/python/fatcat_web/templates/rfc.html index c7e7149f..fba6eff3 100644 --- a/python/fatcat_web/templates/rfc.html +++ b/python/fatcat_web/templates/rfc.html @@ -25,7 +25,7 @@ <p>As little "application logic" as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. 
A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.</p> <p>A cronjob will creae periodic database dumps, both in "full" form (all tables and all edit history, removing only authentication credentials) and "flattened" form (with only the most recent version of each entity).</p> <p>A goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not necessarily "first". It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally fatcat will not be backed by a triple-store, and will not be bound to a rigid third-party ontology or schema.</p> -<p>Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the databse in those formats.</p> +<p>Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the database in those formats.</p> <h2 id="licensing">Licensing</h2> <p>The core fatcat database should only contain verifiable factual statements (which isn't to say that all statements are "true"), not creative or derived content.</p> <p>The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. 
The goal here isn't to avoid attribution (provenance information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.</p> @@ -33,7 +33,7 @@ <h2 id="basic-editing-workflow-and-bots">Basic Editing Workflow and Bots</h2> <p>Both human editors and bots should have edits go through the same API, with humans using either the default web interface, integrations, or client software.</p> <p>The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor would "submit" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (72 hours?) with no changes and no blocking issues, the edit group would be auto-accepted if no merge conflicts have be created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). 
More sophisticated roles and permissions could allow some certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).</p> -<p>Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts managable.</p> +<p>Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.</p> <p>Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are be used, and tag timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.</p> <p>A style guide (wiki) and discussion forum would be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (oauth?) 
to have consistent account IDs across all mediums.</p> <h2 id="global-edit-changelog">Global Edit Changelog</h2> @@ -47,13 +47,13 @@ https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz</code></pre> <p>In comparison, 96-bit identifiers would have 20 characters and look like:</p> <pre><code>work_rzga5b9cd7efgh04iljk https://fatcat.wiki/work/rzga5b9cd7efgh04iljk</code></pre> -<p>A 64-bit namespace would probably be large enought, and would work with database Integer columns:</p> +<p>A 64-bit namespace would probably be large enough, and would work with database Integer columns:</p> <pre><code>work_rzga5b9cd7efg https://fatcat.wiki/work/rzga5b9cd7efg</code></pre> <p>The idea would be to only have fatcat identifiers be used to interlink between databases, <em>not</em> to supplant DOIs, ISBNs, handle, ARKs, and other "registered" persistent identifiers.</p> <h2 id="entities-and-internal-schema">Entities and Internal Schema</h2> <p>Internally, identifiers would be lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems, this is the git model, not the mercurial model.</p> -<p>The entity revisions are immutable once accepted; the editting process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by <em>identifier</em> not <em>revision number</em>. Identifier pointers also support (versioned) deletion and redirects (for merging entities).</p> +<p>The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by <em>identifier</em> not <em>revision number</em>. 
Identifier pointers also support (versioned) deletion and redirects (for merging entities).</p> <p>Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).</p> <p>SQL tables would probably look something like the (but specific to each entity type, with tables like <code>work_revision</code> not <code>entity_revision</code>):</p> <pre><code>entity_ident @@ -158,7 +158,7 @@ container (aka "venue", "serial", "title") <h2 id="controlled-vocabularies">Controlled Vocabularies</h2> <p>Some special namespace tables and enums would probably be helpful; these could live in the database (not requiring a database migration to update), but should have more controlled editing workflow... perhaps versioned in the codebase:</p> <ul> -<li>identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers themselves)</li> +<li>identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves)</li> <li>subject categorization</li> <li>license and open access status</li> <li>work "types" (article vs. book chapter vs. proceeding, etc)</li> diff --git a/python/fatcat_web/templates/wikipedia_view_fuzzy_refs.html b/python/fatcat_web/templates/wikipedia_view_fuzzy_refs.html index 3e1453c1..2d2627b1 100644 --- a/python/fatcat_web/templates/wikipedia_view_fuzzy_refs.html +++ b/python/fatcat_web/templates/wikipedia_view_fuzzy_refs.html @@ -14,7 +14,7 @@ <p>This page lists references to a wikipedia article, from other works (eg, journal articles). {% elif direction == "out" %} <h3>References</h3> - <i>Refernces from wikipedia article to other entities.</i> + <i>References from wikipedia article to other entities.</i> {% endif %} {{ refs_macros.refs_table(hits, direction) }} diff --git a/rust/HACKING.md b/rust/HACKING.md index c321cded..fbdeb499 100644 --- a/rust/HACKING.md +++ b/rust/HACKING.md @@ -26,7 +26,7 @@ are verbose and implemented in a very mechanical fashion. 
The return type mapping in `api_wrappers` might be necessary, but `database_models.rs` in particular feels unnecessary; other projects have attempted to completely automate generation of this file, but it doesn't sound reliable. In particular, -both regular "Row" (queriable) and "NewRow" (insertable) structs need to be +both regular "Row" (queryable) and "NewRow" (insertable) structs need to be defined. ## Test Structure diff --git a/rust/README.md b/rust/README.md index 6f213629..36061240 100644 --- a/rust/README.md +++ b/rust/README.md @@ -71,7 +71,7 @@ All configuration goes through environment variables, the notable ones being: - `TEST_DATABASE_URL`: used when running `cargo test` - `AUTH_LOCATION`: the domain authentication tokens should be valid over - `AUTH_KEY_IDENT`: a unique name for the primary auth signing key (used to - find the correct key after key rotation has occured) + find the correct key after key rotation has occurred) - `AUTH_SECRET_KEY`: base64-encoded secret key used to both sign and verify authentication tokens (symmetric encryption) - `AUTH_ALT_KEYS`: additional ident/key pairs that can be used to verify tokens @@ -28,7 +28,7 @@ later: https://github.com/jkcclemens/paste/blob/942d1ede8abe80a594553197f2b03c1d6d70efd0/webserver/build.rs https://github.com/jkcclemens/paste/blob/942d1ede8abe80a594553197f2b03c1d6d70efd0/webserver/src/main.rs#L44 - "prev_rev" required in updates -- tried using sync::Once to wrap test database initilization (so it would only +- tried using sync::Once to wrap test database initialization (so it would only run migrations once), but it didn't seem to work, maybe I had a bug or it didn't compile? => could also do a global mutex: https://github.com/SergioBenitez/Rocket/issues/697 |