Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/20190510_release_ext_ids.md | 2
-rw-r--r-- | proposals/202008_bulk_citation_graph.md | 2
-rw-r--r-- | proposals/2020_client_cli.md | 4
-rw-r--r-- | proposals/2020_fuzzy_matching.md | 6
-rw-r--r-- | proposals/2020_metadata_cleanups.md | 2
-rw-r--r-- | proposals/2021-01-29_citation_api.md | 2
-rw-r--r-- | proposals/README.md | 2
7 files changed, 10 insertions, 10 deletions
diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md
index 8953448c..b0a484ad 100644
--- a/proposals/20190510_release_ext_ids.md
+++ b/proposals/20190510_release_ext_ids.md
@@ -23,7 +23,7 @@ sure this is worth it though.
 
 ## New API
 
-All identifers as text
+All identifiers as text
 
 release_entity
     ext_ids (required)
diff --git a/proposals/202008_bulk_citation_graph.md b/proposals/202008_bulk_citation_graph.md
index f8868e45..65db0d94 100644
--- a/proposals/202008_bulk_citation_graph.md
+++ b/proposals/202008_bulk_citation_graph.md
@@ -43,7 +43,7 @@ The high-level prosposal is:
   types
 - sort the "source" references into an index and run a merge-sort on bucket
   keys against the "target" index to generate candidate match buckets
-- run python fuzzy match code against the candidate buckets, outputing a status
+- run python fuzzy match code against the candidate buckets, outputting a status
   for each reference input and a list of all strong matches
 - resort successful matches and index by both source and target identifiers as
   output citation graph
diff --git a/proposals/2020_client_cli.md b/proposals/2020_client_cli.md
index 2a0c8fa1..01d190a8 100644
--- a/proposals/2020_client_cli.md
+++ b/proposals/2020_client_cli.md
@@ -69,7 +69,7 @@ Argument conventions:
 
     ':'     Lookup specifier for entity (eg, external identifier like `doi:10.123/abc`)
     '='     Assign field to value in create or update contexts. Non-string
-            values often can be infered by field type
+            values often can be inferred by field type
     ':='    Assign field to non-string value in create or update contexts
 
 
@@ -92,7 +92,7 @@ Small details (mostly TODO):
     '@'     Form field
 
 Output goes to stdout (pretty-printed), unless specified to `--download / -d`),
-in which case output file is infered, or `--output` sets it explicitly.
+in which case output file is inferred, or `--output` sets it explicitly.
 
 
 ### Internet Archive `ia` Tool
diff --git a/proposals/2020_fuzzy_matching.md b/proposals/2020_fuzzy_matching.md
index 30c321e3..e84c2bd2 100644
--- a/proposals/2020_fuzzy_matching.md
+++ b/proposals/2020_fuzzy_matching.md
@@ -244,7 +244,7 @@ use-cases:
 Optionally, we could also architect/design this tool to replace biblio-glutton
 for ingest-time "reference consolidation", by exposing a biblio-glutton
 compatible API. If this isn't possible or hard it could become a later tool
-instead. Eg, shouldn't sacrafice batch performance for this. In particular, for
+instead. Eg, shouldn't sacrifice batch performance for this. In particular, for
 ingest-time reference matching we'd want the backing corpus to be updated
 continuously, which might be tricky or in conflict with batch-mode design.
 
@@ -289,7 +289,7 @@ reading the Scala and Python source
 
 ## Longtail OA Import Filtering
 
-Not direcly related to matching, but filtering mixed-quality metadata.
+Not directly related to matching, but filtering mixed-quality metadata.
 
 As part of Longtail OA preservation work, we ran a crawl of small OA journal
 websites, and then ran GROBID over the resulting PDFs to extract metadata. We
@@ -383,7 +383,7 @@ indices. It is also possible to iterate over both indices by bucket and doing
 further processing between all the papers, then combined the matches/groups
 from both iterations. The reason for using two indices is to be robust against
 mangled metadata where there is added junk or missing words at either the
-begining or end of the title.
+beginning or end of the title.
 
 To verify candidate pairs, the Jaccard similarity is calculated between the
 full original title strings. This flexibly allows for character typos (human or
diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md
index cf6b08e5..b95f6579 100644
--- a/proposals/2020_metadata_cleanups.md
+++ b/proposals/2020_metadata_cleanups.md
@@ -88,7 +88,7 @@ At some point, had many "NULL" publishers.
 
 "Type" coverage should be improved.
 
-"Publisher type" (infered in various ways in chocula tool) could be included in
+"Publisher type" (inferred in various ways in chocula tool) could be included in
 `extra` and end up in search faceting.
 
 Overall OA status should probably be more sophisticated: gold, green, etc.
diff --git a/proposals/2021-01-29_citation_api.md b/proposals/2021-01-29_citation_api.md
index 3805dcac..6379da09 100644
--- a/proposals/2021-01-29_citation_api.md
+++ b/proposals/2021-01-29_citation_api.md
@@ -212,7 +212,7 @@ would make "outbound" queries a trivial key lookup, instead of a query by
 rows would be returned, with unwanted metadata.
 
 Another alternative design would be storing more metadata about source and
-target in each row. This would remove the ned to do separate
+target in each row. This would remove the need to do separate
 "hydration"/"enrich" fetches. This would probably blow up in the index size
 though, and would require more aggressive re-indexing (in a live-updated
 scenario). Eg, when a new fulltext file is updated (access option), would need
diff --git a/proposals/README.md b/proposals/README.md
index 5e6747b1..31184fe3 100644
--- a/proposals/README.md
+++ b/proposals/README.md
@@ -6,6 +6,6 @@ is large enough to require planning and documentation.
 Each should be tagged with a date first drafted, and labeled with a status:
 
 - brainstorm: just putting ideas down; might not even happen
-- planned: commited to happening, but not started yet
+- planned: committed to happening, but not started yet
 - work-in-progress: currently being worked on
 - implemented: completed, merged to master/production/live