From fafc32e0ea1adc95eea817af7273d4c47422b364 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 24 Nov 2021 15:48:42 -0800
Subject: codepsell fixes to notes

---
 notes/UNSORTED.txt                           | 4 ++--
 notes/bulk_edits/2019-10-08_file_cleanups.md | 2 +-
 notes/bulk_edits/2020-03-19_arxiv_pubmed.md  | 2 +-
 notes/bulk_edits/2020-09-02_file_meta.md     | 2 +-
 notes/bulk_edits/2020-12-23_dblp.md          | 2 +-
 notes/bulk_edits/2020_datacite.md            | 2 +-
 notes/cleanups/wayback_timestamps.md         | 4 ++--
 notes/data_model.md                          | 4 ++--
 notes/performance/postgres_performance.txt   | 2 +-
 9 files changed, 12 insertions(+), 12 deletions(-)

(limited to 'notes')

diff --git a/notes/UNSORTED.txt b/notes/UNSORTED.txt
index 3960f5eb..850b54d0 100644
--- a/notes/UNSORTED.txt
+++ b/notes/UNSORTED.txt
@@ -3,7 +3,7 @@ Not allowed to PUT edits to the same entity in the same editgroup. If you want
 to update an edit, need to delete the old one first.
 
 The state depends only on the current entity state, not any redirect. This
-means that if the target of a redirect is delted, the redirecting entity is
+means that if the target of a redirect is deleted, the redirecting entity is
 still "redirect", not "deleted".
 
 Redirects-to-redirects are not allowed; this is enforced when the editgroup is
@@ -31,7 +31,7 @@ redirects after some delay period.
 => it would not be too hard to update get_release_files to check for such
 redirects; could be handled by request flag?
 
-`prev_rev` is naively set to the most-recent previous state. If the curent
+`prev_rev` is naively set to the most-recent previous state. If the current
 state was deleted or a redirect, it is set to null.
 
 This parameter is not checked/enforced at edit accept time (but could be, and
diff --git a/notes/bulk_edits/2019-10-08_file_cleanups.md b/notes/bulk_edits/2019-10-08_file_cleanups.md
index b61b37f0..2eebb363 100644
--- a/notes/bulk_edits/2019-10-08_file_cleanups.md
+++ b/notes/bulk_edits/2019-10-08_file_cleanups.md
@@ -5,7 +5,7 @@ web.archive.org).
 These URLs were created accidentally during fatcat boostrapping; there are
 about 300k such file enties to fix.
 
 Will also update archive.org link reltype to 'archive' (instead of
-'repository'), which is the new prefered style.
+'repository'), which is the new preferred style.
 
 Generated the set of files to update like:
diff --git a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
index b2fd29d5..56e88880 100644
--- a/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
+++ b/notes/bulk_edits/2020-03-19_arxiv_pubmed.md
@@ -1,7 +1,7 @@
 On 2020-03-20, automated daily harvesting and importing of arxiv and pubmed
 metadata started. In the case of pubmed, updates are enabled, so that recently
-created DOI releases get updated with PMID and extra metdata.
+created DOI releases get updated with PMID and extra metadata.
 
 We also want to do last backfills of metadata since the last import up through
 the first day updated by the continuous harvester.
diff --git a/notes/bulk_edits/2020-09-02_file_meta.md b/notes/bulk_edits/2020-09-02_file_meta.md
index 35c4d87f..b0606f2d 100644
--- a/notes/bulk_edits/2020-09-02_file_meta.md
+++ b/notes/bulk_edits/2020-09-02_file_meta.md
@@ -25,7 +25,7 @@ Partial wayback URL timestamps, for cases where we have the full timestamped URL
 https://qa.fatcat.wiki/file/k73il3k5hzemtnkqa5qyorg6ci
 https://qa.fatcat.wiki/file/7hstlrabfjb6vgyph7ntqtpkne
 
-Live-web URLs identical except for http/https flip or other trival things (much less frequent case):
+Live-web URLs identical except for http/https flip or other trivial things (much less frequent case):
 
 http://eo1.gsfc.nasa.gov/new/validationReport/Technology/JoeCD/asner_etal_PNAS_20041.pdf
 https://eo1.gsfc.nasa.gov/new/validationReport/Technology/JoeCD/asner_etal_PNAS_20041.pdf
diff --git a/notes/bulk_edits/2020-12-23_dblp.md b/notes/bulk_edits/2020-12-23_dblp.md
index c3ad0587..a33411cb 100644
--- a/notes/bulk_edits/2020-12-23_dblp.md
+++ b/notes/bulk_edits/2020-12-23_dblp.md
@@ -52,4 +52,4 @@ Run import:
 => Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 3097418, 'skip-key-type': 2640968, 'skip-update': 2480449, 'exists': 943800, 'update': 889700, 'insert': 338842, 'skip-arxiv-corr': 312872, 'exists-fuzzy': 203103, 'skip-dblp-container-missing': 143578, 'skip-arxiv': 53, 'skip-title': 1})
 
 Starting database size (roughly): Size: 684.08G
-Ending databse size: Size: 690.22G
+Ending database size: Size: 690.22G
diff --git a/notes/bulk_edits/2020_datacite.md b/notes/bulk_edits/2020_datacite.md
index 005841ae..05d09517 100644
--- a/notes/bulk_edits/2020_datacite.md
+++ b/notes/bulk_edits/2020_datacite.md
@@ -54,7 +54,7 @@ Compare with `--lang-detect`:
 user 3m5.620s
 sys 0m13.344s
 
-Not noticable?
+Not noticeable?
 
 Whole run:
diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md
index e3ea942d..9db77058 100644
--- a/notes/cleanups/wayback_timestamps.md
+++ b/notes/cleanups/wayback_timestamps.md
@@ -1,6 +1,6 @@
-At some point, using the arabesque importer (from targetted crawling), we
-accidentially imported a bunch of files with wayback URLs that have 12-digit
+At some point, using the arabesque importer (from targeted crawling), we
+accidentally imported a bunch of files with wayback URLs that have 12-digit
 timestamps, instead of the full canonical 14-digit timestamps.
diff --git a/notes/data_model.md b/notes/data_model.md
index 2d2825ae..f13e33cc 100644
--- a/notes/data_model.md
+++ b/notes/data_model.md
@@ -87,12 +87,12 @@ Each entity type has tables:
 core representation of a version of the entity
 
 _ident
-persistant, external identifier
+persistent, external identifier
 allows merging, unmerging, stable cross-entity references
 
 _edit
 represents change metadata for a single change to one ident
-needed because an edit alwasy changes ident, but might not change rev
+needed because an edit always changes ident, but might not change rev
 
 Could someday also have:
diff --git a/notes/performance/postgres_performance.txt b/notes/performance/postgres_performance.txt
index cd2a5162..ff8fcb3b 100644
--- a/notes/performance/postgres_performance.txt
+++ b/notes/performance/postgres_performance.txt
@@ -189,7 +189,7 @@ max_wal_size wasn't getting set correctly.
 The statements taking the most time are the complex inserts (multi-table
 inserts); they take a fraction of a second though (mean less than a
-milisecond).
+millisecond).
 
 Manifest import runs really slow if release import is concurrent; much faster
 to wait until release import is done first (like a factor of 10x or more).
--
cgit v1.2.3