Merge branch 'bnewbold-cleanups-nov2021' into 'master'

Fatcat metadata cleanups/fixups, November 2021

Three cleanups implemented in this branch:

- update non-lowercase DOIs on releases (couple hundred thousand entities)
- fix incorrectly imported file/release pairs, on the file entity side (~250k entities)
- expand truncated wayback URL timestamps in file entities (up to 10 million entities)

Instead of proposals, there are documents for each cleanup in `notes/cleanups/`.

Have done spot testing of tens of thousands of entities each in QA, and confident about running in production.

Plan is to run updates in the order above. DOI and bugfix updates will go fairly fast; the wayback timestamp updates will go slower, and result in large re-indexing load both in fatcat and scholar, because both release and work entities will get triggered for update when file entities are updated.
author: bnewbold <bnewbold@archive.org> 2021-11-11 01:11:49 +0000
committer: bnewbold <bnewbold@archive.org> 2021-11-11 01:11:49 +0000
commit: 7e3f91f1a49ea85707cae31125021ba761f5373d (patch)
tree: 34c482d15821765ffd7a27f6f049c320a2bf4b2a /notes/cleanups/case_sensitive_dois.md
parent: b6d228b7171252c8f9f70194c09aba0ed0c55567 (diff)
parent: cd09c6d6bd4deef0627de4f8a8a301725db01e14 (diff)
1 files changed, 71 insertions, 0 deletions
diff --git a/notes/cleanups/case_sensitive_dois.md b/notes/cleanups/case_sensitive_dois.md
new file mode 100644
index 00000000..1bf1901e
--- /dev/null
+++ b/notes/cleanups/case_sensitive_dois.md
@@ -0,0 +1,71 @@
+
+Relevant github issue: https://github.com/internetarchive/fatcat/issues/83
+
+How many existing fatcat releases have a non-lowercase DOI? As of June 2021:
+
+    zcat release_extid.tsv.gz | cut -f3 | rg '[A-Z]' | pv -l | wc -l
+    139964
+
+## Prep
+
+    wget https://archive.org/download/fatcat_bulk_exports_2021-11-05/release_extid.tsv.gz
+
+    # scratch:bin/fcid.py is roughly the same as `fatcat_util.py uuid2fcid`
+
+    zcat release_extid.tsv.gz \
+        | cut -f1,3 \
+        | rg '[A-Z]' \
+        | /fast/scratch/bin/fcid.py \
+        | pv -l \
+        > nonlowercase_doi_releases.tsv
+    # 140k 0:03:54 [ 599 /s]
+
+    wc -l nonlowercase_doi_releases.tsv
+    140530 nonlowercase_doi_releases.tsv
+
+Uh-oh, that is ~570 more than the June count (140530 vs. 139964). Guess those are from after the fix?
+
+Create a sample for testing:
+
+    shuf -n10000 nonlowercase_doi_releases.tsv \
+        > nonlowercase_doi_releases.10k_sample.tsv
+
+## Test in QA
+
+In pipenv:
+
+    export FATCAT_AUTH_WORKER_CLEANUP=[...]
+
+    head -n100 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \
+        | python -m fatcat_tools.cleanups.release_lowercase_doi -
+    # Counter({'total': 100, 'update': 100, 'skip': 0, 'insert': 0, 'exists': 0})
+
+    head -n100 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \
+        | python -m fatcat_tools.cleanups.release_lowercase_doi -
+    # Counter({'total': 100, 'skip-existing-doi-fine': 100, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0})
+
+    head -n2000 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \
+        | python -m fatcat_tools.cleanups.release_lowercase_doi -
+    # no such release_ident found: dcjsybvqanffhmu4dhzdnptave
+
+Presumably this fails because it is being run in QA, and the snapshot contains some newer prod releases that QA does not have.
+
+Did a quick update, and then:
+
+    head -n2000 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \
+        | python -m fatcat_tools.cleanups.release_lowercase_doi -
+    # Counter({'total': 2000, 'skip-existing-doi-fine': 1100, 'update': 898, 'skip-existing-not-found': 2, 'skip': 0, 'insert': 0, 'exists': 0})
+
+Did some spot checking in QA. Out of 20 DOIs checked, 15 were valid, 5 were not
+valid (doi.org 404). It seems like roughly 1/3 have a dupe DOI (the lower-case
+DOI exists); didn't count exact numbers.
+
+This cleanup is simple and looks good to go. Batch size of 50 is good for full
+releases.
+
+Example of parallelization:
+
+    cat /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \
+        | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.release_lowercase_doi -
+
+Ready to go!
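
For reference, the Prep filter in the notes above (`cut -f1,3 | rg '[A-Z]'`) can be sketched in Python. This is only an illustration, not the code in `fatcat_tools.cleanups.release_lowercase_doi`: the function name `nonlowercase_doi_rows` and the sample rows are hypothetical, and the column layout (release ident in column 1, DOI in column 3) is taken from the `cut` invocations in the notes. Unlike the shell pipeline, which greps the whole `ident<TAB>doi` line, this sketch checks only the DOI column.

```python
import csv
import io
import re

# ASCII uppercase only, matching the `rg '[A-Z]'` filter in the notes.
UPPER = re.compile(r"[A-Z]")


def nonlowercase_doi_rows(lines):
    """Yield (ident, doi) for rows whose DOI column contains an uppercase letter.

    Assumes tab-separated input with the release ident in column 1 and the
    DOI in column 3; rows with fewer than three columns are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue
        ident, doi = row[0], row[2]
        if UPPER.search(doi):
            yield ident, doi


# Made-up sample rows (not real fatcat data); the middle column is left empty.
sample = io.StringIO(
    "uuid-a\t\t10.1000/ABC.123\n"
    "uuid-b\t\t10.1000/abc.456\n"
    "uuid-c\t\t\n"
)
rows = list(nonlowercase_doi_rows(sample))
for ident, doi in rows:
    # Print the DOI next to its lowercased form.
    print(ident, doi, doi.lower(), sep="\t")
```

Only the first sample row is selected, because it is the only one with an uppercase letter in the DOI column. The sketch stops at selecting candidates, since the notes' counters show the real cleanup deciding between update and skip for each release.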