author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:33:14 -0800
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:33:14 -0800
commit | c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (patch)
tree | 7d3934e4922439402f882a374fe477906fd41aae /notes
parent | ec2809ef2ac51c992463839c1e3451927f5e1661 (diff)
download | fatcat-c5ea2dba358624f4c14da0a1a988ae14d0edfd59.tar.gz fatcat-c5ea2dba358624f4c14da0a1a988ae14d0edfd59.zip
move 'cleanups' directory from notes to extra/
Diffstat (limited to 'notes')
-rw-r--r-- | notes/cleanups/case_sensitive_dois.md | 71
-rw-r--r-- | notes/cleanups/container_issnl_dedupe.md | 105
-rw-r--r-- | notes/cleanups/double_slash_dois.md | 46
-rw-r--r-- | notes/cleanups/file_meta.md | 152
-rw-r--r-- | notes/cleanups/file_release_ingest_bug.md | 192
-rw-r--r-- | notes/cleanups/file_sha1_dedupe.md | 64
-rwxr-xr-x | notes/cleanups/scripts/container_dupe_to_json.py | 55
-rw-r--r-- | notes/cleanups/scripts/fetch_full_cdx_ts.py | 201
-rwxr-xr-x | notes/cleanups/scripts/file2ingestrequest.py | 44
-rwxr-xr-x | notes/cleanups/scripts/file_dupe_to_json.py | 72
-rw-r--r-- | notes/cleanups/wayback_timestamps.md | 304
11 files changed, 0 insertions, 1306 deletions
diff --git a/notes/cleanups/case_sensitive_dois.md b/notes/cleanups/case_sensitive_dois.md deleted file mode 100644 index 1bf1901e..00000000 --- a/notes/cleanups/case_sensitive_dois.md +++ /dev/null @@ -1,71 +0,0 @@ - -Relevant github issue: https://github.com/internetarchive/fatcat/issues/83 - -How many existing fatcat releases have a non-lowercase DOI? As of June 2021: - - zcat release_extid.tsv.gz | cut -f3 | rg '[A-Z]' | pv -l | wc -l - 139964 - -## Prep - - wget https://archive.org/download/fatcat_bulk_exports_2021-11-05/release_extid.tsv.gz - - # scratch:bin/fcid.py is roughly the same as `fatcat_util.py uuid2fcid` - - zcat release_extid.tsv.gz \ - | cut -f1,3 \ - | rg '[A-Z]' \ - | /fast/scratch/bin/fcid.py \ - | pv -l \ - > nonlowercase_doi_releases.tsv - # 140k 0:03:54 [ 599 /s] - - wc -l nonlowercase_doi_releases.tsv - 140530 nonlowercase_doi_releases.tsv - -Uhoh, there are ~500 more than previously? Guess those are from after the fix? - -Create a sample for testing: - - shuf -n10000 nonlowercase_doi_releases.tsv \ - > nonlowercase_doi_releases.10k_sample.tsv - -## Test in QA - -In pipenv: - - export FATCAT_AUTH_WORKER_CLEANUP=[...] - - head -n100 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \ - | python -m fatcat_tools.cleanups.release_lowercase_doi - - # Counter({'total': 100, 'update': 100, 'skip': 0, 'insert': 0, 'exists': 0}) - - head -n100 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \ - | python -m fatcat_tools.cleanups.release_lowercase_doi - - # Counter({'total': 100, 'skip-existing-doi-fine': 100, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0}) - - head -n2000 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \ - | python -m fatcat_tools.cleanups.release_lowercase_doi - - # no such release_ident found: dcjsybvqanffhmu4dhzdnptave - -Presumably because this is being run in QA, and there are some newer prod releases in the snapshot. - -Did a quick update, and then: - - head -n2000 /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \ - | python -m fatcat_tools.cleanups.release_lowercase_doi - - # Counter({'total': 2000, 'skip-existing-doi-fine': 1100, 'update': 898, 'skip-existing-not-found': 2, 'skip': 0, 'insert': 0, 'exists': 0}) - -Did some spot checking in QA. Out of 20 DOIs checked, 15 were valid, 5 were not -valid (doi.org 404). It seems like roughly 1/3 have a dupe DOI (the lower-case -DOI exists); didn't count exact numbers. - -This cleanup is simple and looks good to go. Batch size of 50 is good for full -releases. - -Example of parallelization: - - cat /srv/fatcat/datasets/nonlowercase_doi_releases.10k_sample.tsv \ - | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.release_lowercase_doi - - -Ready to go! diff --git a/notes/cleanups/container_issnl_dedupe.md b/notes/cleanups/container_issnl_dedupe.md deleted file mode 100644 index a76bc961..00000000 --- a/notes/cleanups/container_issnl_dedupe.md +++ /dev/null @@ -1,105 +0,0 @@ - -Simply de-duplicating container entities on the basis of ISSN-L. 
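For reference, each line fed to the container merger is a JSON "group" in the shape that `container_dupe_to_json.py` (included below) prints; the idents and ISSN-L in this sample are placeholders, not real entities:

    {"duplicate_ids": ["aaaaaaaaaaaaaaaaaaaaaaaaaa", "bbbbbbbbbbbbbbbbbbbbbbbbbb"], "entity_type": "container", "evidence": {"extid": "1234-5678", "extid_type": "issnl"}, "primary_id": null}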
- -Initial plan is to: - -- only merge containers with zero (0) release entities pointing at them -- not update any containers which have had human edits -- not merge additional metadata from redirected entities to the "primary" entity - - -## Prep - -Using commands from `check_issnl.sh`: - - zcat container_export.json.gz \ - | jq '[.issnl, .ident] | @tsv' -r \ - | sort -S 4G \ - | uniq -D -w 9 \ - > issnl_ident.dupes.tsv - - wc -l issnl_ident.dupes.tsv - # 3174 issnl_ident.dupes.tsv - - cut -f1 issnl_ident.dupes.tsv | uniq | wc -l - # 835 - -Run transform script: - - cat issnl_ident.dupes.tsv | ./container_dupe_to_json.py | pv -l > container_issnl_dupes.json - -Create a small random sample: - - shuf -n100 container_issnl_dupes.json > container_issnl_dupes.sample.json - -## QA Testing - - git log | head -n1 - # commit e72d61e60c43911b6d77c4842951441235561dcf - - export FATCAT_AUTH_API_TOKEN=[...] - - head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \ - | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers - - -Got various errors and patched them: - - AttributeError: 'EntityHistoryEntry' object has no attribute 'editor' - - requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: https://fatcat.wiki/container/%7Bident%7D/stats.json - - fatcat_openapi_client.exceptions.ApiValueError: Missing the required parameter `editgroup_id` when calling `accept_editgroup` - -Run again: - - head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \ - | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers - - # Running in dry-run mode! - # Counter({'updated-entities': 96, 'skip-container-release-count': 84, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0}) - -Finally! dry-run mode actually worked. Try entire sample in dry-run: - - cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \ - | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers - - # Running in dry-run mode! - # Counter({'updated-entities': 310, 'skip-container-release-count': 251, 'lines': 100, 'merged': 100, 'skip': 0, 'updated-total': 0}) - -How about a small `max-container-releases`: - - cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \ - | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" --dry-run merge-containers - - # Running in dry-run mode! - # Counter({'updated-entities': 310, 'skip-container-release-count': 251, 'lines': 100, 'merged': 100, 'skip': 0, 'updated-total': 0}) - -Exact same count... maybe something isn't working? Debugged and fixed it. - - requests.exceptions.HTTPError: 503 Server Error: SERVICE UNAVAILABLE for url: https://fatcat.wiki/container/xn7i2sdijzbypcetz77kttj76y/stats.json - - # Running in dry-run mode! - # Counter({'updated-entities': 310, 'lines': 100, 'merged': 100, 'skip-container-release-count': 92, 'skip': 0, 'updated-total': 0}) - -From skimming, it looks like 100 is probably a good cut-off. There are sort of -a lot of these dupes! 
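One lesson from the dry-run bugs hit above: the dry-run guard needs to sit directly in front of the editgroup accept call. A minimal sketch of that pattern; `accept_editgroup` is the real client method named in the error above, but the `api` object and flag wiring here are assumptions:

    def maybe_accept(api, editgroup_id: str, dry_run: bool) -> None:
        # in dry-run mode, report what would happen but never accept the editgroup
        if dry_run:
            print(f"dry-run: would accept editgroup {editgroup_id}")
            return
        api.accept_editgroup(editgroup_id)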
- -Try some actual merges: - - head -n25 /srv/fatcat/datasets/container_issnl_dupes.sample.json \ - | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers - - # Counter({'updated-entities': 96, 'skip-container-release-count': 84, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0}) - -Run immediately again: - - # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'skip-container-release-count': 2, 'merged': 0, 'updated-total': 0}) - -Run all the samples, with limit of 100 releases: - - cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \ - | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers - --max-container-releases 100 - # Counter({'updated-entities': 214, 'lines': 100, 'merged': 75, 'skip': 25, 'skip-not-active-entity': 25, 'skip-container-release-count': 15, 'updated-total': 0}) - -Wow, there are going to be a lot of these containers not merged because they -have so many releases! Will have to do a second, more carefully reviewed (?) -round of merging. - -Unfortunately, not seeing any human-edited container entities here to check if -that filter is working. diff --git a/notes/cleanups/double_slash_dois.md b/notes/cleanups/double_slash_dois.md deleted file mode 100644 index d4e9ded6..00000000 --- a/notes/cleanups/double_slash_dois.md +++ /dev/null @@ -1,46 +0,0 @@ - -Relevant github issue: https://github.com/internetarchive/fatcat/issues/48 - - -## Investigate - -At least some of these DOIs actually seem valid, like -`10.1026//1616-1041.3.2.86`. So shouldn't be re-writing them! - - zcat release_extid.tsv.gz \ - | cut -f1,3 \ - | rg '\t10\.\d+//' \ - | wc -l - # 59,904 - - zcat release_extid.tsv.gz \ - | cut -f1,3 \ - | rg '\t10\.\d+//' \ - | pv -l \ - > doubleslash_dois.tsv - -Which prefixes have the most double slashes? - - cat doubleslash_dois.tsv | cut -f2 | cut -d/ -f1 | sort | uniq -c | sort -nr | head - 51220 10.1037 - 2187 10.1026 - 1316 10.1024 - 826 10.1027 - 823 10.14505 - 443 10.17010 - 186 10.46925 - 163 10.37473 - 122 10.18376 - 118 10.29392 - [...] - -All of the 10.1037 DOIs seem to be registered with Crossref, and at least some -have redirects to the not-with-double-slash versions. Not all doi.org lookups -include a redirect. - -I think the "correct thing to do" here is to add special-case handling for the -pubmed and crossref importers, and in any other case allow double slashes. - -Not clear that there are any specific cleanups to be done for now. A broader -"verify that DOIs are actually valid" push and cleanup would make sense; if -that happens, checking for mangled double-slash DOIs would make sense. diff --git a/notes/cleanups/file_meta.md b/notes/cleanups/file_meta.md deleted file mode 100644 index d99e821e..00000000 --- a/notes/cleanups/file_meta.md +++ /dev/null @@ -1,152 +0,0 @@ - -Over 500k file entities still lack complete metadata, such as SHA-256 -checksums and verified mimetypes. - -Presumably these also lack GROBID processing. It seems that most or all of -these are simply wayback captures with no CDX metadata in sandcrawler-db, so -they didn't get updated in prior cleanups. - -Current plan, re-using existing tools and processes, is to: - -1. create stub ingest requests containing file idents -2.
process them "locally" on a large VM, in 'bulk' mode; writing output to stdout but using regular grobid and pdfextract "sinks" to Kafka -3. transform ingest results to a form for existing `file_meta` importer -4. run imports - -The `file_meta` importer requires just the `file_meta` dict from sandcrawler. - -## Prep - - zcat file_hashes.tsv.gz | pv -l | rg '\t\t' | wc -l - # 521,553 - - zcat file_export.json.gz \ - | rg -v '"sha256":' \ - | pv -l \ - | pigz \ - > files_missing_sha256.json.gz - # 521k 0:10:21 [ 839 /s] - -Want ingest requests with: - - base_url: str - ingest_type: "pdf" - link_source: "fatcat" - link_source_id: file ident (with "file_" prefix) - ingest_request_source: "file-backfill" - ext_ids: - sha1: str - -Use `file2ingestrequest.py` helper: - - zcat files_missing_sha256.json.gz \ - | ./file2ingestrequest.py \ - | pv -l \ - | pigz \ - > files_missing_sha256.ingest_request.json.gz - # 519k 0:00:19 [26.5k/s] - -So about 2k filtered out, will investigate later. - - zcat files_missing_sha256.ingest_request.json.gz \ - | shuf -n1000 \ - > files_missing_sha256.ingest_request.sample.json - - head -n100 files_missing_sha256.ingest_request.sample.json | ./ingest_tool.py requests --no-spn2 - > sample_results.json - 4 "no-capture" - 1 "no-pdf-link" - 95 "success" - -Seems like this is going to be a good start, but will need iteration. - -Dev testing: - - head files_missing_sha256.ingest_request.sample.json \ - | ./ingest_tool.py file-requests-backfill - --kafka-env qa --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 \ - > out_sample.json - - -## Commands - -Production warm-up: - - cat /srv/sandcrawler/tasks/files_missing_sha256.ingest_request.sample.json \ - | ./ingest_tool.py file-requests-backfill - --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --grobid-host http://localhost:8070 \ - > /srv/sandcrawler/tasks/files_missing_sha256.ingest_results.sample.json - -Production parallel run: - - zcat /srv/sandcrawler/tasks/files_missing_sha256.ingest_request.json \ - | parallel -j24 --linebuffer --round-robin --pipe ./ingest_tool.py file-requests-backfill - --kafka-env qa --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --grobid-host http://localhost:8070 \ - > /srv/sandcrawler/tasks/files_missing_sha256.ingest_results.json - -Filter and select file meta for import: - - head files_missing_sha256.ingest_results.json \ - | rg '"sha256hex"' \ - | jq 'select(.request.ext_ids.sha1 == .file_meta.sha1hex) | .file_meta' -c \ - > files_missing_sha256.file_meta.json - # Worker: Counter({'total': 20925, 'success': 20003, 'no-capture': 545, 'link-loop': 115, 'wrong-mimetype': 104, 'redirect-loop': 46, 'wayback-error': 25, 'null-body': 20, 'no-pdf-link': 18, 'skip-url-blocklist': 17, 'terminal-bad-status': 16, 'cdx-error': 9, 'wayback-content-error': 4, 'blocked-cookie': 3}) - # [etc] - - -Had some GROBID issues, so are not going to be able to get everything in first -pass. Merge our partial results, as just `file_meta`: - - cat files_missing_sha256.ingest_results.batch1.json files_missing_sha256.ingest_results.json \ - | jq .file_meta -c \ - | rg '"sha256hex"' \ - | pv -l \ - > files_missing_sha256.file_meta.json - # 386k 0:00:41 [9.34k/s] - -A bunch of these will need to be re-run once GROBID is in a healthier place. 
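Since the importer should see exactly one `file_meta` record per SHA-1, here is a small Python sketch of the dedupe step; the commands below get the same effect with `sort -u`, which collapses byte-identical lines:

    import json, sys

    # keep only the first file_meta record seen for each sha1hex
    seen = set()
    for line in sys.stdin:
        if not line.strip():
            continue
        sha1hex = json.loads(line)["sha1hex"]
        if sha1hex not in seen:
            seen.add(sha1hex)
            sys.stdout.write(line)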
- -Check that we don't have (many) dupes: - - cat files_missing_sha256.file_meta.json \ - | jq .sha1hex -r \ - | sort \ - | uniq -D \ - | wc -l - # 86520 - -Huh, seems like a weirdly large number. Maybe related to re-crawling? Will need -to dedupe by sha1hex. - -Check how many dupes in original: - - zcat files_missing_sha256.ingest_request.json.gz | jq .ext_ids.sha1 -r | sort | uniq -D | wc -l - -That lines up with dupes expected before SHA-1 de-dupe run. - - cat files_missing_sha256.file_meta.json \ - | sort -u -S 4G \ - | pv -l \ - > files_missing_sha256.file_meta.uniq.json - - cat files_missing_sha256.file_meta.uniq.json \ - | jq .sha1hex -r \ - | sort \ - | uniq -D \ - | wc -l - # 0 - -Have seen a lot of errors like: - - %4|1637808915.562|TERMINATE|rdkafka#producer-1| [thrd:app]: Producer terminating with 1 message (650 bytes) still in queue or transit: use flush() to wait for outstanding message delivery - -TODO: add manual `finish()` calls on sinks in tool `run` function - -## QA Testing - - export FATCAT_API_AUTH_TOKEN... # sandcrawler-bot - - cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json \ - | ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta - - # Counter({'total': 1000, 'update': 503, 'skip-existing-complete': 403, 'skip-no-match': 94, 'skip': 0, 'insert': 0, 'exists': 0}) - - head -n1000 /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json \ - | parallel -j8 --round-robin --pipe -q ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta - - # Counter({'total': 1000, 'update': 481, 'skip-existing-complete': 415, 'skip-no-match': 104, 'skip': 0, 'insert': 0, 'exists': 0}) - diff --git a/notes/cleanups/file_release_ingest_bug.md b/notes/cleanups/file_release_ingest_bug.md deleted file mode 100644 index d818905b..00000000 --- a/notes/cleanups/file_release_ingest_bug.md +++ /dev/null @@ -1,192 +0,0 @@ - -We want to find cases where the `file_edit` metadata for a file entity does not match the DOI of the release it is linked to. - -## Background - - -Seems to mostly happen when the `link_source_id` DOI is not found during a -fatcat 'lookup', eg due to upstream DOI metadata not including a 'title' field -(required by fatcat schema). Many of these seem to be 'journal-issue' or -something like that. Presumably upstream (unpaywall) crawled and found some PDF -for the DOI and indexes that. - -TODO: what was the specific code bug that caused this? time window of edits? -https://github.com/internetarchive/fatcat/commit/d8cf02dbdc6848be4de2710e54f4463d702b6e7c - -TODO: how many empty-title Crossref DOIs are there? perhaps we should import these after all, in some situations? eg, pull `html_biblio`? -TODO: examples of other mismatches? 
edits from the same timespan with mismatching identifiers - -## Queries - - file_edit - ident_id - rev_id - extra_json->>'link_source_id'::text - file_rev_release - file_rev - target_release_ident_id - release_ident - id - rev_id - release_rev - id - doi - -Query some samples: - - SELECT file_edit.ident_id as file_ident, release_ident.id as release_ident, file_edit.extra_json->>'link_source_id' as file_edit_doi, release_rev.doi as release_doi - FROM file_edit - LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev - LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id - LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id - WHERE - file_edit.extra_json->>'ingest_request_source' = 'unpaywall' - AND release_rev.doi != file_edit.extra_json->>'link_source_id' - LIMIT 20; - - file_ident | release_ident | file_edit_doi | release_doi - --------------------------------------+--------------------------------------+---------------------------------------+------------------------------------- - 8fe6eb2a-da3a-4e1c-b281-85fad3cae601 | 0a0cbabd-3f04-4296-94d5-b61c27fddee4 | 10.7167/2013/520930/dataset/3 | 10.7167/2013/520930 - cbb2d4e7-a9a7-41df-be1d-faf9dd552ee1 | d7b356c7-de6b-4885-b37d-ea603849a30a | 10.1525/elementa.124.t1 | 10.1525/elementa.124 - 26242648-2a7b-42ea-b2ca-e4fba8f0a2c7 | d2647e07-fbb3-47f5-b06a-de2872bf84ec | 10.1023/a:1000301702170 | 10.1007/bfb0085374 - f774854a-3939-4aa9-bf73-7e0573908b88 | e19ce792-422d-4feb-92ba-dd8536a60618 | 10.32835/2223-5752.2018.17.14-21 | 10.31652/2412-1142-2018-50-151-156 - 615ddf8c-2d91-4c1c-b04a-76b69f07b300 | e7af8377-9542-4711-9444-872f45521dc1 | 10.5644/sjm.10.1.00 | 10.5644/sjm.10.1.01 - 43fa1b53-eddb-4f31-96af-bc9bd3bb31b6 | e1ec5d8a-ff88-49f4-8540-73ee4906e119 | 10.1136/bjo.81.11.934 | 10.1038/sj.bjc.6690790 - 64b04cbc-ab3d-4cef-aff5-9d6b2532c47a | 43c387d1-e4c9-49f3-9995-552451a9a246 | 10.1023/a:1002015330987 | 10.1006/jnth.1998.2271 - 31e6e1a8-8457-4c93-82d3-12a51c0dc1bb | a2ec4305-63d2-4164-8470-02bf5c4ba74c | 10.1186/1742-4690-2-s1-s84 | 10.1038/nm0703-847 - 3461024a-1800-44a0-a7be-73f14bac99dd | 6849d39f-6bae-460b-8a9d-d6b86fbac2d3 | 10.17265/2328-2150/2014.11 | 10.1590/s0074-02762002000900009 - c2707e43-6a4b-4f5d-8fb4-babe16b42b4d | 82127faf-d125-4cd4-88b9-9a8618392019 | 10.1055/s-005-28993 | 10.1055/s-0034-1380711 - 0fe72304-1884-4dde-98c6-7cf3aec5bf31 | e3fe46cd-0205-42e4-8255-490e0eba38ea | 10.1787/5k4c1s0z2gs1-en | 10.1787/9789282105931-2-en - 686a1508-b6c5-4060-9035-1dd8a4b85be1 | dcecd03d-e0f6-40ce-a376-7e64bd90bdca | 10.15406/jlprr.2.3 | 10.15406/jlprr.2015.02.00039 - 9403bc0e-2a50-4c8a-ab79-943bce20e583 | 3073dee2-c8f7-42ed-a727-12306d86fa35 | 10.1787/5jrtgzfl6g9w-en | 10.1111/1468-2443.00004 - 9a080a6f-aaaf-4f4b-acab-3eda2bfacc22 | af28a167-2ddc-43db-98ca-1cc29b863f9f | 10.13040/ijpsr.0975-8232.4(6).2094-05 | 10.1002/chin.201418290 - 810f1a00-623f-4a5c-88e2-7240232ae8f9 | 639d4ddd-ef24-49dc-8d4e-ea626b0b85f8 | 10.13040/ijpsr.0975-8232.5(6).2216-24 | 10.1006/phrs.1997.0184 - 10.13040/IJPSR.0975-8232.5(6).2216-24 - 72752c59-9532-467d-8b4f-fe4d994bcff5 | cc3db36c-27e9-45bf-96b0-56d1119b93d6 | 10.22201/facmed.2007865xe.2018.26 | 10.22201/facmed.2007865x.2018.26.01 - a0740fc2-a1db-4bc8-ac98-177ccebde24f | 0a7136ac-86ad-4636-9c2a-b56a6d5a3a27 | 10.24966/cmph-1978/vol6iss1 | 10.1093/milmed/146.4.283 - 5b05df6a-7a0e-411b-a4e1-0081d7522673 | dc0cb585-c709-4990-a82e-c7c9b254fc74 | 10.15406/jig.3.1 | 10.15406/jig.2016.03.00039 - 011964dd-24a6-4d25-b9dd-921077f1947e | 
deabab6d-7381-4567-bf56-5294ab081e20 | 10.17265/1537-1506/2013.01 | 10.1109/icsssm.2016.7538527 - 4d7efa59-30c7-4015-bea9-d7df171a97ed | a05449a0-34d7-4f45-9b92-bcf71db4918d | 10.14413/herj.2017.01.09 | 10.14413/herj.2017.01.09. - (20 rows) - -Total counts, broken DOIs, by source: - - SELECT file_edit.extra_json->>'ingest_request_source' as source, COUNT(*) as broken_files - FROM file_edit - LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev - LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id - LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id - WHERE - file_edit.extra_json->>'link_source_id' IS NOT NULL - AND file_edit.extra_json->>'link_source_id' LIKE '10.%' - AND release_rev.doi != file_edit.extra_json->>'link_source_id' - GROUP BY file_edit.extra_json->>'ingest_request_source'; - - - source | broken_files - ------------------+-------------- - fatcat-changelog | 872 - unpaywall | 227954 - (2 rows) - -## Cleanup Pseudocode - -Dump file idents from above SQL queries. Transform from UUID to fatcat ident. - -For each file ident, fetch file entity, with releases expanded. Also fetch edit -history. Filter by: - -- single release for file -- single edit in history, matching broken agent -- edit metadata indicates a DOI, which doesn't match release DOI (or release doesn't have DOI?) - -If all that is the case, do a lookup by the correct DOI. If there is a match, -update the file entity to point to that release (and only that release). -Operate in batches. - -Write cleanup bot, run in QA to test. Run SQL query again and iterate if some -files were not updated. Merge to master, run full dump and update in prod. - -## Export Matches Needing Fix - - COPY ( - SELECT row_to_json(row.*) FROM ( - SELECT file_edit.ident_id as file_ident, release_ident.id as wrong_release_ident, file_edit.extra_json as edit_extra - FROM file_edit - LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev - LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id - LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id - WHERE - file_edit.extra_json->>'link_source_id' IS NOT NULL - AND file_edit.extra_json->>'link_source_id' LIKE '10.%' - AND release_rev.doi != file_edit.extra_json->>'link_source_id' - -- LIMIT 20; - ) as row - ) TO '/srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json'; - # COPY 228,827 - -Then need to do a `sed` pass, and a `rg` pass, to filter out bad escapes: - - cat /srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json \ - | sed 's/\\"/\"/g' \ - | rg -v "\\\\" \ - | rg '^\{' \ - | pv -l \ - > /srv/fatcat/snapshots/file_release_bugfix_20211105.json - # 228k 0:00:00 [ 667k/s] - - wc -l /srv/fatcat/snapshots/file_release_bugfix_20211105.json - # 228,826 - -And create a sample file: - - shuf -n10000 /srv/fatcat/snapshots/file_release_bugfix_20211105.json > /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json - - -## Testing in QA - - export FATCAT_AUTH_WORKER_CLEANUP=[...] - - head -n10 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | python -m fatcat_tools.cleanups.file_release_bugfix - - # Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0}) - - file_wmjzuybuorfrjdgnr2d32vg5va: was wrong, now correct, no other files - file_wgszcwehlffnzmoypr4l2yhvza: was correct, now correct, no other files. multiple articles in single PDF - file_p4r5sbminzgrhn4yaiqyr7ahwi: was correct, now wrong (!!!) 
- doi:10.1055/s-0036-1579844 - PDF says: 10.4103/0028-3886.158210 - unpaywall still has this incorrect linkage - file_n5jtvrnodbfdbccl5d6hshhvw4: now stub, was wrong - doi:10.19080/bboaj.2018.04.555640 release_suvtcm7hdbbr3fczcxxt4thgoi seems correct? - doi:10.19080/bboaj.4.4 is the bad DOI, not in fatcat. is registered, as an entire issue - file_kt5fv4d5lrbk7j3dxxgcm7hph4: was wrong, now correct - file_jq4juugnynhsdkh3whkcjod46q: was wrong, now correct - file_ef7w6y7k4jhjhgwmfai37yjjmm: was wrong, now correct - file_ca2svhd6knfnff4dktuonp45du: was correct, now correct. complicated, multiple DOIs/copies/files of same work - file_borst2aewvcwzhucnyth2vf3lm: was correct, now correct. complicated, multiple DOIs, single file of same work - -Overall, seems like this might not be as much of an obvious improvement as -hoped! But still progress and more correct. - - head -n1000 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | python -m fatcat_tools.cleanups.file_release_bugfix - - # Counter({'total': 1000, 'update': 929, 'skip-existing-history-updated': 58, 'skip-existing-fixed': 10, 'skip': 3, 'skip-link-source': 3, 'insert': 0, 'exists': 0}) - -Looking at `skip-link-source`, it is cases where `link_source` is 'doi' not -'fatcat-changelog'. Will update filter behavior, 'fatcat-changelog' is a -`ingest_request_source`. - -Checking another 10 examples. They all seem to end up as correct matches. - -Did a small update and running the whole batch: - - cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | python -m fatcat_tools.cleanups.file_release_bugfix - - # Counter({'total': 10000, 'update': 8499, 'skip-existing-fixed': 939, 'skip-existing-history-updated': 560, 'skip': 2, 'skip-wrong-release-is-ok': 2, 'insert': 0, 'exists': 0}) - -I think this is ready to go! Example with parallel: - - cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_release_bugfix - - diff --git a/notes/cleanups/file_sha1_dedupe.md b/notes/cleanups/file_sha1_dedupe.md deleted file mode 100644 index 0829bc79..00000000 --- a/notes/cleanups/file_sha1_dedupe.md +++ /dev/null @@ -1,64 +0,0 @@ - - -## Prep - -Using `check_hashes.sh`: - - zcat $HASH_FILE \ - | awk '{print $3 "\t" $1}' \ - | rg -v '^\t' \ - | sort -S 4G \ - | uniq -D -w 40 \ - > sha1_ident.dupes.tsv - - wc -l sha1_ident.dupes.tsv - # 6,350 - - cut -f1 sha1_ident.dupes.tsv | uniq | wc -l - # 2,039 - -Want to create JSON for each group, like: - - entity_type: "file" - primary_id: str or None - duplicate_ids: [str] - evidence: - extid: str - extid_type: "sha1" - -Run transform script: - - cat sha1_ident.dupes.tsv | ./file_dupe_to_json.py | pv -l > file_sha1_dupes.json - # 2.04k 0:00:00 [9.16k/s] - - -## QA Testing - - export FATCAT_AUTH_API_TOKEN=[...] - - head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \ - | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files - - -Hit some small bugs running in QA; test coverage isn't great, but I think hits -the important parts. - - head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \ - | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files - - # Running in dry-run mode! 
- # Counter({'updated-entities': 60, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0}) - -Dry-run mode didn't actually work, and edits actually happened (!). - -Edits do look good. - -Try again, not dry-run, to ensure that case is handled: - - head -n25 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files - - # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'merged': 0, 'updated-total': 0}) - -And then run 500 through for more testing: - - head -n500 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files - - # Counter({'updated-entities': 1341, 'lines': 500, 'merged': 474, 'skip': 26, 'skip-not-active-entity': 25, 'skip-entity-not-found': 1, 'updated-total': 0}) - -The majority of merges seem to be cases where there are multiple articles in the same PDF. diff --git a/notes/cleanups/scripts/container_dupe_to_json.py b/notes/cleanups/scripts/container_dupe_to_json.py deleted file mode 100755 index 2e841c69..00000000 --- a/notes/cleanups/scripts/container_dupe_to_json.py +++ /dev/null @@ -1,55 +0,0 @@ -#!/usr/bin/env python3 - -""" -This script can be used to transform duplicate container entity rows into JSON -objects which can be passed to the container entity merger. - -It is initially used to de-dupe ISSN-Ls. The script is based on -`file_dupe_to_json.py`. -""" - -import json, sys -from typing import Optional - -EXTID_TYPE = "issnl" - - -def print_group(extid, dupe_ids): - if len(dupe_ids) < 2: - return - group = dict( - entity_type="container", - primary_id=None, - duplicate_ids=dupe_ids, - evidence=dict( - extid=extid, - extid_type=EXTID_TYPE, - ), - ) - print(json.dumps(group, sort_keys=True)) - -def run(): - last_extid = None - dupe_ids = [] - for l in sys.stdin: - l = l.strip() - if not l: - continue - (row_extid, row_id) = l.split("\t")[0:2] - if EXTID_TYPE == "issnl": - assert len(row_extid) == 9 - else: - raise Exception(f"extid type not supported yet: {EXTID_TYPE}") - if row_extid == last_extid: - dupe_ids.append(row_id) - continue - elif dupe_ids: - print_group(last_extid, dupe_ids) - last_extid = row_extid - dupe_ids = [row_id] - if last_extid and dupe_ids: - print_group(last_extid, dupe_ids) - - -if __name__=="__main__": - run() diff --git a/notes/cleanups/scripts/fetch_full_cdx_ts.py b/notes/cleanups/scripts/fetch_full_cdx_ts.py deleted file mode 100644 index ebcf0d62..00000000 --- a/notes/cleanups/scripts/fetch_full_cdx_ts.py +++ /dev/null @@ -1,201 +0,0 @@ -#!/usr/bin/env python3 - -import sys -import json -import base64 -from typing import Optional, List - -import requests -from requests.adapters import HTTPAdapter -from requests.packages.urllib3.util.retry import Retry # pylint: disable=import-error - -def requests_retry_session( - retries: int = 10, - backoff_factor: int = 3, - status_forcelist: List[int] = [500, 502, 504], - session: requests.Session = None, -) -> requests.Session: - """ - From: https://www.peterbe.com/plog/best-practice-with-retries-with-requests - """ - session = session or requests.Session() - retry = Retry( - total=retries, - read=retries, - connect=retries, - backoff_factor=backoff_factor, - status_forcelist=status_forcelist, - ) - adapter = HTTPAdapter(max_retries=retry) - session.mount("http://", adapter) - session.mount("https://", adapter) - return 
session - -def b32_hex(s: str) -> str: - """ - Converts a base32-encoded SHA-1 checksum into hex-encoded - - base32 checksums are used by, eg, heritrix and in wayback CDX files - """ - s = s.strip().split()[0].lower() - if s.startswith("sha1:"): - s = s[5:] - if len(s) != 32: - if len(s) == 40: - return s - raise ValueError("not a base-32 encoded SHA-1 hash: {}".format(s)) - return base64.b16encode(base64.b32decode(s.upper())).lower().decode("utf-8") - - -SANDCRAWLER_POSTGREST_URL = "http://wbgrp-svc506.us.archive.org:3030" - -def get_db_cdx(url: str, http_session) -> List[dict]: - resp = http_session.get(SANDCRAWLER_POSTGREST_URL + "/cdx", params=dict(url="eq." + url)) - resp.raise_for_status() - rows = resp.json() - return rows or [] - -CDX_API_URL = "https://web.archive.org/cdx/search/cdx" - -def get_api_cdx(url: str, partial_dt: str, http_session) -> Optional[dict]: - - params = { - "url": url, - "from": partial_dt, - "to": partial_dt, - "matchType": "exact", - "output": "json", - "limit": 20, - # can't filter status because might be warc/revisit - #"filter": "statuscode:200", - } - resp = http_session.get(CDX_API_URL, params=params) - resp.raise_for_status() - rows = resp.json() - - if not rows: - return None - #print(rows, file=sys.stderr) - if len(rows) < 2: - return None - - for raw in rows[1:]: - record = dict( - surt=raw[0], - datetime=raw[1], - url=raw[2], - mimetype=raw[3], - status_code=raw[4], - sha1b32=raw[5], - sha1hex=b32_hex(raw[5]), - ) - if record['url'] != url: - # TODO: could allow HTTP/HTTPS fuzzy match - print("CDX API near match: URL", file=sys.stderr) - continue - if not record['datetime'].startswith(partial_dt): - print(f"CDX API near match: datetime {partial_dt} {record['datetime']}", file=sys.stderr) - continue - if record['status_code'] == "200" or (record['status_code'] == '-' and record['mimetype'] == 'warc/revisit'): - return record - else: - print(f"CDX API near match: status {record['status_code']}", file=sys.stderr) - return None - -def process_file(fe, session) -> dict: - short_urls = [] - self_urls = dict() - full_urls = dict() - status = "unknown" - - for pair in fe['urls']: - u = pair['url'] - if not '://web.archive.org/web/' in u: - continue - seg = u.split('/') - assert seg[2] == "web.archive.org" - assert seg[3] == "web" - if not seg[4].isdigit(): - continue - original_url = "/".join(seg[5:]) - if len(seg[4]) == 12 or len(seg[4]) == 4: - short_urls.append(u) - elif len(seg[4]) == 14: - self_urls[original_url] = u - else: - print(f"other bogus ts: {seg[4]}", file=sys.stderr) - return dict(file_entity=fe, full_urls=full_urls, status="fail-bogus-ts") - - if len(short_urls) == 0: - return dict(file_entity=fe, full_urls=[], status="skip-no-shorts") - - for short in list(set(short_urls)): - seg = short.split('/') - ts = seg[4] - assert len(ts) in [12,4] and ts.isdigit() - original_url = '/'.join(seg[5:]) - - if short in full_urls: - continue - - if original_url in self_urls and ts in self_urls[original_url]: - full_urls[short] = self_urls[original_url] - status = "success-self" - continue - - cdx_row_list = get_db_cdx(original_url, http_session=session) - for cdx_row in cdx_row_list: - if cdx_row['sha1hex'] == fe['sha1'] and cdx_row['url'] == original_url and cdx_row['datetime'].startswith(ts): - assert len(cdx_row['datetime']) == 14 and cdx_row['datetime'].isdigit() - full_urls[short] = f"https://web.archive.org/web/{cdx_row['datetime']}/{original_url}" - status = "success-db" - break - else: - #print(f"cdx DB found, but no match", file=sys.stderr) - 
pass - cdx_row = None - - if short in full_urls: - continue - - cdx_record = None - try: - cdx_record = get_api_cdx(original_url, partial_dt=ts, http_session=session) - except requests.exceptions.HTTPError as e: - if e.response.status_code == 403: - return dict(file_entity=fe, full_urls=full_urls, status="fail-cdx-403") - else: - raise - if cdx_record: - if cdx_record['sha1hex'] == fe['sha1'] and cdx_record['url'] == original_url and cdx_record['datetime'].startswith(ts): - assert len(cdx_record['datetime']) == 14 and cdx_record['datetime'].isdigit() - full_urls[short] = f"https://web.archive.org/web/{cdx_record['datetime']}/{original_url}" - status = "success-api" - break - else: - print(f"cdx API found, but no match", file=sys.stderr) - else: - print(f"no CDX API record found: {original_url}", file=sys.stderr) - - if short not in full_urls: - return dict(file_entity=fe, full_urls=full_urls, status="fail-not-found") - - return dict( - file_entity=fe, - full_urls=full_urls, - status=status, - ) - -def main(): - session = requests_retry_session() - session.headers.update({ - "User-Agent": "Mozilla/5.0 fatcat.CdxFixupBot", - }) - for line in sys.stdin: - if not line.strip(): - continue - fe = json.loads(line) - print(json.dumps(process_file(fe, session=session))) - -if __name__=="__main__": - main() diff --git a/notes/cleanups/scripts/file2ingestrequest.py b/notes/cleanups/scripts/file2ingestrequest.py deleted file mode 100755 index a005837f..00000000 --- a/notes/cleanups/scripts/file2ingestrequest.py +++ /dev/null @@ -1,44 +0,0 @@ -#!/usr/bin/env python3 - -from typing import Optional -import json, sys - - -def transform(row: dict) -> Optional[dict]: - if row.get('mimetype') not in [None, 'application/pdf']: - return None - if row.get('state') != 'active': - return None - base_url = None - for url in (row.get('urls') or []): - url = url['url'] - if '://web.archive.org/' not in url and '://archive.org/' not in url: - base_url = url - break - if not base_url: - return None - if not row.get('sha1'): - return None - return dict( - base_url=base_url, - ingest_type="pdf", - link_source="fatcat", - link_source_id=f"file_{row['ident']}", - ingest_request_source="file-backfill", - ext_ids=dict( - sha1=row['sha1'], - ), - ) - - -def run(): - for l in sys.stdin: - if not l.strip(): - continue - row = json.loads(l) - request = transform(row) - if request: - print(json.dumps(request, sort_keys=True)) - -if __name__=="__main__": - run() diff --git a/notes/cleanups/scripts/file_dupe_to_json.py b/notes/cleanups/scripts/file_dupe_to_json.py deleted file mode 100755 index 2064dc1c..00000000 --- a/notes/cleanups/scripts/file_dupe_to_json.py +++ /dev/null @@ -1,72 +0,0 @@ -#!/usr/bin/env python3 - -""" -This script can be used to transform duplicate file entity hash export rows -into JSON objects which can be passed to the file entity merger. - -The input is expected to be a TSV with two columns: a hash value in the first -column, and a fatcat file entity ident (in UUID format, not "fatcat ident" -encoded) in the second column. The rows are assumed to be sorted by hash value -(the first column), and duplicate values (same hash, differing UUID) are -contiguous. - -File hashes aren't really "external identifiers" (ext_id), but we treat them as -such here. - -Script is pretty simple, should be possible to copy and reuse for release, -container, creator entity duplicates. 
-""" - -import json, sys -from typing import Optional -import base64, uuid - -EXTID_TYPE = "sha1" - -def uuid2fcid(s: str) -> str: - """ - Converts a uuid.UUID object to a fatcat identifier (base32 encoded string) - """ - raw = uuid.UUID(s).bytes - return base64.b32encode(raw)[:26].lower().decode("utf-8") - -def print_group(extid, dupe_ids): - if len(dupe_ids) < 2: - return - group = dict( - entity_type="file", - primary_id=None, - duplicate_ids=dupe_ids, - evidence=dict( - extid=extid, - extid_type=EXTID_TYPE, - ), - ) - print(json.dumps(group, sort_keys=True)) - -def run(): - last_extid = None - dupe_ids = [] - for l in sys.stdin: - l = l.strip() - if not l: - continue - (row_extid, row_uuid) = l.split("\t")[0:2] - if EXTID_TYPE == "sha1": - assert len(row_extid) == 40 - else: - raise Exception(f"extid type not supported yet: {EXTID_TYPE}") - row_id = uuid2fcid(row_uuid) - if row_extid == last_extid: - dupe_ids.append(row_id) - continue - elif dupe_ids: - print_group(last_extid, dupe_ids) - last_extid = row_extid - dupe_ids = [row_id] - if last_extid and dupe_ids: - print_group(last_extid, dupe_ids) - - -if __name__=="__main__": - run() diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md deleted file mode 100644 index 9db77058..00000000 --- a/notes/cleanups/wayback_timestamps.md +++ /dev/null @@ -1,304 +0,0 @@ - -At some point, using the arabesque importer (from targeted crawling), we -accidentally imported a bunch of files with wayback URLs that have 12-digit -timestamps, instead of the full canonical 14-digit timestamps. - - -## Prep (2021-11-04) - -Download most recent file export: - - wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz - -Filter to files with problem of interest: - - zcat file_export.json.gz \ - | pv -l \ - | rg 'web.archive.org/web/\d{12}/' \ - | gzip \ - > files_20211007_shortts.json.gz - # 111M 0:12:35 - - zcat files_20211007_shortts.json.gz | wc -l - # 7,935,009 - - zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json - -Wow, this is a lot more than I thought! 
- -There might also be some other short URL patterns, check for those: - - zcat file_export.json.gz \ - | pv -l \ - | rg 'web.archive.org/web/\d{1,11}/' \ - | gzip \ - > files_20211007_veryshortts.json.gz - # skipped, merging with below - - zcat file_export.json.gz \ - | rg 'web.archive.org/web/None/' \ - | pv -l \ - > /dev/null - # 0.00 0:10:06 [0.00 /s] - # whew, that pattern has been fixed it seems - - zcat file_export.json.gz | rg '/None/' | pv -l > /dev/null - # 2.00 0:10:01 [3.33m/s] - - zcat file_export.json.gz \ - | rg 'web.archive.org/web/\d{13}/' \ - | pv -l \ - > /dev/null - # 0.00 0:10:09 [0.00 /s] - -Yes, 4-digit is a popular pattern as well, need to handle those: - - zcat file_export.json.gz \ - | pv -l \ - | rg 'web.archive.org/web/\d{4,12}/' \ - | gzip \ - > files_20211007_moreshortts.json.gz - # 111M 0:13:22 [ 139k/s] - - zcat files_20211007_moreshortts.json.gz | wc -l - # 9,958,854 - - zcat files_20211007_moreshortts.json.gz | shuf -n10000 > files_20211007_moreshortts.10k_sample.json - - -## Fetch Complete URL - -Want to export JSON like: - - file_entity - [existing file entity] - full_urls[]: list of Dicts[str,str] - <short_url>: <full_url> - status: str - -Status one of: - -- 'success-self': the file already has a fixed URL internally -- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched -- 'success-api': CDX API lookup succeeded, and SHA1 matched -- 'fail-not-found': no matching CDX record found - -Ran over a sample: - - cat files_20211007_shortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json - - cat sample_out.json | jq .status | sort | uniq -c - 5 "fail-not-found" - 576 "success-api" - 7212 "success-db" - 2207 "success-self" - - head -n1000 | ./fetch_full_cdx_ts.py > sample_out.json - - zcat files_20211007_veryshortts.json.gz | head -n1000 | ./fetch_full_cdx_ts.py | jq .status | sort | uniq -c - 2 "fail-not-found" - 168 "success-api" - 208 "success-db" - 622 "success-self" - -Investigating the "fail-not-found", they look like http/https URL -not-exact-matches. Going to put off handling these for now because it is a -small fraction and more delicate. - -Again with the broader set: - - cat files_20211007_moreshortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json - - cat sample_out.json | jq .status | sort | uniq -c - 9 "fail-not-found" - 781 "success-api" - 6175 "success-db" - 3035 "success-self" - -While running a larger batch, got a CDX API error: - - requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.psychologytoday.com%2Ffiles%2Fu47%2FHenry_et_al.pdf&from=2017&to=2017&matchType=exact&output=json&limit=20 - - org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error - -So maybe need to use credentials after all. - - -## Cleanup Process - -Other possible cleanups to run at the same time, which would not require -external requests or other context: - -- URL has ://archive.org/ link with rel=repository => rel=archive -- mimetype is bogus => clean mimetype -- bogus file => set some new extra field, like scope=stub or scope=partial (?) - -It looks like the rel swap is already implemented in `generic_file_cleanups()`. -From sampling it seems like the mimetype issue is pretty small, so not going to -bite that off now. The "bogus file" issue requires thought, so also skipping.
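For reference, the rel swap above is simple enough to sketch. This is only an illustration of the rule, not the actual `generic_file_cleanups()` implementation; URL entries are assumed to be dicts with `url` and `rel` keys, as in the file entity schema:

    def fix_repository_rel(url_entry: dict) -> dict:
        # archive.org links imported with rel=repository should be rel=archive
        if "://archive.org/" in url_entry["url"] and url_entry.get("rel") == "repository":
            url_entry["rel"] = "archive"
        return url_entry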
- - -## Commands (old) - -Running with 8x parallelism to not break things; expecting some errors along -the way, may need to add handlers for connection errors etc: - - # OLD SNAPSHOT - zcat files_20211007_moreshortts.json.gz \ - | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \ - | pv -l \ - | gzip \ - > files_20211007_moreshortts.fetched.json.gz - -At 300 records/sec, this should take around 9-10 hours to process. - - - -## Prep Again (2021-11-09) - -After fixing "sort" issue and re-dumping file entities (2021-11-05 snapshot). - -Filter again: - - # note: in the future use pigz instead of gzip here - zcat file_export.json.gz \ - | pv -l \ - | rg 'web.archive.org/web/\d{4,12}/' \ - | gzip \ - > files_20211105_moreshortts.json.gz - # 112M 0:13:27 [ 138k/s] - - zcat files_20211105_moreshortts.json.gz | wc -l - # 9,958,854 - # good, exact same number as previous snapshot - - zcat files_20211105_moreshortts.json.gz | shuf -n10000 > files_20211105_moreshortts.10k_sample.json - # done - - cat files_20211105_moreshortts.10k_sample.json \ - | ./fetch_full_cdx_ts.py \ - | pv -l \ - > files_20211105_moreshortts.10k_sample.fetched.json - # 10.0k 0:03:36 [46.3 /s] - - cat files_20211105_moreshortts.10k_sample.fetched.json | jq .status | sort | uniq -c - 13 "fail-not-found" - 774 "success-api" - 6193 "success-db" - 3020 "success-self" - -After tweaking `success-self` logic: - - 13 "fail-not-found" - 859 "success-api" - 6229 "success-db" - 2899 "success-self" - - -## Testing in QA - -Copied `sample_out.json` to fatcat QA instance and renamed as `files_20211007_moreshortts.10k_sample.fetched.json` - - # OLD ATTEMPT - export FATCAT_API_AUTH_TOKEN=[...] - head -n10 /srv/fatcat/datasets/files_20211007_moreshortts.10k_sample.fetched.json \ - | python -m fatcat_tools.cleanups.file_short_wayback_ts - - -Ran in to issues, iterated above. - -Trying again with updated script and sample file: - - export FATCAT_AUTH_WORKER_CLEANUP=[...] - - head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \ - | python -m fatcat_tools.cleanups.file_short_wayback_ts - - # Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0}) - -Manually inspected and these look good. Trying some repeats and larger batched: - - head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \ - | python -m fatcat_tools.cleanups.file_short_wayback_ts - - # Counter({'total': 10, 'skip-revision-changed': 10, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0}) - - head -n1000 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \ - | python -m fatcat_tools.cleanups.file_short_wayback_ts - - - [...] - bad replacement URL: partial_ts=201807271139 original=http://www.scielo.br/pdf/qn/v20n1/4918.pdf fix_url=https://web.archive.org/web/20170819080342/http://www.scielo.br/pdf/qn/v20n1/4918.pdf - bad replacement URL: partial_ts=201904270207 original=https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf fix_url=https://web.archive.org/web/20190501060839/https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf - bad replacement URL: partial_ts=201905011445 original=https://cdn.intechopen.com/pdfs/5886.pdf fix_url=https://web.archive.org/web/20190502203832/https://cdn.intechopen.com/pdfs/5886.pdf - [...] 
- # Counter({'total': 1000, 'update': 969, 'skip': 19, 'skip-bad-replacement': 18, 'skip-revision-changed': 10, 'skip-bad-wayback-timestamp': 2, 'skip-status': 1, 'insert': 0, 'exists': 0}) - - -It looks like these "bad replacement URLs" are due to timestamp mismatches. E.g., the partial timestamp is not part of the final timestamp. - -Tweaked fetch script and re-ran: - - # Counter({'total': 1000, 'skip-revision-changed': 979, 'update': 18, 'skip-bad-wayback-timestamp': 2, 'skip': 1, 'skip-status': 1, 'insert': 0, 'exists': 0}) - -Cool. Sort of curious what the deal is with those `skip-bad-wayback-timestamp`. - -Run the rest through: - - cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \ - | python -m fatcat_tools.cleanups.file_short_wayback_ts - - # Counter({'total': 10000, 'update': 8976, 'skip-revision-changed': 997, 'skip-bad-wayback-timestamp': 14, 'skip': 13, 'skip-status': 13, 'insert': 0, 'exists': 0}) - -Should tweak batch size to 100 (vs. 50). - -How to parallelize import: - - # from within pipenv - cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \ - | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_short_wayback_ts - - - -## Full Batch Commands - -Running in bulk again: - - zcat files_20211105_moreshortts.json.gz \ - | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \ - | pv -l \ - | gzip \ - > files_20211105_moreshortts.fetched.json.gz - -Ran into one: `requests.exceptions.HTTPError: 503 Server Error: Service -Temporarily Unavailable for url: [...]`. Will try again; if there are more -failures, may need to split up into smaller chunks. - -Unexpected: - - Traceback (most recent call last): - File "./fetch_full_cdx_ts.py", line 200, in <module> - main() - File "./fetch_full_cdx_ts.py", line 197, in main - print(json.dumps(process_file(fe, session=session))) - File "./fetch_full_cdx_ts.py", line 118, in process_file - assert seg[4].isdigit() - AssertionError - 3.96M 3:04:46 [ 357 /s] - -Ugh. - - zcat files_20211105_moreshortts.json.gz \ - | tac \ - | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \ - | pv -l \ - | gzip \ - > files_20211105_moreshortts.fetched.json.gz - # 9.96M 6:38:43 [ 416 /s] - -Looks like the last small tweak was successful! This was with git commit -`cd09c6d6bd4deef0627de4f8a8a301725db01e14`. - - zcat files_20211105_moreshortts.fetched.json.gz | jq .status | sort | uniq -c | sort -nr - 6228307 "success-db" - 2876033 "success-self" - 846844 "success-api" - 7583 "fail-not-found" - 87 "fail-cdx-403"
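As a closing note on the "bad replacement URL" cases above: the fix being tested was to require that the full 14-digit wayback timestamp actually extends the partial one, instead of accepting any capture of the same URL. A minimal sketch of that kind of check; the function name is illustrative, and `seg[4]` is the timestamp path segment, following the same parsing as `fetch_full_cdx_ts.py`:

    def replacement_ok(partial_ts: str, fix_url: str) -> bool:
        # the full 14-digit timestamp must start with the short (4- or
        # 12-digit) timestamp from the original wayback URL
        seg = fix_url.split("/")
        if len(seg) < 6:
            return False
        ts = seg[4]
        return len(ts) == 14 and ts.isdigit() and ts.startswith(partial_ts)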