| author    | Bryan Newbold <bnewbold@robocracy.org>           | 2021-11-09 15:46:20 -0800 |
|-----------|--------------------------------------------------|---------------------------|
| committer | Bryan Newbold <bnewbold@robocracy.org>           | 2021-11-09 15:46:20 -0800 |
| commit    | 996b2e2084c1798126bd91dd950c063982398bec (patch) |                           |
| tree      | 2c4a9eef6432158088fea255db9e8b7b098a371d /notes  |                           |
| parent    | a246b5a54dac6b29a30e90265d64a3c4332902e5 (diff)  |                           |
more iteration on short wayback timestamp cleanup
Diffstat (limited to 'notes')

    -rw-r--r--  notes/cleanups/scripts/fetch_full_cdx_ts.py |   2
    -rw-r--r--  notes/cleanups/wayback_timestamps.md        | 129

2 files changed, 128 insertions, 3 deletions
diff --git a/notes/cleanups/scripts/fetch_full_cdx_ts.py b/notes/cleanups/scripts/fetch_full_cdx_ts.py
index 6f67c7e1..d5b0c476 100644
--- a/notes/cleanups/scripts/fetch_full_cdx_ts.py
+++ b/notes/cleanups/scripts/fetch_full_cdx_ts.py
@@ -137,7 +137,7 @@ def process_file(fe, session) -> dict:
         if short in full_urls:
             continue
-        if original_url in self_urls:
+        if original_url in self_urls and ts in self_urls[original_url]:
             full_urls[short] = self_urls[original_url]
             status = "success-self"
             continue
diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md
index 81785992..85e5f94f 100644
--- a/notes/cleanups/wayback_timestamps.md
+++ b/notes/cleanups/wayback_timestamps.md
@@ -61,9 +61,10 @@ Yes, 4-digit is a popular pattern as well, need to handle those:
     # 111M 0:13:22 [ 139k/s]
 
     zcat files_20211007_moreshortts.json.gz | wc -l
+    # 9,958,854
 
     zcat files_20211007_moreshortts.json.gz | shuf -n10000 > files_20211007_moreshortts.10k_sample.json
-    # 9,958,854
+
 
 ## Fetch Complete URL
 
@@ -114,6 +115,14 @@ Again with the broader set:
        6175 "success-db"
        3035 "success-self"
 
+While running a larger batch, got a CDX API error:
+
+    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.psychologytoday.com%2Ffiles%2Fu47%2FHenry_et_al.pdf&from=2017&to=2017&matchType=exact&output=json&limit=20
+
+    org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error
+
+So maybe need to use credentials after all.
+
 
 ## Cleanup Process
 
@@ -128,11 +137,13 @@ It looks like the rel swap is already implemented in `generic_file_cleanups()`.
 From sampling it seems like the mimetype issue is pretty small, so not going to
 bite that off now. The "bogus file" issue requires thought, so also skipping.
-## Commands
+
+## Commands (old)
 
 Running with 8x parallelism to not break things; expecting some errors along
 the way, may need to add handlers for connection errors etc:
 
+    # OLD SNAPSHOT
     zcat files_20211007_moreshortts.json.gz \
         | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
         | pv -l \
@@ -140,3 +151,117 @@ the way, may need to add handlers for connection errors etc:
         > files_20211007_moreshortts.fetched.json.gz
 
 At 300 records/sec, this should take around 9-10 hours to process.
+
+
+
+## Prep Again (2021-11-09)
+
+After fixing "sort" issue and re-dumping file entities (2021-11-05 snapshot).
+
+Filter again:
+
+    # note: in the future use pigz instead of gzip here
+    zcat file_export.json.gz \
+        | pv -l \
+        | rg 'web.archive.org/web/\d{4,12}/' \
+        | gzip \
+        > files_20211105_moreshortts.json.gz
+    # 112M 0:13:27 [ 138k/s]
+
+    zcat files_20211105_moreshortts.json.gz | wc -l
+    # 9,958,854
+    # good, exact same number as previous snapshot
+
+    zcat files_20211105_moreshortts.json.gz | shuf -n10000 > files_20211105_moreshortts.10k_sample.json
+    # done
+
+    cat files_20211105_moreshortts.10k_sample.json \
+        | ./fetch_full_cdx_ts.py \
+        | pv -l \
+        > files_20211105_moreshortts.10k_sample.fetched.json
+    # 10.0k 0:03:36 [46.3 /s]
+
+    cat files_20211105_moreshortts.10k_sample.fetched.json | jq .status | sort | uniq -c
+         13 "fail-not-found"
+        774 "success-api"
+       6193 "success-db"
+       3020 "success-self"
+
+After tweaking `success-self` logic:
+
+         13 "fail-not-found"
+        859 "success-api"
+       6229 "success-db"
+       2899 "success-self"
+
+
+## Testing in QA
+
+Copied `sample_out.json` to fatcat QA instance and renamed as `files_20211007_moreshortts.10k_sample.fetched.json`
+
+    # OLD ATTEMPT
+    export FATCAT_API_AUTH_TOKEN=[...]
+    head -n10 /srv/fatcat/datasets/files_20211007_moreshortts.10k_sample.fetched.json \
+        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
+
+Ran into issues, iterated above.
+
+Trying again with updated script and sample file:
+
+    export FATCAT_AUTH_WORKER_CLEANUP=[...]
+
+    head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
+        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
+    # Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0})
+
+Manually inspected and these look good. Trying some repeats and larger batches:
+
+    head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
+        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
+    # Counter({'total': 10, 'skip-revision-changed': 10, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0})
+
+    head -n1000 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
+        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
+
+    [...]
+    bad replacement URL: partial_ts=201807271139 original=http://www.scielo.br/pdf/qn/v20n1/4918.pdf fix_url=https://web.archive.org/web/20170819080342/http://www.scielo.br/pdf/qn/v20n1/4918.pdf
+    bad replacement URL: partial_ts=201904270207 original=https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf fix_url=https://web.archive.org/web/20190501060839/https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf
+    bad replacement URL: partial_ts=201905011445 original=https://cdn.intechopen.com/pdfs/5886.pdf fix_url=https://web.archive.org/web/20190502203832/https://cdn.intechopen.com/pdfs/5886.pdf
+    [...]
+
+    # Counter({'total': 1000, 'update': 969, 'skip': 19, 'skip-bad-replacement': 18, 'skip-revision-changed': 10, 'skip-bad-wayback-timestamp': 2, 'skip-status': 1, 'insert': 0, 'exists': 0})
+
+
+It looks like these "bad replacement URLs" are due to timestamp mismatches. Eg,
+the partial timestamp is not part of the final timestamp.
+
+Tweaked fetch script and re-ran:
+
+    # Counter({'total': 1000, 'skip-revision-changed': 979, 'update': 18, 'skip-bad-wayback-timestamp': 2, 'skip': 1, 'skip-status': 1, 'insert': 0, 'exists': 0})
+
+Cool. Sort of curious what the deal is with those `skip-bad-wayback-timestamp`.
+
+Run the rest through:
+
+    cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
+        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
+    # Counter({'total': 10000, 'update': 8976, 'skip-revision-changed': 997, 'skip-bad-wayback-timestamp': 14, 'skip': 13, 'skip-status': 13, 'insert': 0, 'exists': 0})
+
+Should tweak batch size to 100 (vs. 50).
+
+How to parallelize import:
+
+    # from within pipenv
+    cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
+        | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_short_wayback_ts -
+
+
+## Full Batch Commands
+
+Running in bulk again:
+
+    zcat files_20211105_moreshortts.json.gz \
+        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
+        | pv -l \
+        | gzip \
+        > files_20211105_moreshortts.fetched.json.gz
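
The "bad replacement URL" cases above are exactly the situation the one-line script change in this commit guards against: a candidate full URL whose 14-digit timestamp does not actually contain the short/partial timestamp. A minimal sketch of that check; the function name and regex here are illustrative, not the actual helpers in `fetch_full_cdx_ts.py`:

```python
import re

def matching_full_url(partial_ts, candidate_urls):
    """Return the first candidate wayback URL whose full 14-digit
    timestamp starts with the short/partial timestamp, else None."""
    for url in candidate_urls:
        m = re.search(r"web\.archive\.org/web/(\d{14})/", url)
        if m and m.group(1).startswith(partial_ts):
            return url
    return None
```

For the first logged failure, `partial_ts=201807271139` against a candidate timestamped `20170819080342` correctly returns no match, so the entity is skipped instead of being "fixed" with the wrong capture.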

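The 403 error quoted in the diff exposes the CDX API query shape the fetch script uses. A sketch of constructing that query; the endpoint and parameter names are taken directly from the error URL, while the helper function itself is hypothetical:

```python
from urllib.parse import urlencode

CDX_API = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, year):
    # Parameter names mirror those visible in the 403 error URL:
    # url, from, to, matchType, output, limit
    params = {
        "url": url,
        "from": year,
        "to": year,
        "matchType": "exact",
        "output": "json",
        "limit": 20,
    }
    return CDX_API + "?" + urlencode(params)
```

Blocked sites surface as an HTTP 403 (an `AdministrativeAccessControlException` on the server side), so the fetch loop needs to catch `requests.exceptions.HTTPError` rather than assume every lookup succeeds; whether authenticated requests avoid the block is the open question noted above.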