author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-09 21:12:54 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-09 21:12:54 -0800 |
commit | ca3aedcefd5c3835b6fa9b685200ef12d635d22c (patch) | |
tree | 668dfc2280e5755b99e72bf8e089c642b9420315 /notes/cleanups/wayback_timestamps.md | |
parent | 86056629e7778543dc31c962e7e3f273e1ec48e6 (diff) | |
update cleanups notes
Diffstat (limited to 'notes/cleanups/wayback_timestamps.md')
-rw-r--r-- | notes/cleanups/wayback_timestamps.md | 24 |
1 files changed, 24 insertions, 0 deletions
diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md
index 85e5f94f..38bc3f7d 100644
--- a/notes/cleanups/wayback_timestamps.md
+++ b/notes/cleanups/wayback_timestamps.md
@@ -265,3 +265,27 @@ Running in bulk again:
         | gzip \
         > files_20211105_moreshortts.fetched.json.gz
 
+Ran in to one: `requests.exceptions.HTTPError: 503 Server Error: Service
+Temporarily Unavailable for url: [...]`. Will try again, if there are more
+failures may need to split up in smaller chunks.
+
+Unexpected:
+
+    Traceback (most recent call last):
+      File "./fetch_full_cdx_ts.py", line 200, in <module>
+        main()
+      File "./fetch_full_cdx_ts.py", line 197, in main
+        print(json.dumps(process_file(fe, session=session)))
+      File "./fetch_full_cdx_ts.py", line 118, in process_file
+        assert seg[4].isdigit()
+    AssertionError
+    3.96M 3:04:46 [ 357 /s]
+
+Ugh.
+
+    zcat files_20211105_moreshortts.json.gz \
+        | tac \
+        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
+        | pv -l \
+        | gzip \
+        > files_20211105_moreshortts.fetched.json.gz
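
The traceback above shows one worker aborting on `assert seg[4].isdigit()` after roughly three hours of a bulk run. The source of `fetch_full_cdx_ts.py` is not part of this diff, so the following is only a sketch of how such a check might be made non-fatal. It assumes that `seg` comes from splitting a wayback URL of the form `https://web.archive.org/web/<timestamp>/<original-url>` on `/`, and the helper name `wayback_timestamp` is hypothetical. What actually triggered the assertion is not stated in the notes; the suffixed timestamp in the second sample URL is just one plausible case.

```python
from typing import Optional


def wayback_timestamp(url: str) -> Optional[str]:
    """Return the timestamp segment of a wayback URL, or None if it is unusable.

    Assumption: URLs look like
    https://web.archive.org/web/<timestamp>/<original-url>,
    so splitting on "/" puts the timestamp at index 4.
    """
    seg = url.split("/")
    if len(seg) < 5 or not seg[4].isdigit():
        # Hypothetical handling: signal "no usable timestamp" to the caller
        # instead of raising AssertionError and killing the whole worker.
        return None
    return seg[4]


# Made-up sample URLs for illustration only.
urls = [
    "https://web.archive.org/web/2021110504/http://example.com/paper.pdf",
    "https://web.archive.org/web/2021110504id_/http://example.com/paper.pdf",
    "https://web.archive.org/web/",
]
results = [wayback_timestamp(u) for u in urls]
```

A caller could then skip or log entities whose URL yields `None` and keep processing the rest of the input, which may be preferable to restarting a multi-hour pipeline from scratch.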