aboutsummaryrefslogtreecommitdiffstats
path: root/notes/cleanups/wayback_timestamps.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2021-11-09 21:12:54 -0800
committerBryan Newbold <bnewbold@robocracy.org>2021-11-09 21:12:54 -0800
commitca3aedcefd5c3835b6fa9b685200ef12d635d22c (patch)
tree668dfc2280e5755b99e72bf8e089c642b9420315 /notes/cleanups/wayback_timestamps.md
parent86056629e7778543dc31c962e7e3f273e1ec48e6 (diff)
downloadfatcat-ca3aedcefd5c3835b6fa9b685200ef12d635d22c.tar.gz
fatcat-ca3aedcefd5c3835b6fa9b685200ef12d635d22c.zip
update cleanups notes
Diffstat (limited to 'notes/cleanups/wayback_timestamps.md')
-rw-r--r--notes/cleanups/wayback_timestamps.md24
1 files changed, 24 insertions, 0 deletions
diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md
index 85e5f94f..38bc3f7d 100644
--- a/notes/cleanups/wayback_timestamps.md
+++ b/notes/cleanups/wayback_timestamps.md
@@ -265,3 +265,27 @@ Running in bulk again:
| gzip \
> files_20211105_moreshortts.fetched.json.gz
+Ran in to one: `requests.exceptions.HTTPError: 503 Server Error: Service
+Temporarily Unavailable for url: [...]`. Will try again, if there are more
+failures may need to split up in smaller chunks.
+
+Unexpected:
+
+ Traceback (most recent call last):
+ File "./fetch_full_cdx_ts.py", line 200, in <module>
+ main()
+ File "./fetch_full_cdx_ts.py", line 197, in main
+ print(json.dumps(process_file(fe, session=session)))
+ File "./fetch_full_cdx_ts.py", line 118, in process_file
+ assert seg[4].isdigit()
+ AssertionError
+ 3.96M 3:04:46 [ 357 /s]
+
+Ugh.
+
+ zcat files_20211105_moreshortts.json.gz \
+ | tac \
+ | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
+ | pv -l \
+ | gzip \
+ > files_20211105_moreshortts.fetched.json.gz