path: root/arabesque.py
author Bryan Newbold <bnewbold@archive.org> 2019-04-24 02:13:00 +0000
committer Bryan Newbold <bnewbold@archive.org> 2019-04-24 02:13:06 +0000
commit d6457355b5241d32333718ba7aca316976695019
tree ff5470006439a532f0d0b0209734a32df2903db6 /arabesque.py
parent 71ed3d20c6898df32a31c9b1ecc843e56c976e9d
small doc/TODO notes
Diffstat (limited to 'arabesque.py')
-rwxr-xr-x arabesque.py 7
1 files changed, 4 insertions, 3 deletions
diff --git a/arabesque.py b/arabesque.py
index 8dbc0ca..55e6223 100755
--- a/arabesque.py
+++ b/arabesque.py
@@ -11,8 +11,8 @@ Commands/modes:
 - backward <input.log> <input-map.sqlite> <output.sqlite>
 - forward <input.seed_identifiers> <output.sqlite>
 - everything <input.log> <input.cdx> <input.seed_identifiers> <output.sqlite>
-- postprocess
-- dump_json
+- postprocess <sha1_status.tsv> <output.sqlite>
+- dump_json <output.sqlite>
 
 Design docs in DESIGN.md
 
@@ -21,8 +21,9 @@ Software under the GPLv3 license (a copy of which should be included with this
 file).
 
 TODO:
+- pass SHA-1 and timestamp in forward mode (?)
+- include final_size (if possible from crawl log)
 - open map in read-only when appropriate
-- some kind of stats dump command? (querying sqlite)
 - should referrer map be UNIQ?
 - forward outputs get generated multiple times?
 - try: https://pypi.org/project/urlcanon/
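
Note on the TODO item "open map in read-only when appropriate": one way this could be done with Python's standard `sqlite3` module is to connect through a `file:` URI with `mode=ro`. The sketch below is an assumption about how that might look, not code from arabesque.py; the helper name `open_readonly` is hypothetical.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path):
    """Open an existing SQLite file read-only.

    Hypothetical helper for illustration; not part of arabesque.py.
    """
    # as_uri() yields a percent-encoded file:// URI. The mode=ro query
    # parameter asks SQLite to reject writes and to fail if the file
    # does not already exist (instead of silently creating it).
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, SELECT queries work as usual, while an INSERT or UPDATE raises `sqlite3.OperationalError`, which guards an input map such as `<input-map.sqlite>` against accidental modification.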