author Bryan Newbold <bnewbold@archive.org> 2023-01-02 19:10:01 -0800
committer Bryan Newbold <bnewbold@archive.org> 2023-01-02 19:10:01 -0800
commit e433990172c157707d92452652aefe2f21b6a4a0
tree de662f71c65447017828de4f10fe43eb5705c40f
parent b7e4629f3c84f35af5ad62346a9480bea957c719
python-specific README file
-rw-r--r-- TODO 2
-rw-r--r-- python/README.md 46
-rw-r--r-- python/TODO 7
3 files changed, 48 insertions, 7 deletions
diff --git a/TODO b/TODO
index 77b48c9..33dc147 100644
--- a/TODO
+++ b/TODO
@@ -1,4 +1,6 @@
+Note: as of 2022 this file is ancient and needs review
+
## Kafka Pipelines
- after network split, mass restarting import/harvest stuff seemed to
diff --git a/python/README.md b/python/README.md
new file mode 100644
index 0000000..4395f19
--- /dev/null
+++ b/python/README.md
@@ -0,0 +1,46 @@
+
+This directory contains `sandcrawler` python code for ingest pipelines, batch
+processing, PDF extraction, etc.
+
+
+## Development Quickstart
+
+As of December 2022, working with this code requires:
+
+- Python 3.8 (specifically, due to version specification in `pipenv`)
+- `pipenv` for python dependency management
+- generic and python-specific build tools (`pkg-config`, `python-dev`, etc)
+- poppler (PDF processing library)
+- libmagic
+- libsodium
+- access to IA internal packages (`devpi.us.archive.org`), specifically for
+ globalwayback and related packages
+
+In production and CI we use Ubuntu Focal (20.04). The CI script for this
+repository (`../.gitlab-ci.yml`) is the best place to look for a complete list
+of dependencies for both development and deployment. Note that our CI system
+runs from our cluster, which resolves the devpi access issue. For developer
+laptops, you may need `sshuttle` or something similar set up to do initial
+package pulls.
+
+It is recommended to set the env variable `PIPENV_VENV_IN_PROJECT=true` when
+working with pipenv. You can include this in a `.env` file.
+
+There is a Makefile which helps with the basics. Eg:
+
+    # install deps using pipenv
+    make deps
+
+    # run python tests
+    make test
+
+    # run code formatting and lint checks
+    make fmt lint
+
+Sometimes when developing it is helpful to enter a shell with pipenv, eg:
+
+    pipenv shell
+
+Often when developing it is helpful (or necessary) to set environment
+variables. `pipenv shell` will read from `.env`, so you can copy and edit
+`example.env`, and it will be used in tests, `pipenv shell`, etc.
diff --git a/python/TODO b/python/TODO
deleted file mode 100644
index 58a463f..0000000
--- a/python/TODO
+++ /dev/null
@@ -1,7 +0,0 @@
-
-ingest crawler:
-- SPNv2 only
- - remove most SPNv1/v2 path selection
-- landing page + fulltext hops only (short recursion depth)
-- use wayback client library instead of requests to fetch content
-- https://pypi.org/project/ratelimit/
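
Note on the README added in this commit: its quickstart lists several prerequisites (Python 3.8, `pipenv`, `pkg-config`, and others). A minimal sketch of a pre-flight check for a few of them follows. It is not part of the commit or the repository: the function name `check_prereqs` and the choice of which tools to probe are assumptions made for illustration, and the check does not cover the system libraries (poppler, libmagic, libsodium) or devpi access.

```python
import shutil
import sys

# Taken from the README text: Python 3.8 is required, and these tools are named.
# The tuple of tools is an illustrative subset, not a complete dependency list.
REQUIRED_PYTHON = (3, 8)
REQUIRED_TOOLS = ("pipenv", "pkg-config")


def check_prereqs(version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `version` defaults to the running interpreter; `which` is injectable so the
    lookup can be replaced in tests (hypothetical helper, not repository code).
    """
    version = tuple(version or sys.version_info)[:2]
    problems = []
    if version != REQUIRED_PYTHON:
        problems.append(
            "Python %d.%d required, found %d.%d" % (REQUIRED_PYTHON + version)
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append("missing tool: " + tool)
    return problems


if __name__ == "__main__":
    for problem in check_prereqs():
        print(problem)
```

Running the script before `make deps` would print one line per unmet prerequisite and nothing if the checked items are present. The CI script referenced in the README remains the authoritative dependency list.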