-rw-r--r-- | TODO | 2
-rw-r--r-- | python/README.md | 46
-rw-r--r-- | python/TODO | 7
3 files changed, 48 insertions, 7 deletions
@@ -1,4 +1,6 @@
+Note: as of 2022 this file is ancient and need review
+
 ## Kafka Pipelines
 - after network split, mass restarting import/harvest stuff seemed to
diff --git a/python/README.md b/python/README.md
new file mode 100644
index 0000000..4395f19
--- /dev/null
+++ b/python/README.md
@@ -0,0 +1,46 @@
+
+This directory contains `sandcrawler` python code for ingest pipelines, batch
+processing, PDF extraction, etc.
+
+
+## Development Quickstart
+
+As of December 2022, working with this code requires:
+
+- Python 3.8 (specifically, due to version specification in `pipenv`)
+- `pipenv` for python dependency management
+- generic and python-specific build tools (`pkg-config`, `python-dev`, etc)
+- poppler (PDF processing library)
+- libmagic
+- libsodium
+- access to IA internal packages (`devpi.us.archive.org`), specifically for
+  globalwayback and related packages
+
+In production and CI we use Ubuntu Focal (20.04). The CI script for this
+repository (`../.gitlab-ci.yml`) is the best place to look for a complete list
+of dependencies for both development and deployment. Note that our CI system
+runs from our cluster, which resolves the devpi access issue. For developer
+laptops, you may need `sshuttle` or something similar set up to do initial
+package pulls.
+
+It is recommended to set the env variable `PIPENV_VENV_IN_PROJECT=true` when
+working with pipenv. You can include this in a `.env` file.
+
+There is a Makefile which helps with the basics. Eg:
+
+    # install deps using pipenv
+    make deps
+
+    # run python tests
+    make test
+
+    # run code formatting and lint checks
+    make fmt lint
+
+Sometimes when developing it is helpful to enter a shell with pipenv, eg:
+
+    pipenv shell
+
+Often when developing it is helpful (or necessary) to set environment
+variables. `pipenv shell` will read from `.env`, so you can copy and edit
+`example.env`, and it will be used in tests, `pipenv shell`, etc.
diff --git a/python/TODO b/python/TODO
deleted file mode 100644
index 58a463f..0000000
--- a/python/TODO
+++ /dev/null
@@ -1,7 +0,0 @@
-
-ingest crawler:
-- SPNv2 only
-    - remove most SPNv1/v2 path selection
-- landing page + fulltext hops only (short recursion depth)
-- use wayback client library instead of requests to fetch content
-- https://pypi.org/project/ratelimit/
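
The Development Quickstart added in python/README.md lists several prerequisites (Python 3.8, `pipenv`, `pkg-config`, and the recommended `PIPENV_VENV_IN_PROJECT=true` setting). A minimal sketch of a pre-flight check for some of them might look like the following. The function name `check_dev_environment` and its parameters are hypothetical and not part of the sandcrawler repository. The sketch only inspects the process environment, so a value set solely in a `.env` file would not be seen, and it does not check poppler, libmagic, libsodium, or devpi access.

```python
import os
import shutil
import sys


def check_dev_environment(environ=None, python_version=None, which=shutil.which):
    """Return a list of human-readable problems (hypothetical helper).

    An empty list means none of the checked prerequisites looked wrong.
    The parameters exist only so the check can be exercised without
    touching the real environment.
    """
    environ = os.environ if environ is None else environ
    if python_version is None:
        version = tuple(sys.version_info[:2])
    else:
        version = tuple(python_version)

    problems = []

    # The README pins Python 3.8 because of the version specified for pipenv.
    if version != (3, 8):
        problems.append("Python 3.8 expected, found %d.%d" % version)

    # Command-line tools the README names explicitly.
    for tool in ("pipenv", "pkg-config"):
        if which(tool) is None:
            problems.append("%s not found on PATH" % tool)

    # Recommended, not required, by the README.
    if environ.get("PIPENV_VENV_IN_PROJECT", "").lower() != "true":
        problems.append("PIPENV_VENV_IN_PROJECT is not set to true")

    return problems
```

Calling `check_dev_environment()` with no arguments inspects the current interpreter and environment; printing each returned string before running `make deps` would be one way to use it.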