This directory contains `sandcrawler` python code for ingest pipelines, batch
processing, PDF extraction, etc.

## Development Quickstart

As of December 2022, working with this code requires:
- Python 3.8 (this exact version, because of the version specification in `pipenv`)
- `pipenv` for python dependency management
- generic and python-specific build tools (`pkg-config`, `python-dev`, etc.)
- poppler (PDF processing library)
- libmagic
- libsodium
- access to IA internal packages (`devpi.us.archive.org`), specifically for
  globalwayback and related packages

In production and CI we use Ubuntu Focal (20.04). The CI script for this
repository (`../.gitlab-ci.yml`) is the best place to look for a complete list
of dependencies for both development and deployment. Note that our CI system
runs from our cluster, which resolves the devpi access issue. For developer
laptops, you may need `sshuttle` or something similar set up to do initial
package pulls.
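
As a quick sanity check before installing dependencies, some of the prerequisites listed above can be probed from Python. This is only a sketch and not part of the repository: the function name `check_prerequisites` is hypothetical, the short library names passed to `ctypes.util.find_library` (`magic`, `sodium`) are assumptions about a typical install, and poppler and devpi access are not checked at all.

```python
import ctypes.util
import shutil
import sys

# Command-line tools named in the prerequisites list.
REQUIRED_TOOLS = ["pipenv", "pkg-config"]

# Shared libraries named in the list. The short names given to
# find_library() are assumptions about how they are usually installed.
REQUIRED_LIBS = {"libmagic": "magic", "libsodium": "sodium"}


def check_prerequisites(version_info=sys.version_info):
    """Return a dict mapping each prerequisite name to True or False."""
    results = {"python 3.8": tuple(version_info[:2]) == (3, 8)}
    for tool in REQUIRED_TOOLS:
        # shutil.which() returns None when the tool is not on PATH.
        results[tool] = shutil.which(tool) is not None
    for label, short_name in REQUIRED_LIBS.items():
        # find_library() returns None when no matching library is found.
        results[label] = ctypes.util.find_library(short_name) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        status = "ok" if ok else "MISSING"
        print(f"{status:8}{name}")
```

The CI script remains the authoritative list of dependencies; a `MISSING` entry here only suggests where to look first.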

It is recommended to set the env variable `PIPENV_VENV_IN_PROJECT=true` when
working with pipenv. You can include this in a `.env` file.

There is a Makefile which helps with the basics. Eg:

    # install deps using pipenv
    make deps

    # run python tests
    make test

    # run code formatting and lint checks
    make fmt lint

Sometimes when developing it is helpful to enter a shell with pipenv, eg:

    pipenv shell

Often when developing it is helpful (or necessary) to set environment
variables. `pipenv shell` reads from `.env`, so you can copy `example.env` to
`.env` and edit it; the values will then be used in tests, `pipenv shell`, etc.
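
The `.env` setup described above can be scripted. The following is a minimal sketch, not something shipped with this repository: the helper name `ensure_env_file` is hypothetical, and it assumes `example.env` sits in the project directory. It copies `example.env` to `.env` only when `.env` does not exist yet, then appends `PIPENV_VENV_IN_PROJECT=true` if no line sets that variable.

```python
import shutil
from pathlib import Path

VENV_SETTING = "PIPENV_VENV_IN_PROJECT=true"


def ensure_env_file(project_dir):
    """Create .env from example.env if needed and return the .env path."""
    project = Path(project_dir)
    env_path = project / ".env"
    example_path = project / "example.env"

    # Never overwrite an existing .env: it may hold local edits.
    if not env_path.exists():
        if example_path.exists():
            shutil.copyfile(example_path, env_path)
        else:
            env_path.touch()

    lines = env_path.read_text().splitlines()
    # Append the pipenv setting only if no line already sets the variable.
    already_set = any(
        line.strip().startswith("PIPENV_VENV_IN_PROJECT=") for line in lines
    )
    if not already_set:
        lines.append(VENV_SETTING)
        env_path.write_text("\n".join(lines) + "\n")
    return env_path
```

Calling `ensure_env_file(".")` from this directory would prepare `.env`; you would still need to edit the copied values by hand.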