This directory contains the `sandcrawler` Python code for ingest pipelines, batch
processing, PDF extraction, etc.


## Development Quickstart

As of December 2022, working with this code requires:

- Python 3.8 specifically (this version is pinned in the `pipenv` configuration)
- `pipenv` for Python dependency management
- generic and Python-specific build tools (`pkg-config`, `python-dev`, etc.)
- poppler (PDF processing library)
- libmagic
- libsodium
- access to IA internal packages (`devpi.us.archive.org`), specifically for
  globalwayback and related packages

In production and CI we use Ubuntu Focal (20.04). The CI script for this
repository (`../.gitlab-ci.yml`) is the best place to look for a complete list
of dependencies for both development and deployment. Note that our CI system
runs from our cluster, which resolves the devpi access issue. For developer
laptops, you may need `sshuttle` or something similar set up to do initial
package pulls.
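
As a rough illustration, part of the requirements list above can be spot-checked from Python before running `make deps`. This is only a sketch, not part of the repository: the function names are hypothetical, and it assumes that `pipenv` and `pkg-config` are visible as executables on `PATH`. Libraries such as poppler, libmagic, and libsodium are not commands, so they are not checked here; the CI script remains the authoritative list.

```python
import shutil
import sys

# Tool names come from the requirements list; treating them as executables
# on PATH is an assumption. Libraries (poppler, libmagic, libsodium) are
# deliberately not checked.
REQUIRED_TOOLS = ["pipenv", "pkg-config"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_version_ok(version_info=sys.version_info, required=(3, 8)):
    """Return True if the interpreter's major.minor equals `required`."""
    return tuple(version_info[:2]) == required


if __name__ == "__main__":
    if not python_version_ok():
        print(f"warning: expected Python 3.8, found {sys.version.split()[0]}")
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"warning: `{tool}` not found on PATH")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the defaults are enough.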

It is recommended to set the environment variable `PIPENV_VENV_IN_PROJECT=true`
when working with pipenv. You can include this in a `.env` file.

There is a Makefile which helps with the basics. For example:

    # install deps using pipenv
    make deps

    # run python tests
    make test

    # run code formatting and lint checks
    make fmt lint

Sometimes when developing it is helpful to enter a shell with pipenv, e.g.:

    pipenv shell

Often when developing it is helpful (or necessary) to set environment
variables. `pipenv shell` reads from `.env`, so you can copy `example.env` to
`.env` and edit it; the values will then be used in tests, `pipenv shell`, etc.
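
The `.env` setup described above could be scripted along the following lines. This is a minimal sketch under stated assumptions: `init_env_file` is a hypothetical helper (not something this repository provides), and it assumes `.env` uses simple `KEY=value` lines. It copies `example.env` to `.env` only when `.env` does not exist yet, and appends keys only when they are not already defined.

```python
import shutil
from pathlib import Path


def init_env_file(example="example.env", target=".env", extra=None):
    """Create `target` from `example` if missing, then append any unset keys.

    An existing file and existing keys are never overwritten.
    """
    example_path, target_path = Path(example), Path(target)
    if not target_path.exists():
        shutil.copyfile(example_path, target_path)

    # Collect keys already defined, skipping comments and lines without "=".
    text = target_path.read_text()
    defined = {
        line.split("=", 1)[0].strip()
        for line in text.splitlines()
        if "=" in line and not line.lstrip().startswith("#")
    }

    additions = [f"{k}={v}" for k, v in (extra or {}).items() if k not in defined]
    if additions:
        # Make sure the appended lines start on a line of their own.
        prefix = "" if (not text or text.endswith("\n")) else "\n"
        target_path.write_text(text + prefix + "\n".join(additions) + "\n")
    return target_path
```

Run from this directory, a call such as `init_env_file(extra={"PIPENV_VENV_IN_PROJECT": "true"})` would leave a `.env` that still needs to be edited by hand for any values specific to your environment.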