Tutorial: Running PyOkera in a Container
This self-service tutorial guides you through a more advanced task: Installing and using PyOkera in a container (using Docker).
Difficulty: Intermediate
Time needed: 4 hours
Introduction
There are many options for accessing a running Okera cluster, including the Python native library provided by Okera, called PyOkera.
Since Python is a dynamic runtime environment, it is not easily possible, for example, to build a monolithic, statically linked executable file (see Executable and Linkable Format for details) that can run in a minimal container, such as provided by the Docker scratch image.
Instead, it is common in Python to install all required libraries first, and then run a script in that context.
Installing libraries in Python is not necessarily lightweight, as some require a C or C++ compiler to be present, usually provided by the operating system-specific packages containing the GNU Compiler Collection (GCC).
For PyOkera, this is also the case because dependent libraries (such as JWT) and reference libraries such as Python Cryptography are dynamically linked by default to the Linux C++ system libraries (libstdc++) during installation.
The difficulty when using PyOkera is then building an image that is not only as small as possible but also contains as little additional functionality as possible, which avoids unnecessary findings when it is scanned for vulnerabilities.
Fixing a large base image provided by someone else is much more difficult than using specific images trimmed for size and functionality, such as Alpine Linux.
This tutorial introduces you to the necessary resources provided by Okera and how to use them in a minimal container image.
Docker Image Basics
The most common approach to packaging an application, or in this case a Python script, is to build a container image, an approach made popular by Docker.
A special file, called Dockerfile, defines what the container image should contain, allowing you to copy files into the virtual file system and run scripts to install any necessary dependencies.
Anything you pull in is added to the files already present in the base image, so you need to choose one that contains the necessary libraries and applications or install them as part of the image you are building.
Here is an example Dockerfile that uses a Python 3.10.5 container image and then installs PyOkera:
FROM python:3.10.5
COPY src /app
RUN pip install --disable-pip-version-check pyokera
CMD ["python", "/app/run.py"]
Note: This tutorial assumes you are in your OS shell or terminal and in the directory where the Dockerfile is located. In addition, you will need to install the docker command-line tool if it is not already installed.
Let's build the image using the docker command-line tool and then run the image in a container:
$ docker build -t pyokera-docker:0.1-large .
[+] Building 288.2s (9/9) FINISHED
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 162B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/python:3.10.5 225.4s
=> [auth] library/python:pull token for registry-1.docker.io 0.0s
=> [1/3] FROM docker.io/library/python:3.10.5@sha256:b7bfea0126f...83d12c848942fa 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 83B 0.0s
=> CACHED [2/3] COPY src /app 0.0s
=> [3/3] RUN pip install --disable-pip-version-check pyokera 59.5s
=> exporting to image 2.9s
=> => exporting layers 2.7s
=> => writing image sha256:c32bf38d5...b9d59b3caddf245d8998 0.0s
=> => naming to docker.io/library/pyokera-docker:0.1-large
$ docker run --rm pyokera-docker:0.1-large
2.9.0
The src/run.py file that is copied into /app/run.py inside the container contains this simple check that PyOkera is available and displays its version:
import okera.odas
print(okera.odas.version())
You can see that running the image executes the Python script and prints 2.9.0 as the PyOkera version (which was the latest PyOkera version available when this example was run).
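If you want the check to report a missing installation instead of failing with an import error, one option is to ask the package metadata. The following is a minimal sketch, not part of the Okera-provided scripts: the helper name installed_version is made up for illustration, and it reads the version recorded by pip rather than calling okera.odas.version().

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pyokera" is the distribution name used with pip in the Dockerfile.
    version = installed_version("pyokera")
    print(version if version else "pyokera is not installed in this environment")
```

Because the lookup never imports the library, it also works in an image where a shared library needed by PyOkera is missing.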
Next, let's look at the size of this image:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
pyokera-docker 0.1-large c32bf38d5dac 10 minutes ago 966MB
...
A whopping 966MB, just to use a simple Python script! But what is worse is that it contains so much unnecessary code, increasing the risk of vulnerabilities found by scanning the image using a common scanner:
$ docker scan pyokera-docker:0.1-large
Testing pyokera-docker:0.1-large...
✗ Low severity vulnerability found in wget
Description: Open Redirect
Info: https://snyk.io/vuln/SNYK-DEBIAN11-WGET-1277610
Introduced through: wget@1.21-1+deb11u1
From: wget@1.21-1+deb11u1
...
✗ Critical severity vulnerability found in aom/libaom0
Description: Release of Invalid Pointer or Reference
Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1290331
Introduced through: imagemagick@8:6.9.11.60+dfsg-1.3
From: imagemagick@8:6.9.11.60+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:6.9.11.60+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:6.9.11.60+dfsg-1.3 > libheif/libheif1@1.11.0-1 > aom/libaom0@1.0.0.errata1-3
✗ Critical severity vulnerability found in aom/libaom0
Description: Use After Free
Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1298721
Introduced through: imagemagick@8:6.9.11.60+dfsg-1.3
From: imagemagick@8:6.9.11.60+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:6.9.11.60+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:6.9.11.60+dfsg-1.3 > libheif/libheif1@1.11.0-1 > aom/libaom0@1.0.0.errata1-3
✗ Critical severity vulnerability found in aom/libaom0
Description: Buffer Overflow
Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1300249
Introduced through: imagemagick@8:6.9.11.60+dfsg-1.3
From: imagemagick@8:6.9.11.60+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:6.9.11.60+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:6.9.11.60+dfsg-1.3 > libheif/libheif1@1.11.0-1 > aom/libaom0@1.0.0.errata1-3
Package manager: deb
Project name: docker-image|pyokera-docker
Docker image: pyokera-docker:0.1-large
Platform: linux/amd64
Base image: python:3.10.5-bullseye
Tested 427 dependencies for known vulnerabilities, found 262 vulnerabilities.
Base Image Vulnerabilities Severity
python:3.10.5-bullseye 262 5 critical, 28 high, 6 medium, 223 low
Recommendations for base image upgrade:
Alternative image types
Base Image Vulnerabilities Severity
python:3.11.0b1-slim-buster 81 0 critical, 1 high, 0 medium, 80 low
python:3.11-rc-slim-buster 81 0 critical, 1 high, 0 medium, 80 low
python:3.11.0b1-slim-bullseye 46 2 critical, 0 high, 0 medium, 44 low
python:3.10-slim 46 2 critical, 0 high, 0 medium, 44 low
For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp
Even the recommended rc (for release candidate) or beta images of Python have high or critical vulnerabilities associated with them.
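To compare base images in an automated job, the severity summary printed by the scanner can be turned into numbers. This is only a sketch: it assumes the summary keeps the textual form shown in the scan output (for example "5 critical, 28 high, 6 medium, 223 low"), which is not a guaranteed interface, and the function names and thresholds are hypothetical.

```python
import re
from typing import Dict

# Matches pairs such as "28 high" in the scanner's summary column.
SEVERITY_PATTERN = re.compile(r"(\d+)\s+(critical|high|medium|low)")


def parse_severities(summary: str) -> Dict[str, int]:
    """Turn '5 critical, 28 high, 6 medium, 223 low' into a dict of counts."""
    return {level: int(count) for count, level in SEVERITY_PATTERN.findall(summary)}


def exceeds_threshold(counts: Dict[str, int], max_critical: int = 0, max_high: int = 0) -> bool:
    """True if the image has more critical or high findings than allowed."""
    return counts.get("critical", 0) > max_critical or counts.get("high", 0) > max_high


if __name__ == "__main__":
    counts = parse_severities("5 critical, 28 high, 6 medium, 223 low")
    print(counts, "total:", sum(counts.values()))
    print("reject image:", exceeds_threshold(counts))
```

With the default thresholds of zero, every image listed in the scan output would be rejected, which matches the observation that none of the suggested alternatives is free of high or critical findings.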
Build a Better Image
As shown, choosing a large image with anything you need or installing everything yourself as part of the image building process adds to the complexity of the image and increases the likelihood of introducing possible security vulnerabilities.
There is a better approach, more reminiscent of development processes, since many of the included libraries and applications are only needed during the installation of the Python libraries. Once they are built, only dynamically linked libraries need to be present, but no compiler framework.
Akin to Maven artifacts that can be scoped for testing or development only, the Dockerfile syntax allows us to first install and build the dependent libraries, and then drop everything else.
To do this, we use a special Dockerfile feature, known as a multi-stage build, where a second FROM line allows us to switch the base image while still allowing us to refer to the initial base image with anything added before the switch.
The image template file now looks like this:
FROM python:3.10.5 AS builder
RUN python3 -m venv /venv && \
/venv/bin/pip install --disable-pip-version-check pyokera
FROM python:3.10.5-alpine3.16 AS runner
COPY --from=builder /venv /venv
COPY ./src /app
CMD ["sh", "/app/run.sh"]
The steps are as follows:
- Set the FROM to the heavy base image that contains the required binaries (such as the compiler).
- Create a Python virtual environment (venv), creating a clean installation location.
- Install PyOkera using the designated pip command-line tool of the venv.
- Switch base images to a much lighter option, here based on Alpine Linux, resetting the image's file system layers.
- Copy the compiled libraries from the venv of the previous build stage (using the --from option) into the fresh one.
- Also copy our application or scripts.
- Set a shell script as the default command when the image is run in a container.
The last line, the CMD option, is now pointing to the following script, called /app/run.sh:
#!/bin/sh
source /venv/bin/activate
python /app/run.py
The only difference from our earlier build is that because we maintain the venv as a whole, we are also required to activate it so that the subsequent call to python is able to find all libraries, including PyOkera.
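Activation mainly places /venv/bin first on the PATH, so that python resolves to the interpreter inside the venv. To verify which interpreter a script actually runs under, a small check like the following can help. It is a sketch using only the standard library, and the function name in_virtualenv is chosen here for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside venv:", in_virtualenv())
```

Run through the venv's interpreter, the check is expected to report True; run through the image's system python without activation, it reports False and imports of PyOkera would fail.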
Building and running the new image is as easy as before and yields this outcome:
$ docker build -t pyokera-docker:0.1-small .
[+] Building 0.2s (11/11) FINISHED
...
$ docker run --rm pyokera-docker:0.1-small
Traceback (most recent call last):
File "/app/run.py", line 1, in <module>
import okera.odas
File "/venv/lib/python3.10/site-packages/okera/__init__.py", line 45, in <module>
from okera.odas import context, version
File "/venv/lib/python3.10/site-packages/okera/odas.py", line 15, in <module>
import jwt
File "/venv/lib/python3.10/site-packages/jwt/__init__.py", line 17, in <module>
from .jwa import std_hash_by_alg
File "/venv/lib/python3.10/site-packages/jwt/jwa.py", line 26, in <module>
from cryptography.hazmat.primitives.asymmetric import padding
File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/padding.py", line 13, in <module>
from cryptography.hazmat.primitives.asymmetric import rsa
File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/rsa.py", line 12, in <module>
from cryptography.hazmat.primitives.asymmetric import (
File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/utils.py", line 6, in <module>
from cryptography.hazmat.bindings._rust import asn1
ImportError: Error loading shared library libgcc_s.so.1: No such file or directory (needed by /venv/lib/python3.10/site-packages/cryptography/hazmat/bindings/_rust.abi3.so)
What is missing are the dynamically linked libraries, here libgcc_s.so (and, though not shown here, the GNU Compatibility Layer).
This can easily be fixed by adding the following RUN statement as the second-to-last line in the Dockerfile:
...
RUN apk add --no-cache libstdc++ gcompat
CMD ["sh", "/app/run.sh"]
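Before rebuilding, it can be useful to confirm which shared libraries the runtime image is really missing. The sketch below asks the dynamic loader directly through ctypes. The helper name missing_libraries is hypothetical; libgcc_s.so.1 comes from the traceback, while libstdc++.so.6 is the usual file name of the C++ runtime and is listed here as an assumption.

```python
import ctypes
from typing import Iterable, List


def missing_libraries(names: Iterable[str]) -> List[str]:
    """Return the shared libraries that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means every library could be loaded.
    print(missing_libraries(["libgcc_s.so.1", "libstdc++.so.6"]))
```

Running this inside the container before and after adding the apk packages shows whether the RUN statement closed the gap.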
After building it again, we can execute the container and get our result back:
$ docker run --rm pyokera-docker:0.1-small
2.9.0
But what is more important is the new size and security scan result:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
pyokera-docker 0.1-small c4c0e9fef845 About a minute ago 95.5MB
...
$ docker scan pyokera-docker:0.1-small
Testing pyokera-docker:0.1-small...
Package manager: apk
Project name: docker-image|pyokera-docker
Docker image: pyokera-docker:0.1-small
Platform: linux/amd64
Base image: python:3.10.5-alpine3.16
✔ Tested 42 dependencies for known vulnerabilities, no vulnerable paths found.
According to our scan, you are currently using the most secure version of the selected base image
For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp
No known vulnerabilities were found while scanning the new image! The size of the newly created container image is largely determined by its base image. For comparison, the two base images used in the build have these sizes:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
...
python 3.10.5-alpine3.16 27edb73bd1fc 8 days ago 47.6MB
python 3.10.5 6bb8bdb609b6 8 days ago 920MB
Bottom line: we now have a small, clean container image we can use to add only what is needed for the script to run.
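To put the numbers from the docker images listings into perspective, the saving can be worked out with a few lines of arithmetic. This is a small illustrative sketch; the function name reduction_percent is made up, and the sizes are the ones reported above (966MB, 95.5MB, and 47.6MB).

```python
def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Percentage by which the image shrank, rounded to one decimal place."""
    return round((1 - after_mb / before_mb) * 100, 1)


if __name__ == "__main__":
    # Large single-stage image versus the small multi-stage image.
    print("reduction in percent:", reduction_percent(966, 95.5))
    # What the venv and the apk packages add on top of the Alpine base image.
    print("added on top of base in MB:", round(95.5 - 47.6, 1))
```

The second figure is a reminder that roughly half of the small image is the Alpine-based Python base image itself.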
Next Steps
Let's extend these tutorial steps by moving the dependencies into a requirements.txt file, which for the sake of simplicity looks like this:
pyokera
It only lists PyOkera, but you can add anything else required by your Python scripts.
Another change we'll implement is to use the linked Python executable from the venv environment directly, removing the extra step of wrapping the script invocation in a shell script.
So the next iteration of the Dockerfile now looks like this:
FROM python:3.10.5 AS builder
COPY requirements.txt /requirements.txt
RUN python3 -m venv /venv && \
/venv/bin/pip install --disable-pip-version-check -r /requirements.txt
FROM python:3.10.5-alpine3.16 AS runner
COPY --from=builder /venv /venv
COPY ./src /app
RUN apk add --no-cache libstdc++ gcompat
CMD ["/venv/bin/python", "/app/report.py"]
Note that we also changed the name of the Python script to /app/report.py so we can use it to generate a metadata report when the image is run in a container.
The script requires details about the cluster and a token for authentication, which we will provide using environment variables (short: envvars), as is common in containers. They are then read in the code using the Python os module:
import os
from okera import context
ctx = context()
# Authenticate with the access token passed in through the environment.
ctx.enable_token_auth(token_str=os.environ['TOKEN'])
with ctx.connect(host=os.environ['PLANNER_HOST'], port=int(os.environ['PLANNER_PORT'])) as conn:
    all_datasets = []
    for db in conn.list_databases():
        # extend() adds each dataset name, so the total counts datasets rather than databases.
        all_datasets.extend(conn.list_dataset_names(db))
    print("Number of datasets:", len(all_datasets))
The PyOkera documentation provides more information about the available metadata-related API calls.
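Reading os.environ directly raises a bare KeyError when a variable was not passed to the container. The following sketch shows one way to validate the three variables up front and fail with a readable message. The function name read_settings and the returned dictionary keys are assumptions made for this example, not part of PyOkera.

```python
import os
from typing import Dict, Mapping

# The variable names used by the report script.
REQUIRED_VARS = ("TOKEN", "PLANNER_HOST", "PLANNER_PORT")


def read_settings(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Collect the connection settings, failing early if something is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    try:
        port = int(env["PLANNER_PORT"])
    except ValueError:
        raise RuntimeError("PLANNER_PORT must be an integer, got %r" % env["PLANNER_PORT"])
    return {"token": env["TOKEN"], "host": env["PLANNER_HOST"], "port": port}


if __name__ == "__main__":
    try:
        settings = read_settings()
        print("Connecting to %s:%d" % (settings["host"], settings["port"]))
    except RuntimeError as error:
        print(error)
```

The returned values could then be handed to enable_token_auth() and connect() in the same way the script does with the raw environment lookups.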
After building the image again and setting the envvars as appropriate for your cluster, including the token from the WebUI (see Copying your Access Token), running the container shows the desired output:
$ docker build -t pyokera-docker:0.2-small .
...
$ export TOKEN=ey....
$ export PLANNER_HOST=okera.foobar.com
$ export PLANNER_PORT=12050
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.2-small
Number of datasets: 16
Analyze the Data
Finally, let's use this setup to write a script that does some actual number crunching, that is, use the data API functions of PyOkera to read datasets and perform some analytical processing.
This typically requires adding some processing frameworks into the mix, here Pandas and NumPy, which we can easily add to the requirements.txt file:
pyokera
pandas
numpy
Note: We are still not pinning these libraries to a specific version, but you may want to do that. Otherwise, each image build pulls in the latest version, and that might not be what you want.
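A build job can flag unpinned entries before the image is created. The sketch below treats any requirement line without an exact "==" version as unpinned. That is a simplification of the requirements file syntax, and the helper name unpinned_requirements is hypothetical.

```python
from typing import Iterable, List


def unpinned_requirements(lines: Iterable[str]) -> List[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        # Skip blank lines and pip options such as "-r other.txt".
        if line and not line.startswith("-") and "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    print(unpinned_requirements(["pyokera", "pandas", "numpy"]))
```

For the file shown above, all three entries are reported. Pinning them, for example to the versions seen elsewhere in this tutorial (pyokera==2.9.0), would empty the list.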
Do not be surprised when checking the image size after adding these two powerful libraries, as they contain a lot of functionality:
$ docker build -t pyokera-docker:0.3 .
...
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
pyokera-docker 0.3 6fc504838438 29 seconds ago 216MB
...
The PyOkera Scanning API has a dedicated function, scan_as_pandas(), that reads a protected dataset.
It returns a Pandas DataFrame instance and requires that the Pandas libraries be installed, which we ensured by updating the requirements.txt file that is used during the image building process.
Keeping it simple for the sake of this tutorial, let's analyze the Okera-supplied example dataset called okera_sample.users.
The following amended script, named /app/analyze.py, contains a simple set of function calls that prints the data frame, the output of its info() function, and then counts the rows for each value of the gender column:
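import os
import pandas
from okera import context
ctx = context()
ctx.enable_token_auth(token_str=os.environ['TOKEN'])
with ctx.connect(host=os.environ['PLANNER_HOST'], port=int(os.environ['PLANNER_PORT'])) as conn:
    # scan_as_pandas() returns the dataset as a Pandas DataFrame.
    df = conn.scan_as_pandas('okera_sample.users')
    print(df)
    # info() prints its summary itself and returns None, so print() also emits "None".
    print(df.info())
    # Count the rows for each distinct value of the gender column.
    print(df['gender'].value_counts())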
Note: These scripts are stored in the src directory, which resides next to where the Dockerfile is placed. The COPY statement in the image specification file copies them into the /app/ directory inside the image when the building process runs.
You need to modify the Dockerfile so that its CMD line now calls the newly created script:
...
CMD ["/venv/bin/python", "/app/analyze.py"]
Building the container image and executing it yields this surprise output:
$ docker build -t pyokera-docker:0.4 .
...
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.4
Traceback (most recent call last):
File "/app/analyze.py", line 2, in <module>
import pandas
File "/venv/lib/python3.10/site-packages/pandas/__init__.py", line 16, in <module>
raise ImportError(
ImportError: Unable to import required dependencies:
numpy:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.10 from "/venv/bin/python"
* The NumPy version is: "1.22.4"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: Error relocating /venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so: backtrace: symbol not found
...
This means that we are out of luck with the Alpine Linux-based base image, which worked well so far and allowed us to call the PyOkera metadata API functions. Alpine Linux is very small in terms of footprint, but it is built on the musl C library rather than the GNU C library (glibc) that the NumPy binaries installed in the builder stage were linked against. The backtrace symbol named in the error above is one of the low-level functions that glibc provides and that Alpine, even with the compatibility layer installed, does not. Unfortunately, we have to abandon this setup method.
Note: With enough tenacity you could build the libraries from source during the image building process. But there are other shortcomings of Alpine Linux in the context of Python and analyzing data with Pandas that make switching the base image inevitable.
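To see which C library an image provides before copying compiled packages into it, the standard library offers a quick probe. This is a hedged sketch: platform.libc_ver() reports the GNU C library on Debian-based images, while on other C libraries it typically returns empty strings, so an empty result is only a hint and not proof of a specific alternative. The function name describe_libc is made up for this example.

```python
import platform


def describe_libc() -> str:
    """Describe the C library the running Python interpreter is linked against."""
    name, version = platform.libc_ver()
    if name:
        return f"{name} {version}".strip()
    # Empty strings mean the probe did not recognise glibc in the interpreter binary.
    return "unknown (glibc not detected)"


if __name__ == "__main__":
    print("C library:", describe_libc())
```

Running the same probe in the builder stage and in the runtime stage makes a mismatch between the two visible early, before an import fails at run time.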
With little effort, we can modify the Dockerfile to use the Python 3.10.5-slim image in the second phase of the image building process. This Python image is missing the GCC compiler suite and other tools required to build binary executables or libraries, but since we do not need them in the final image, the switch to the slim version is fine.
More importantly, this base image is compatible with the dynamically linked libraries for NumPy, avoiding the error above.
The image template file now looks like this:
FROM python:3.10.5 AS builder
COPY requirements.txt /requirements.txt
RUN python3 -m venv /venv && \
/venv/bin/pip install --disable-pip-version-check -r /requirements.txt
FROM python:3.10.5-slim AS runner
COPY --from=builder /venv /venv
COPY ./src /app
CMD ["/venv/bin/python", "/app/analyze.py"]
Building it shows its size as just under 300 MB, which is still better compared to the nearly 1 GB the full Python image required.
$ docker build -t pyokera-docker:0.5 .
...
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
pyokera-docker 0.5 8184f048a714 3 minutes ago 291MB
Running this image in a container completes as expected now:
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.5
uid ... ccn
0 b'0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D' ... b'3771-2680-8616-9487'
1 b'00071AA7-86D2-4EB9-871A-A786D27EB9BA' ... b'4539-9934-1924-5730'
2 b'00071B7D-31AF-4D85-871B-7D31AFFD852E' ... b'5580-7529-3663-6698'
3 b'0007967E-F188-4598-9C7C-E64390482CFB' ... b'6011-0440-7310-9221'
4 b'000B90B2-92DC-4A7A-8B90-B292DC9A7A71' ... b'4532-7129-7160-3161'
... ... ... ...
38450 b'FFF69D73-CE85-4BD6-B10F-9F9F25CD7A74' ... b'5231-1531-1716-3779'
38451 b'FFF9E6CB-D3A2-455F-B5CF-6B8EC4E80ABE' ... b'3488-6066-8349-1302'
38452 b'FFFB1C5E-37B6-453A-83FB-86C580D18AE8' ... b'4539-9265-2937-2473'
38453 b'NULL' ... b'5137-0001-0716-5201'
38454 b'null' ... b'3447-3737-8526-2455'
[38455 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38455 entries, 0 to 38454
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 uid 38455 non-null object
1 dob 38455 non-null object
2 gender 38455 non-null object
3 ccn 38455 non-null object
dtypes: object(4)
memory usage: 1.2+ MB
None
b'M' 15660
b'F' 14463
b'U' 8238
b' ' 94
Name: gender, dtype: int64
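The values in the output above are byte strings such as b'M', and one entry is blank. If readable labels are preferred, the values can be decoded before counting. The sketch below mirrors the counting step with the standard library only, so it runs without Pandas. The sample values are invented for illustration and are not taken from the dataset, and the function name count_values is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_values(values: Iterable[bytes]) -> List[Tuple[str, int]]:
    """Decode byte strings and count them, most frequent first."""
    # Blank entries are given an explicit label so they stay visible in the report.
    decoded = (value.decode("utf-8").strip() or "<blank>" for value in values)
    return Counter(decoded).most_common()


if __name__ == "__main__":
    sample = [b"M", b"M", b"M", b"F", b"F", b" "]  # made-up sample values
    for label, count in count_values(sample):
        print(label, count)
```

Like value_counts(), most_common() returns the entries ordered from the highest count to the lowest.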
But as you may have expected, the slim image is not free of vulnerabilities:
$ docker scan pyokera-docker:0.5
...
Package manager: deb
Project name: docker-image|pyokera-docker
Docker image: pyokera-docker:0.5
Platform: linux/amd64
Base image: python:3.10.5-slim-bullseye
Tested 106 dependencies for known vulnerabilities, found 46 vulnerabilities.
Base Image Vulnerabilities Severity
python:3.10.5-slim-bullseye 46 2 critical, 0 high, 0 medium, 44 low
Recommendations for base image upgrade:
Alternative image types
Base Image Vulnerabilities Severity
python:3.11.0b1-slim-buster 81 0 critical, 1 high, 0 medium, 80 low
python:3.11-rc-slim-buster 81 0 critical, 1 high, 0 medium, 80 low
For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp
These vulnerabilities would need to be discussed with your company's information security team. In fact, often the engineering team is already providing sanctioned base images that the scanners know to accept without (too much) complaining. Another option is to start with another bare minimum base Linux image and then add Python and all its required dependencies on your own - although there is no easy quick fix with this solution.
This concludes this tutorial, which showed you how to use multi-stage container image builds to reduce complexity and containerize PyOkera. Using cron-like scheduling in Kubernetes, or a scheduling engine based on a container runtime, allows you to run your script fully automated without compromising on the ability to control the script's dependencies in a clean way.