Tutorial: Running PyOkera in a Container

This self-service tutorial guides you through a more advanced task: Installing and using PyOkera in a container (using Docker).

Difficulty: Intermediate
Time needed: 4 hours

Introduction

There are many options for accessing a running Okera cluster, including PyOkera, the native Python library provided by Okera. Because Python is a dynamic runtime environment, it is not easy to build a monolithic, statically linked executable (see Executable and Linkable Format for details) that can run in a minimal container, such as one based on the Docker scratch image. Instead, it is common in Python to install all required libraries first and then run a script in that context. Installing Python libraries is not necessarily lightweight, because some require a C or C++ compiler, usually provided by the operating system packages for the GNU Compiler Collection (GCC). This is also the case for PyOkera, because its dependencies (such as JWT) and the libraries they reference (such as Python Cryptography) are dynamically linked by default to the Linux C++ system libraries (libstdc++) during installation.

The difficulty when using PyOkera is therefore to build an image that is not only as small as possible but also contains as little additional functionality as possible, which avoids unnecessary findings when the image is scanned for vulnerabilities. Fixing a large base image provided by someone else is much more difficult than starting from an image trimmed for size and functionality, such as Alpine Linux.

This tutorial introduces you to the necessary resources provided by Okera and how to use them in a minimal container image.
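To make the point about compiled dependencies more concrete, the following sketch lists which installed packages ship native shared objects. It is only an illustration: the function name native_extensions is made up for this example, and it relies on nothing beyond the Python standard library.

```python
import sysconfig
from collections import defaultdict
from pathlib import Path


def native_extensions(site_packages):
    """Map top-level package name -> list of compiled files found under it."""
    root = Path(site_packages)
    found = defaultdict(list)
    for path in root.rglob("*"):
        # Shared objects (.so, .so.1, ...) on Linux, .pyd on Windows.
        if path.is_file() and (".so" in path.suffixes or path.suffix == ".pyd"):
            top_level = path.relative_to(root).parts[0]
            found[top_level].append(path.name)
    return dict(found)


if __name__ == "__main__":
    # "purelib" is the site-packages directory of the running interpreter.
    site_packages = sysconfig.get_paths()["purelib"]
    for package, files in sorted(native_extensions(site_packages).items()):
        print(f"{package}: {len(files)} compiled file(s)")
```

Packages that show up in this list are the ones that may depend on system libraries in the final image.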

Docker Image Basics

The most common way to package an application, or in this case a Python script, is to build a container image, an approach made popular by Docker. A special file, called Dockerfile, defines what the container image should contain, allowing you to copy files into the virtual file system and run scripts to install any necessary dependencies. Anything you pull in is added to the files already present in the base image, so you need to choose a base image that already contains the necessary libraries and applications, or install them as part of the image you are building.

Here is an example Dockerfile that uses a Python 3.10.5 container image and then installs PyOkera:

FROM python:3.10.5

COPY src /app
RUN pip install --disable-pip-version-check pyokera

CMD ["python", "/app/run.py"]

Note: This tutorial assumes you are in your OS shell or terminal, in the directory where the Dockerfile is located. In addition, you need to install the docker command-line tool if it is not already installed.

Let's build the image using the docker command-line tool and then run the image in a container:

$ docker build -t pyokera-docker:0.1-large .
[+] Building 288.2s (9/9) FINISHED
 => [internal] load build definition from Dockerfile                                0.1s
 => => transferring dockerfile: 162B                                                0.0s
 => [internal] load .dockerignore                                                   0.0s
 => => transferring context: 2B                                                     0.0s
 => [internal] load metadata for docker.io/library/python:3.10.5                  225.4s
 => [auth] library/python:pull token for registry-1.docker.io                       0.0s
 => [1/3] FROM docker.io/library/python:3.10.5@sha256:b7bfea0126f...83d12c848942fa  0.0s
 => [internal] load build context                                                   0.0s
 => => transferring context: 83B                                                    0.0s
 => CACHED [2/3] COPY src /app                                                      0.0s
 => [3/3] RUN pip install --disable-pip-version-check pyokera                      59.5s
 => exporting to image                                                              2.9s
 => => exporting layers                                                             2.7s
 => => writing image sha256:c32bf38d5...b9d59b3caddf245d8998                        0.0s
 => => naming to docker.io/library/pyokera-docker:0.1-large

$ docker run --rm pyokera-docker:0.1-large
2.9.0

The src/run.py file that is copied into /app/run.py inside the container contains this simple check that PyOkera is available and displays its version:

import okera.odas
print(okera.odas.version())

You can see that running the image executes the Python script and prints 2.9.0 as the PyOkera version (which was the latest PyOkera version available when this example was run). Next, let's look at the size of this image:

$ docker images
REPOSITORY                    TAG         IMAGE ID       CREATED          SIZE
pyokera-docker                0.1-large   c32bf38d5dac   10 minutes ago   966MB
...

A whopping 966MB, just to use a simple Python script! But what is worse is that it contains so much unnecessary code, increasing the risk of vulnerabilities found by scanning the image using a common scanner:

$ docker scan pyokera-docker:0.1-large

Testing pyokera-docker:0.1-large...

✗ Low severity vulnerability found in wget
  Description: Open Redirect
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-WGET-1277610
  Introduced through: wget@1.21-1+deb11u1
  From: wget@1.21-1+deb11u1

...

✗ Critical severity vulnerability found in aom/libaom0
  Description: Release of Invalid Pointer or Reference
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1290331
  Introduced through: imagemagick@8:6.9.11.60+dfsg-1.3
  From: imagemagick@8:6.9.11.60+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:6.9.11.60+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:6.9.11.60+dfsg-1.3 > libheif/libheif1@1.11.0-1 > aom/libaom0@1.0.0.errata1-3

✗ Critical severity vulnerability found in aom/libaom0
  Description: Use After Free
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1298721
  Introduced through: imagemagick@8:6.9.11.60+dfsg-1.3
  From: imagemagick@8:6.9.11.60+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:6.9.11.60+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:6.9.11.60+dfsg-1.3 > libheif/libheif1@1.11.0-1 > aom/libaom0@1.0.0.errata1-3

✗ Critical severity vulnerability found in aom/libaom0
  Description: Buffer Overflow
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1300249
  Introduced through: imagemagick@8:6.9.11.60+dfsg-1.3
  From: imagemagick@8:6.9.11.60+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:6.9.11.60+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:6.9.11.60+dfsg-1.3 > libheif/libheif1@1.11.0-1 > aom/libaom0@1.0.0.errata1-3

Package manager:   deb
Project name:      docker-image|pyokera-docker
Docker image:      pyokera-docker:0.1-large
Platform:          linux/amd64
Base image:        python:3.10.5-bullseye

Tested 427 dependencies for known vulnerabilities, found 262 vulnerabilities.

Base Image              Vulnerabilities  Severity
python:3.10.5-bullseye  262              5 critical, 28 high, 6 medium, 223 low

Recommendations for base image upgrade:

Alternative image types
Base Image                     Vulnerabilities  Severity
python:3.11.0b1-slim-buster    81               0 critical, 1 high, 0 medium, 80 low
python:3.11-rc-slim-buster     81               0 critical, 1 high, 0 medium, 80 low
python:3.11.0b1-slim-bullseye  46               2 critical, 0 high, 0 medium, 44 low
python:3.10-slim               46               2 critical, 0 high, 0 medium, 44 low

For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp

Even the recommended rc (for release candidate) or beta images of Python have high or critical vulnerabilities associated with them.

Build a Better Image

As shown, choosing a large image that contains everything you might need, or installing everything yourself as part of the image build, adds to the complexity of the image and increases the likelihood of introducing security vulnerabilities.

There is a better approach, more reminiscent of development processes, since many of the included libraries and applications are only needed during the installation of the Python libraries. Once they are built, only dynamically linked libraries need to be present, but no compiler framework. Akin to Maven artifacts that can be scoped for testing or development only, the Dockerfile syntax allows us to first install and build the dependent libraries, and then drop everything else.

To do this, we use a special Dockerfile feature, where a second FROM line allows us to switch the base image while still allowing us to refer to the initial base image with anything added before the switch. The image template file now looks like this:

FROM python:3.10.5 AS builder

RUN python3 -m venv /venv && \
    /venv/bin/pip install --disable-pip-version-check pyokera

FROM python:3.10.5-alpine3.16 AS runner

COPY --from=builder /venv /venv
COPY ./src /app

CMD ["sh", "/app/run.sh"]

The steps are as follows:

1. Set FROM to the heavy base image that contains the required binaries (such as the compiler).
2. Create a Python virtual environment (venv), which provides a clean installation location.
3. Install PyOkera using the venv's own pip command-line tool.
4. Switch to a much lighter base image, here based on Alpine Linux, which resets the image's file system layers.
5. Copy the compiled libraries from the venv of the previous build stage (using the --from option) into the new one.
6. Also copy our application or scripts.
7. Set a shell script as the default command when the image is run in a container.

The last line, the CMD option, is now pointing to the following script, called /app/run.sh:

#!/bin/sh
# "." is the POSIX way to source a file; "source" is a shell extension
. /venv/bin/activate
python /app/run.py

The only difference from our earlier build is that because we maintain the venv as a whole, we are also required to activate it so that the subsequent call to python is able to find all libraries, including PyOkera. Building and running the new image is as easy as before and yields this outcome:

$ docker build -t pyokera-docker:0.1-small .
[+] Building 0.2s (11/11) FINISHED
...
$ docker run --rm pyokera-docker:0.1-small
Traceback (most recent call last):
  File "/app/run.py", line 1, in <module>
    import okera.odas
  File "/venv/lib/python3.10/site-packages/okera/__init__.py", line 45, in <module>
    from okera.odas import context, version
  File "/venv/lib/python3.10/site-packages/okera/odas.py", line 15, in <module>
    import jwt
  File "/venv/lib/python3.10/site-packages/jwt/__init__.py", line 17, in <module>
    from .jwa import std_hash_by_alg
  File "/venv/lib/python3.10/site-packages/jwt/jwa.py", line 26, in <module>
    from cryptography.hazmat.primitives.asymmetric import padding
  File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/padding.py", line 13, in <module>
    from cryptography.hazmat.primitives.asymmetric import rsa
  File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/rsa.py", line 12, in <module>
    from cryptography.hazmat.primitives.asymmetric import (
  File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/utils.py", line 6, in <module>
    from cryptography.hazmat.bindings._rust import asn1
ImportError: Error loading shared library libgcc_s.so.1: No such file or directory (needed by /venv/lib/python3.10/site-packages/cryptography/hazmat/bindings/_rust.abi3.so)

What is missing are dynamically linked system libraries, here libgcc_s.so.1 (and, though not shown here, the GNU compatibility layer). This is easily fixed by adding the following RUN statement as the second-to-last line in the Dockerfile (the libstdc++ package also brings in libgcc as a dependency):

...
RUN apk add --no-cache libstdc++ gcompat

CMD ["sh", "/app/run.sh"]

After building it again, we can execute the container and get our result back:

$ docker run --rm pyokera-docker:0.1-small
2.9.0
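A missing shared library like the one above only surfaces when the affected module is imported. One way to catch this earlier is a small import smoke test, sketched below. The function name smoke_test and the idea of running it as an extra step are assumptions for illustration and are not part of PyOkera.

```python
import importlib


def smoke_test(module_names):
    """Try to import each module; return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            failures[name] = str(exc)
    return failures


# Example: the standard-library module imports fine, the made-up one does not.
print(smoke_test(["json", "hypothetical_missing_module"]))
```

Inside the image, the same function could be called with the names that matter for the script, for example okera, jwt, and cryptography, so that a broken image is detected before it is deployed.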

But what is more important is the new size and security scan result:

$ docker images
REPOSITORY          TAG          IMAGE ID       CREATED              SIZE
pyokera-docker      0.1-small    c4c0e9fef845   About a minute ago   95.5MB
...

$ docker scan pyokera-docker:0.1-small
Testing pyokera-docker:0.1-small...

Package manager:   apk
Project name:      docker-image|pyokera-docker
Docker image:      pyokera-docker:0.1-small
Platform:          linux/amd64
Base image:        python:3.10.5-alpine3.16

✔ Tested 42 dependencies for known vulnerabilities, no vulnerable paths found.

According to our scan, you are currently using the most secure version of the selected base image

For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp

No known vulnerabilities were found while scanning the new image! The size of the newly created container image is largely determined by its base image. For comparison, the two base images used here have these sizes:

$ docker images
REPOSITORY          TAG                 IMAGE ID       CREATED       SIZE
...
python              3.10.5-alpine3.16   27edb73bd1fc   8 days ago    47.6MB
python              3.10.5              6bb8bdb609b6   8 days ago    920MB

Bottom line: we now have a small, clean container image we can use to add only what is needed for the script to run.
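As a quick check of the numbers reported above, the following lines compute how much the venv, the application, and the extra apk packages add on top of the Alpine base image, and how much smaller the new image is than the first one. The variable names are chosen for this example only.

```python
# Sizes in MB as reported by `docker images` in this tutorial.
large_image = 966.0   # pyokera-docker:0.1-large
small_image = 95.5    # pyokera-docker:0.1-small
alpine_base = 47.6    # python:3.10.5-alpine3.16

added_on_top = small_image - alpine_base    # venv, app, and apk packages
reduction = 1 - small_image / large_image   # share of the large image saved

print(f"Added on top of the Alpine base: {added_on_top:.1f} MB")
print(f"Reduction versus the large image: {reduction:.1%}")
```

This prints 47.9 MB and 90.1%, so roughly half of the small image is the base image itself.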

Next Steps

Let's extend these tutorial steps by moving the dependencies into a requirements.txt file, which for the sake of simplicity looks like this:

pyokera

It only lists PyOkera, but you can add anything else required by your Python scripts. Another change we'll implement is to call the Python executable linked into the venv directly, which removes the need to wrap the script invocation in a shell script. The next iteration of the Dockerfile now looks like this:

FROM python:3.10.5 AS builder

COPY requirements.txt /requirements.txt
RUN python3 -m venv /venv && \
    /venv/bin/pip install --disable-pip-version-check -r /requirements.txt

FROM python:3.10.5-alpine3.16 AS runner

COPY --from=builder /venv /venv
COPY ./src /app

RUN apk add --no-cache libstdc++ gcompat

CMD ["/venv/bin/python", "/app/report.py"]

Note that we also changed the name of the Python script to /app/report.py so we can use it to generate a metadata report when the image is run in a container.
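Calling /venv/bin/python directly works because the interpreter inside a venv finds its own environment without the activate script. The sketch below demonstrates this with the standard venv module. It assumes a POSIX layout (bin/python), as in the images used here, and the helper name venv_prefix is made up for this example.

```python
import os
import subprocess
import tempfile
import venv


def venv_prefix(env_dir):
    """Create a venv in env_dir and ask its interpreter for sys.prefix."""
    # with_pip=False keeps this quick; symlinks=True mirrors the CLI default on Linux.
    venv.create(env_dir, with_pip=False, symlinks=True)
    python = os.path.join(env_dir, "bin", "python")  # POSIX layout, as in the image
    result = subprocess.run(
        [python, "-c", "import sys; print(sys.prefix)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, "venv")
    # No activate script involved: the interpreter locates pyvenv.cfg on its own.
    print(os.path.realpath(venv_prefix(env_dir)) == os.path.realpath(env_dir))
```

If sys.prefix points at the venv directory, packages installed into that venv are importable, which is all the CMD line needs.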

The script requires details about the cluster and a token for authentication, which we will provide (this is common in containers) using environment variables (short: envvars) that are then read in the code using the Python os functions:

import os
from okera import context

ctx = context()
ctx.enable_token_auth(token_str=os.environ['TOKEN'])
with ctx.connect(host=os.environ['PLANNER_HOST'], port=int(os.environ['PLANNER_PORT'])) as conn:
    all_datasets = []
    for db in conn.list_databases():
        # extend() flattens the per-database lists so len() counts datasets, not databases
        all_datasets.extend(conn.list_dataset_names(db))
    print("Number of datasets:", len(all_datasets))

The PyOkera documentation provides more information about the available metadata-related API calls. After building the image again, setting the envvars as appropriate for your cluster and the token from the WebUI (see Copying your Access Token), running the container shows the desired output:

$ docker build -t pyokera-docker:0.2-small .
...
$ export TOKEN=ey....
$ export PLANNER_HOST=okera.foobar.com
$ export PLANNER_PORT=12050
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.2-small
Number of datasets: 16
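If one of the envvars is not set, the script above fails with a bare KeyError. A slightly friendlier variant, sketched here, validates all three variables up front. The Settings class and the load_settings function are hypothetical helpers for this example and are not part of PyOkera.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    token: str
    host: str
    port: int


def load_settings(env=os.environ):
    """Read TOKEN, PLANNER_HOST, and PLANNER_PORT; fail with one clear message."""
    required = ("TOKEN", "PLANNER_HOST", "PLANNER_PORT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    try:
        port = int(env["PLANNER_PORT"])
    except ValueError:
        raise SystemExit(f"PLANNER_PORT must be an integer, got {env['PLANNER_PORT']!r}")
    return Settings(token=env["TOKEN"], host=env["PLANNER_HOST"], port=port)
```

In the report script, settings = load_settings() could then supply settings.token to enable_token_auth() and settings.host and settings.port to connect().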

Analyze the Data

Finally, let's use this setup to write a script that does some actual number crunching, that is, use the data API functions of PyOkera to read datasets and perform some analytical processing. This typically requires adding some processing frameworks into the mix, here Pandas and NumPy, which we can easily add to the requirements.txt file:

pyokera
pandas
numpy

Note: We are still not pinning these libraries to specific versions, but you may want to do that. Otherwise, each image build pulls in the latest versions, which might not be what you want.
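One possible way to produce pinned entries is to read the versions that are installed in a known-good environment. The sketch below does this with importlib.metadata from the standard library; the function name pin is made up for this example, and running pip freeze inside the venv is the more common way to get the same information.

```python
from importlib import metadata


def pin(names, lookup=metadata.version):
    """Return 'name==version' lines for the given distribution names."""
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment: leave it unpinned.
            lines.append(name)
    return lines


if __name__ == "__main__":
    print("\n".join(pin(["pyokera", "pandas", "numpy"])))
```

The printed lines can be pasted into requirements.txt so that later builds install the same versions.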

Do not be surprised when checking the image size after adding these two powerful libraries, as they contain a lot of functionality:

$ docker build -t pyokera-docker:0.3 .
...
$ docker images
REPOSITORY           TAG       IMAGE ID       CREATED             SIZE
pyokera-docker       0.3       6fc504838438   29 seconds ago      216MB
...

The PyOkera scanning API provides a dedicated function, scan_as_pandas(), that reads a protected dataset and returns it as a Pandas DataFrame instance. It requires the Pandas library to be installed, which we ensured by adding it to the requirements.txt file used during the image build.

Keeping it simple for the sake of this tutorial, let's analyze the Okera-supplied example dataset called okera_sample.users. The following amended script, named /app/analyze.py, contains a simple set of function calls that prints the data frame, the output of its info() function, and then the number of rows for each value in the gender column:

import os
import pandas
from okera import context

ctx = context()
ctx.enable_token_auth(token_str=os.environ['TOKEN'])
with ctx.connect(host=os.environ['PLANNER_HOST'], port=int(os.environ['PLANNER_PORT'])) as conn:
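    # Name the result df so it is not confused with the common "pd" alias for pandas
    df = conn.scan_as_pandas('okera_sample.users')
    print(df)
    # info() prints its summary itself and returns None, hence the "None" line in the output
    print(df.info())
    print(df['gender'].value_counts())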

Note: These scripts are stored in the src directory, which resides next to the Dockerfile. The COPY statement in the Dockerfile copies them into the /app/ directory inside the image when the build runs.

You need to modify the Dockerfile so that its CMD line now calls the newly created script:

...
CMD ["/venv/bin/python", "/app/analyze.py"]

Building the container image and executing it yields this surprising output:

$ docker build -t pyokera-docker:0.4 .
...
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.4
Traceback (most recent call last):
  File "/app/analyze.py", line 2, in <module>
    import pandas
  File "/venv/lib/python3.10/site-packages/pandas/__init__.py", line 16, in <module>
    raise ImportError(
ImportError: Unable to import required dependencies:
numpy:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.10 from "/venv/bin/python"
  * The NumPy version is: "1.22.4"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: Error relocating /venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so: backtrace: symbol not found
...

This means that we are out of luck with our Alpine Linux-based image, which had worked well so far and allowed us to call the PyOkera metadata API functions. The error indicates that Alpine Linux, while very small in terms of footprint, is not fully binary-compatible with the libraries installed in the first build stage: Alpine uses the musl C library rather than glibc, and the backtrace symbol that the NumPy extension expects is not provided by musl. Unfortunately, we have to abandon this setup method.

Note: With enough tenacity you could build the libraries from source during the image building process. But there are other shortcomings of Alpine Linux in the context of Python and analyzing data with Pandas, that make switching the base image inevitable.
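Whether a binary wheel fits a given base image can often be read from its filename: platform tags containing manylinux target glibc-based distributions, while musllinux tags target musl-based ones such as Alpine. The following sketch classifies wheel filenames along those lines. The function name wheel_libc is made up, and the example.whl filenames are purely illustrative.

```python
def wheel_libc(filename):
    """Classify a wheel filename by the C library its platform tag targets."""
    # The platform tag is the last dash-separated field before ".whl".
    platform_tag = filename[: -len(".whl")].split("-")[-1]
    if "musllinux" in platform_tag:
        return "musl"
    if "manylinux" in platform_tag:
        return "glibc"
    if platform_tag == "any":
        return "pure-python"
    return "other"


# Illustrative filenames following the wheel naming convention.
print(wheel_libc("numpy-1.22.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"))
print(wheel_libc("example-1.0-cp310-cp310-musllinux_1_1_x86_64.whl"))
print(wheel_libc("example-1.0-py3-none-any.whl"))
```

A wheel classified as glibc that is installed in a Debian-based build stage and then copied into an Alpine-based stage is a likely source of relocation errors like the one above.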

With little effort, we can modify the Dockerfile to use the Python 3.10.5-slim image in the second stage of the image build. This Python image lacks the GCC compiler suite and other tools required to build binary executables or libraries, but since we do not need them in the final image, switching to the slim version is fine. More importantly, this base image is compatible with the dynamically linked libraries for NumPy, which avoids the error above. The image template file now looks like this:

FROM python:3.10.5 AS builder

COPY requirements.txt /requirements.txt
RUN python3 -m venv /venv && \
    /venv/bin/pip install --disable-pip-version-check -r /requirements.txt

FROM python:3.10.5-slim AS runner

COPY --from=builder /venv /venv
COPY ./src /app

CMD ["/venv/bin/python", "/app/analyze.py"]

Building it shows its size as just under 300 MB, which is still much better than the nearly 1 GB of the full Python image:

$ docker build -t pyokera-docker:0.5 .
...
$ docker images
REPOSITORY           TAG       IMAGE ID       CREATED          SIZE
pyokera-docker       0.5       8184f048a714   3 minutes ago    291MB

Running this image in a container completes as expected now:

$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.5
                                           uid  ...                     ccn
0      b'0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D'  ...  b'3771-2680-8616-9487'
1      b'00071AA7-86D2-4EB9-871A-A786D27EB9BA'  ...  b'4539-9934-1924-5730'
2      b'00071B7D-31AF-4D85-871B-7D31AFFD852E'  ...  b'5580-7529-3663-6698'
3      b'0007967E-F188-4598-9C7C-E64390482CFB'  ...  b'6011-0440-7310-9221'
4      b'000B90B2-92DC-4A7A-8B90-B292DC9A7A71'  ...  b'4532-7129-7160-3161'
...                                        ...  ...                     ...
38450  b'FFF69D73-CE85-4BD6-B10F-9F9F25CD7A74'  ...  b'5231-1531-1716-3779'
38451  b'FFF9E6CB-D3A2-455F-B5CF-6B8EC4E80ABE'  ...  b'3488-6066-8349-1302'
38452  b'FFFB1C5E-37B6-453A-83FB-86C580D18AE8'  ...  b'4539-9265-2937-2473'
38453                                  b'NULL'  ...  b'5137-0001-0716-5201'
38454                                  b'null'  ...  b'3447-3737-8526-2455'

[38455 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38455 entries, 0 to 38454
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   uid     38455 non-null  object
 1   dob     38455 non-null  object
 2   gender  38455 non-null  object
 3   ccn     38455 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB
None
b'M'    15660
b'F'    14463
b'U'     8238
b' '       94
Name: gender, dtype: int64
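The b'...' prefixes in this output show that the column values arrive as Python bytes objects, and that blanks as well as NULL markers in different casings appear among the values. A minimal sketch of how such values could be normalized before counting is shown below. The helper name count_text_values and the sample list are made up for illustration, and UTF-8 is assumed as the encoding.

```python
from collections import Counter


def count_text_values(values, encoding="utf-8"):
    """Decode bytes values to str, strip whitespace, and count occurrences."""
    counts = Counter()
    for value in values:
        text = value.decode(encoding) if isinstance(value, bytes) else str(value)
        text = text.strip()
        # Treat empty strings and NULL markers (any casing) as one "missing" bucket.
        key = "<missing>" if text == "" or text.upper() == "NULL" else text
        counts[key] += 1
    return counts


sample = [b"M", b"F", b"M", b" ", b"U", b"NULL", b"null"]  # made-up values
print(count_text_values(sample).most_common())
```

With a DataFrame, a similar effect can be achieved by decoding a column with its .str.decode('utf-8') accessor before calling value_counts().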

But as you may have expected, the slim image is not free of vulnerabilities:

$ docker scan pyokera-docker:0.5
...
Package manager:   deb
Project name:      docker-image|pyokera-docker
Docker image:      pyokera-docker:0.5
Platform:          linux/amd64
Base image:        python:3.10.5-slim-bullseye

Tested 106 dependencies for known vulnerabilities, found 46 vulnerabilities.

Base Image                   Vulnerabilities  Severity
python:3.10.5-slim-bullseye  46               2 critical, 0 high, 0 medium, 44 low

Recommendations for base image upgrade:

Alternative image types
Base Image                   Vulnerabilities  Severity
python:3.11.0b1-slim-buster  81               0 critical, 1 high, 0 medium, 80 low
python:3.11-rc-slim-buster   81               0 critical, 1 high, 0 medium, 80 low

For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp

These vulnerabilities would need to be discussed with your company's information security team. In fact, often the engineering team is already providing sanctioned base images that the scanners know to accept without (too much) complaining. Another option is to start with another bare minimum base Linux image and then add Python and all its required dependencies on your own - although there is no easy quick fix with this solution.
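When discussing such results, it can help to turn the scanner's summary line into numbers that a build job can act on. The sketch below parses a severity summary in the format shown above and decides whether any blocking severities were reported. The function names and the choice of critical and high as blocking levels are assumptions for this example, not a policy recommended by Okera or Snyk.

```python
import re

SEVERITY = re.compile(r"(\d+) (critical|high|medium|low)")


def severity_counts(summary):
    """Turn '2 critical, 0 high, 0 medium, 44 low' into a dict of ints."""
    return {level: int(count) for count, level in SEVERITY.findall(summary)}


def gate(summary, blocking=("critical", "high")):
    """Return True when none of the blocking severities were reported."""
    counts = severity_counts(summary)
    return all(counts.get(level, 0) == 0 for level in blocking)


print(gate("2 critical, 0 high, 0 medium, 44 low"))   # summary of the slim image above
print(gate("0 critical, 0 high, 0 medium, 80 low"))   # made-up summary for comparison
```

The first call returns False because of the two critical findings, and the second returns True.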

This concludes this tutorial, which showed you how to use multi-stage container image builds to reduce complexity and containerize PyOkera. Using cron-like scheduling in Kubernetes, or a scheduling engine based on a container runtime, allows you to run your script fully automated without giving up clean control over the script's dependencies.