Tutorial: Running PyOkera in a Container
This self-service tutorial guides you through a more advanced task: Installing and using PyOkera in a container (using Docker).
Time needed: 4 hours
There are many options for accessing a running Okera cluster, including the Python native library provided by Okera, called PyOkera.
Since Python is a dynamic runtime environment, it is not easily possible, for example, to build a monolithic, statically linked executable file (see Executable and Linkable Format for details) that can run in a minimal container, such as provided by the Docker scratch image.
Instead, it is common in Python to install all required libraries first, and then run a script in that context.
Installing libraries in Python is not necessarily lightweight, as some require a C or C++ compiler to be present, usually provided by the operating system-specific packages containing the GNU Compiler Collection (GCC).
For PyOkera, this is also the case because dependent libraries (such as JWT) and reference libraries such as Python Cryptography are dynamically linked by default to the Linux C++ system libraries (libstdc++) during installation.
The difficulty when using PyOkera is then building an image that is not only as small as possible, but also has as little functionality of its own as possible, which avoids unnecessary findings when it is scanned for vulnerabilities.
Fixing a large base image provided by someone else is much more difficult than using specific images trimmed for size and functionality, such as Alpine Linux.
This tutorial introduces you to the necessary resources provided by Okera and how to use them in a minimal container image.
Docker Image Basics
The most common approach to packaging an application, or in this case a Python script, is to use a container image, made popular by Docker.
A special file, called Dockerfile, defines what the container image should contain, allowing you to copy files into the virtual file system and run scripts to install any necessary dependencies.
Anything you pull in is added to the files already present in the base image, so you need to choose one that contains the necessary libraries and applications or install them as part of the image you are building.
FROM python:3.10.5
COPY src /app
RUN pip install --disable-pip-version-check pyokera
CMD ["python", "/app/run.py"]
Note: This tutorial assumes you are in your OS shell or terminal and also in the directory where the Dockerfile is located. In addition, you will need to install the docker command-line tool if it is not already installed.
Let's build the image using the docker command-line tool and then run the image in a container:
$ docker build -t pyokera-docker:0.1-large .
[+] Building 288.2s (9/9) FINISHED
 => [internal] load build definition from Dockerfile 0.1s
 => => transferring dockerfile: 162B 0.0s
 => [internal] load .dockerignore 0.0s
 => => transferring context: 2B 0.0s
 => [internal] load metadata for docker.io/library/python:3.10.5 225.4s
 => [auth] library/python:pull token for registry-1.docker.io 0.0s
 => [1/3] FROM docker.io/library/python:3.10.5@sha256:b7bfea0126f...83d12c848942fa 0.0s
 => [internal] load build context 0.0s
 => => transferring context: 83B 0.0s
 => CACHED [2/3] COPY src /app 0.0s
 => [3/3] RUN pip install --disable-pip-version-check pyokera 59.5s
 => exporting to image 2.9s
 => => exporting layers 2.7s
 => => writing image sha256:c32bf38d5...b9d59b3caddf245d8998 0.0s
 => => naming to docker.io/library/pyokera-docker:0.1-large
$ docker run --rm pyokera-docker:0.1-large
2.9.0
The src/run.py file that is copied into /app/run.py inside the container contains this simple check that PyOkera is available and displays its version:
import okera.odas
print(okera.odas.version())
You can see that running the image executes the Python script and prints 2.9.0 as the PyOkera version (which was the latest PyOkera version available when this example was run).
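If you want the check to report a missing library instead of ending in a traceback, a slightly more defensive variant could look like the following sketch. The helper name describe_module is hypothetical, and the distribution name pyokera is assumed from the pip install command used in this tutorial; the standard-library module json is only included so the sketch prints something useful on a machine without PyOkera.

```python
import importlib
import importlib.metadata


def describe_module(module_name, dist_name=None):
    """Return a one-line status for a module that may or may not be importable."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # Raised for a missing package and also for a missing shared library.
        return f"{module_name}: NOT available ({exc})"
    try:
        version = importlib.metadata.version(dist_name or module_name)
    except importlib.metadata.PackageNotFoundError:
        # The module is importable but was not installed as a distribution.
        version = "unknown version"
    return f"{module_name}: available ({version})"


# "pyokera" is assumed to be the distribution name that pip installs.
print(describe_module("okera.odas", "pyokera"))
print(describe_module("json"))
```

Because ImportError is also raised when a required shared library cannot be loaded, the same helper reports that kind of problem as a single line as well.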
Next, let's look at the size of this image:
$ docker images
REPOSITORY       TAG         IMAGE ID       CREATED          SIZE
pyokera-docker   0.1-large   c32bf38d5dac   10 minutes ago   966MB
...
A whopping 966MB, just to use a simple Python script! But what is worse is that it contains so much unnecessary code, increasing the risk of vulnerabilities found by scanning the image using a common scanner:
$ docker scan pyokera-docker:0.1-large

Testing pyokera-docker:0.1-large...

✗ Low severity vulnerability found in wget
  Description: Open Redirect
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-WGET-1277610
  Introduced through: firstname.lastname@example.org+deb11u1
  From: email@example.com+deb11u1

...

✗ Critical severity vulnerability found in aom/libaom0
  Description: Release of Invalid Pointer or Reference
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1290331
  Introduced through: imagemagick@8:126.96.36.199+dfsg-1.3
  From: imagemagick@8:188.8.131.52+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:184.108.40.206+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:220.127.116.11+dfsg-1.3 > firstname.lastname@example.org > email@example.com

✗ Critical severity vulnerability found in aom/libaom0
  Description: Use After Free
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1298721
  Introduced through: imagemagick@8:18.104.22.168+dfsg-1.3
  From: imagemagick@8:22.214.171.124+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:126.96.36.199+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:188.8.131.52+dfsg-1.3 > firstname.lastname@example.org > email@example.com

✗ Critical severity vulnerability found in aom/libaom0
  Description: Buffer Overflow
  Info: https://snyk.io/vuln/SNYK-DEBIAN11-AOM-1300249
  Introduced through: imagemagick@8:184.108.40.206+dfsg-1.3
  From: imagemagick@8:220.127.116.11+dfsg-1.3 > imagemagick/imagemagick-6.q16@8:18.104.22.168+dfsg-1.3 > imagemagick/libmagickcore-6.q16-6@8:22.214.171.124+dfsg-1.3 > firstname.lastname@example.org > email@example.com

Package manager:   deb
Project name:      docker-image|pyokera-docker
Docker image:      pyokera-docker:0.1-large
Platform:          linux/amd64
Base image:        python:3.10.5-bullseye

Tested 427 dependencies for known vulnerabilities, found 262 vulnerabilities.

Base Image              Vulnerabilities  Severity
python:3.10.5-bullseye  262              5 critical, 28 high, 6 medium, 223 low

Recommendations for base image upgrade:

Alternative image types
Base Image                     Vulnerabilities  Severity
python:3.11.0b1-slim-buster    81               0 critical, 1 high, 0 medium, 80 low
python:3.11-rc-slim-buster     81               0 critical, 1 high, 0 medium, 80 low
python:3.11.0b1-slim-bullseye  46               2 critical, 0 high, 0 medium, 44 low
python:3.10-slim               46               2 critical, 0 high, 0 medium, 44 low

For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp
Even the recommended rc (for release candidate) or beta images of Python have high or critical vulnerabilities associated with them.
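To compare base images by more than a glance, the severity summaries can be turned into numbers. The following is a minimal sketch that parses summary strings copied from the scan output above; the function name parse_severities is hypothetical, and the format of the summary line is assumed to stay as shown.

```python
import re

SEVERITY_RE = re.compile(r"(\d+)\s+(critical|high|medium|low)")


def parse_severities(summary):
    """Turn '5 critical, 28 high, ...' into a dict of integer counts."""
    return {level: int(count) for count, level in SEVERITY_RE.findall(summary)}


# Summary strings copied from the scan output above.
scans = {
    "python:3.10.5-bullseye": "5 critical, 28 high, 6 medium, 223 low",
    "python:3.11.0b1-slim-buster": "0 critical, 1 high, 0 medium, 80 low",
    "python:3.10-slim": "2 critical, 0 high, 0 medium, 44 low",
}

for image, summary in scans.items():
    counts = parse_severities(summary)
    serious = counts["critical"] + counts["high"]
    print(f"{image}: {sum(counts.values())} total, {serious} critical or high")
```

The totals computed this way match the Vulnerabilities column of the report (262, 81, and 46), which is a quick sanity check that the summary was parsed completely.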
Build a Better Image
As shown, choosing a large image with anything you need or installing everything yourself as part of the image building process adds to the complexity of the image and increases the likelihood of introducing possible security vulnerabilities.
There is a better approach, more reminiscent of development processes, since many of the included libraries and applications are only needed during the installation of the Python libraries. Once they are built, only dynamically linked libraries need to be present, but no compiler framework.
Akin to Maven artifacts that can be scoped for testing or development only, the Dockerfile syntax allows us to first install and build the dependent libraries, and then drop everything else.
To do this, we use a special Dockerfile feature, called a multi-stage build, where a second FROM line allows us to switch the base image while still allowing us to refer to the initial base image with anything added before the switch.
The image template file now looks like this:
FROM python:3.10.5 AS builder
RUN python3 -m venv /venv && \
    /venv/bin/pip install --disable-pip-version-check pyokera

FROM python:3.10.5-alpine3.16 AS runner
COPY --from=builder /venv /venv
COPY ./src /app
CMD ["sh", "/app/run.sh"]
The steps are as follows:
- Set the FROM to the heavy base image that contains the required binaries (such as the compiler).
- Create a Python virtual environment (venv), creating a clean installation location.
- Install PyOkera using the designated pip command-line tool of the venv.
- Switch base images to a much lighter option, here based on Alpine Linux, resetting the image's file system layers.
- Copy the compiled libraries from the venv of the previous build stage (using the --from option) into the fresh one.
- Also copy our application or scripts.
- Set a shell script as the default command when the image is run in a container.
The last line, the CMD option, now points to the following script, called run.sh:
#!/bin/sh
# "." is the POSIX spelling of "source" and works in every sh implementation.
. /venv/bin/activate
python /app/run.py
The only difference from our earlier build is that because we maintain the venv as a whole, we are also required to activate it so that the subsequent call to python is able to find all libraries, including PyOkera.
Building and running the new image is as easy as before and yields this outcome:
$ docker build -t pyokera-docker:0.1-small .
[+] Building 0.2s (11/11) FINISHED
...
$ docker run --rm pyokera-docker:0.1-small
Traceback (most recent call last):
  File "/app/run.py", line 1, in <module>
    import okera.odas
  File "/venv/lib/python3.10/site-packages/okera/__init__.py", line 45, in <module>
    from okera.odas import context, version
  File "/venv/lib/python3.10/site-packages/okera/odas.py", line 15, in <module>
    import jwt
  File "/venv/lib/python3.10/site-packages/jwt/__init__.py", line 17, in <module>
    from .jwa import std_hash_by_alg
  File "/venv/lib/python3.10/site-packages/jwt/jwa.py", line 26, in <module>
    from cryptography.hazmat.primitives.asymmetric import padding
  File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/padding.py", line 13, in <module>
    from cryptography.hazmat.primitives.asymmetric import rsa
  File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/rsa.py", line 12, in <module>
    from cryptography.hazmat.primitives.asymmetric import (
  File "/venv/lib/python3.10/site-packages/cryptography/hazmat/primitives/asymmetric/utils.py", line 6, in <module>
    from cryptography.hazmat.bindings._rust import asn1
ImportError: Error loading shared library libgcc_s.so.1: No such file or directory (needed by /venv/lib/python3.10/site-packages/cryptography/hazmat/bindings/_rust.abi3.so)
What is missing are the dynamically linked libraries, here libgcc_s.so (and, though not shown here, the glibc compatibility layer provided by the gcompat package).
They can be easily added by inserting the following RUN statement as the second-to-last line in the Dockerfile:
...
RUN apk add --no-cache libstdc++ gcompat
CMD ["sh", "/app/run.sh"]
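When an image fails in this way, the name of the missing library is buried at the end of a long traceback. As a small aid, the following sketch extracts the library and the module that needs it from such a message. The helper name missing_library is hypothetical, and the pattern is only assumed to match messages worded like the one shown above.

```python
import re

# Matches loader messages worded like the one in the traceback above.
MISSING_LIB_RE = re.compile(
    r"Error loading shared library (?P<library>\S+?): "
    r"No such file or directory \(needed by (?P<needed_by>[^)]+)\)"
)


def missing_library(message):
    """Return (library, needed_by) for a matching message, else None."""
    match = MISSING_LIB_RE.search(message)
    if match is None:
        return None
    return match.group("library"), match.group("needed_by")


# Message copied from the ImportError shown above.
message = (
    "Error loading shared library libgcc_s.so.1: No such file or directory "
    "(needed by /venv/lib/python3.10/site-packages/cryptography/hazmat/"
    "bindings/_rust.abi3.so)"
)

library, needed_by = missing_library(message)
print("missing:  ", library)
print("needed by:", needed_by)
```

Which operating system package provides a given library still has to be looked up for the distribution in use.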
After building it again, we can execute the container and get our result back:
$ docker run --rm pyokera-docker:0.1-small
2.9.0
But what is more important is the new size and security scan result:
$ docker images
REPOSITORY       TAG         IMAGE ID       CREATED              SIZE
pyokera-docker   0.1-small   c4c0e9fef845   About a minute ago   95.5MB
...
$ docker scan pyokera-docker:0.1-small

Testing pyokera-docker:0.1-small...

Package manager:   apk
Project name:      docker-image|pyokera-docker
Docker image:      pyokera-docker:0.1-small
Platform:          linux/amd64
Base image:        python:3.10.5-alpine3.16

✔ Tested 42 dependencies for known vulnerabilities, no vulnerable paths found.

According to our scan, you are currently using the most secure version of the selected base image

For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp
No known vulnerabilities were found while scanning the new image! The size of the newly created container image is largely determined by its base image. For comparison, the two base images used in the build have these sizes:
$ docker images
REPOSITORY   TAG                 IMAGE ID       CREATED      SIZE
...
python       3.10.5-alpine3.16   27edb73bd1fc   8 days ago   47.6MB
python       3.10.5              6bb8bdb609b6   8 days ago   920MB
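To see how much of each image comes from the base image and how much was added on top, the sizes reported by docker images can be subtracted. This is a rough sketch that assumes the SIZE column uses decimal units (1MB = 1,000,000 bytes); the helper name parse_size is hypothetical, and the figures are copied from the listings in this tutorial.

```python
from decimal import Decimal

# Assumption: the SIZE column of "docker images" uses decimal units.
UNITS = {"kB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text):
    """Convert a size string such as '95.5MB' into a whole number of bytes."""
    for unit, factor in UNITS.items():
        if text.endswith(unit):
            # Decimal avoids floating-point rounding for values like 47.6.
            return int(Decimal(text[: -len(unit)]) * factor)
    raise ValueError(f"unrecognized size: {text!r}")


large, large_base = parse_size("966MB"), parse_size("920MB")
small, small_base = parse_size("95.5MB"), parse_size("47.6MB")

print("added on top of python:3.10.5:           ", (large - large_base) / 10**6, "MB")
print("added on top of python:3.10.5-alpine3.16:", (small - small_base) / 10**6, "MB")
print(f"size ratio large/small: {large / small:.1f}")
```

Both images carry a similar amount of additional content (the virtual environment or installed packages plus the scripts), so nearly all of the saving comes from the smaller base image.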
Bottom line: we now have a small, clean container image we can use to add only what is needed for the script to run.
Let's extend these tutorial steps by moving the dependencies into a requirements.txt file, which for the sake of simplicity looks like this:
pyokera
It only lists PyOkera, but you can add anything else required by your Python scripts.
Another change we'll implement is to use the linked Python executable from the venv environment directly, removing the extra step of wrapping the script invocation in a shell script.
So the next iteration of the Dockerfile now looks like this:
FROM python:3.10.5 AS builder
COPY requirements.txt /requirements.txt
RUN python3 -m venv /venv && \
    /venv/bin/pip install --disable-pip-version-check -r /requirements.txt

FROM python:3.10.5-alpine3.16 AS runner
COPY --from=builder /venv /venv
COPY ./src /app
RUN apk add --no-cache libstdc++ gcompat
CMD ["/venv/bin/python", "/app/report.py"]
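Calling the interpreter inside the venv directly works without activation because Python looks for a pyvenv.cfg file next to (or one level above) its executable and, when it finds one, uses that environment. The following sketch illustrates this outside of Docker by creating a throwaway environment with the standard-library venv module and asking its interpreter for sys.prefix; the helper names and the temporary location are purely illustrative.

```python
import os
import subprocess
import tempfile
import venv
from pathlib import Path


def venv_python(venv_dir):
    """Return the interpreter path inside a virtual environment."""
    if os.name == "nt":
        return Path(venv_dir) / "Scripts" / "python.exe"
    return Path(venv_dir) / "bin" / "python"


def reported_prefix(venv_dir):
    """Ask the venv's interpreter, without activating it, for its sys.prefix."""
    result = subprocess.run(
        [str(venv_python(venv_dir)), "-c", "import sys; print(sys.prefix)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "venv"
    # with_pip=False keeps this sketch fast and offline.
    venv.create(env_dir, with_pip=False)
    prefix = reported_prefix(env_dir)
    same = os.path.realpath(prefix) == os.path.realpath(env_dir)
    print("interpreter:", venv_python(env_dir))
    print("sys.prefix: ", prefix)
    print("prefix is the venv directory:", same)
```

Since the interpreter resolves its environment on its own, packages installed into the venv are found without the PATH changes that the activate script makes.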
Note that we also changed the name of the Python script to /app/report.py so we can use it to generate a metadata report when the image is run in a container.
The script requires details about the cluster and a token for authentication, which we will provide (as is common in containers) using environment variables (short: envvars) that are then read in the code using the Python os.environ mapping:
import os
from okera import context

ctx = context()
ctx.enable_token_auth(token_str=os.environ['TOKEN'])
with ctx.connect(host=os.environ['PLANNER_HOST'],
                 port=int(os.environ['PLANNER_PORT'])) as conn:
    # Collect the dataset names of all databases in one flat list.
    all_datasets = []
    for db in conn.list_databases():
        # extend() adds the individual names; append() would add one nested list per database.
        all_datasets.extend(conn.list_dataset_names(db))
    print("Number of datasets:", len(all_datasets))
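Reading os.environ directly raises a bare KeyError when a variable is missing, which is not very helpful in container logs. A small validation step could look like the following sketch; the names PlannerSettings and load_settings are hypothetical and not part of PyOkera, and the example values are the placeholders used in this tutorial.

```python
import os
from typing import Mapping, NamedTuple

REQUIRED = ("TOKEN", "PLANNER_HOST", "PLANNER_PORT")


class PlannerSettings(NamedTuple):
    token: str
    host: str
    port: int


def load_settings(environ: Mapping[str, str] = os.environ) -> PlannerSettings:
    """Validate the connection settings and return them in typed form."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    try:
        port = int(environ["PLANNER_PORT"])
    except ValueError:
        raise ValueError("PLANNER_PORT must be an integer") from None
    return PlannerSettings(environ["TOKEN"], environ["PLANNER_HOST"], port)


# Placeholder values; inside the container, call load_settings() with no
# argument so that it reads the real environment.
example = {"TOKEN": "ey....", "PLANNER_HOST": "okera.foobar.com", "PLANNER_PORT": "12050"}
settings = load_settings(example)
print(settings.host, settings.port)
```

The validated values can then be passed to enable_token_auth() and connect() in place of the direct os.environ lookups.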
The PyOkera documentation provides more information about the available metadata-related API calls.
After building the image again, setting the envvars as appropriate for your cluster and the token from the WebUI (see Copying your Access Token), running the container shows the desired output:
$ docker build -t pyokera-docker:0.2-small .
...
$ export TOKEN=ey....
$ export PLANNER_HOST=okera.foobar.com
$ export PLANNER_PORT=12050
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.2-small
Number of datasets: 16
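When the container is started from a scheduler script rather than by hand, the command line can be assembled programmatically. The sketch below builds the argument list only and does not call Docker. It relies on the docker run behavior that -e NAME without a value passes the variable through from the calling environment, which keeps the token out of the command line itself; the function name docker_run_command is hypothetical.

```python
import shlex


def docker_run_command(image, env_names):
    """Build the argument list for a one-off container run."""
    argv = ["docker", "run", "--rm"]
    for name in env_names:
        # "-e NAME" (no value) forwards NAME from the caller's environment.
        argv += ["-e", name]
    argv.append(image)
    return argv


argv = docker_run_command(
    "pyokera-docker:0.2-small",
    ["TOKEN", "PLANNER_HOST", "PLANNER_PORT"],
)
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) on a machine where the docker command-line tool is installed and the three variables are exported.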
Analyze the Data
Finally, let's use this setup to write a script that does some actual number crunching, that is, use the data API functions of PyOkera to read datasets and perform some analytical processing.
This typically requires adding some processing frameworks into the mix, here Pandas and NumPy, which we can easily add to the requirements.txt file:
pyokera
pandas
numpy
Note: We are still not pinning these libraries to a specific version, but you may want to do that. Otherwise, each image build pulls in the latest version, which might not be what you want.
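A quick way to keep track of this is to check the requirements file for entries without an exact version. The following is a minimal sketch that only recognizes the == operator and ignores comments and pip options; the function name unpinned and the version numbers in the example are illustrative only.

```python
def unpinned(requirements_text):
    """Return the requirement lines that are not pinned with '=='."""
    names = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as "-r other.txt"
        if "==" not in line:
            names.append(line)
    return names


# Illustrative content; the version numbers are examples only.
example = """\
pyokera
pandas
numpy==1.22.4  # pinned
"""

print(unpinned(example))
```

Running such a check as part of the image build job makes it visible when a dependency can change between two builds of the same Dockerfile.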
Do not be surprised when checking the image size after adding these two powerful libraries, as they contain a lot of functionality:
$ docker build -t pyokera-docker:0.3 .
...
$ docker images
REPOSITORY       TAG   IMAGE ID       CREATED          SIZE
pyokera-docker   0.3   6fc504838438   29 seconds ago   216MB
...
The PyOkera Scanning API has a dedicated function, scan_as_pandas(), that reads a protected dataset.
It returns a Pandas DataFrame instance and requires that the Pandas libraries be installed, which we ensured by updating the requirements.txt file that is used during the image building process.
Keeping it simple for the sake of this tutorial, let's analyze the Okera-supplied example dataset called okera_sample.users.
The following amended script, named /app/analyze.py, contains a simple set of function calls that prints the data frame, the output of its info() function, and then groups and counts the dataset by the gender column:
import os
import pandas
from okera import context

ctx = context()
ctx.enable_token_auth(token_str=os.environ['TOKEN'])
with ctx.connect(host=os.environ['PLANNER_HOST'],
                 port=int(os.environ['PLANNER_PORT'])) as conn:
    # scan_as_pandas() returns the dataset as a Pandas DataFrame.
    df = conn.scan_as_pandas('okera_sample.users')
    print(df)
    # info() prints its summary itself and returns None, so "None" is printed as well.
    print(df.info())
    print(df['gender'].value_counts())
Note: These scripts are stored in the src directory, which resides next to where the Dockerfile is placed. The COPY statement in the image specification file copies them into the /app/ directory inside the image when the building process runs.
You need to modify the Dockerfile so that its CMD line now calls the newly created script:
...
CMD ["/venv/bin/python", "/app/analyze.py"]
Building the container image and executing it yields this surprise output:
$ docker build -t pyokera-docker:0.4 .
...
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.4
Traceback (most recent call last):
  File "/app/analyze.py", line 2, in <module>
    import pandas
  File "/venv/lib/python3.10/site-packages/pandas/__init__.py", line 16, in <module>
    raise ImportError(
ImportError: Unable to import required dependencies:
numpy:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.10 from "/venv/bin/python"
  * The NumPy version is: "1.22.4"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: Error relocating /venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so: backtrace: symbol not found
...
This means that we are out of luck with our Alpine Linux-based base image, which worked well so far and allowed us to call the PyOkera metadata API functions. Alpine Linux keeps its footprint very small in part by using the musl C library instead of the GNU C library (glibc) that most other distributions ship, and it forfeits certain low-level functionality as a result. The NumPy binaries installed in the builder stage were built against glibc and rely on functions such as backtrace, which the Alpine-based image does not provide even with the compatibility layer installed. Unfortunately, we have to abandon this setup method.
Note: With enough tenacity you could build the libraries from source during the image building process. But there are other shortcomings of Alpine Linux in the context of Python and analyzing data with Pandas, which make switching the base image inevitable.
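To find out which C library a given Python image is built against, the standard library offers platform.libc_ver(). The sketch below wraps it in a hypothetical helper; note that libc_ver() is geared towards glibc and, as far as we are aware, returns empty strings on musl-based systems such as Alpine Linux, so an empty result is treated here as "not glibc" rather than as a positive identification of musl.

```python
import platform


def libc_description():
    """Describe the C library the running interpreter is linked against."""
    name, version = platform.libc_ver()
    if name:
        return f"{name} {version}".strip()
    # libc_ver() could not identify the library (assumed to be the case
    # on musl-based systems and on non-Linux platforms).
    return "unknown (not identified as glibc)"


print("C library:", libc_description())
```

Running this one-off with docker run against each candidate base image is a quick way to tell whether binary packages built for glibc have a chance of loading.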
Without much effort, let's modify the Dockerfile to use the Python 3.10.5-slim image in the second phase of the image building process. This Python image is missing the GCC compiler suite and other tools required to build binary executables or libraries, but since we do not need them in the final image, the switch to the slim version is fine.
More importantly, this base image is compatible with the dynamically linked libraries for NumPy, avoiding the error above.
The image template file now looks like this:
FROM python:3.10.5 AS builder
COPY requirements.txt /requirements.txt
RUN python3 -m venv /venv && \
    /venv/bin/pip install --disable-pip-version-check -r /requirements.txt

FROM python:3.10.5-slim AS runner
COPY --from=builder /venv /venv
COPY ./src /app
CMD ["/venv/bin/python", "/app/analyze.py"]
Building it shows its size as just under 300 MB, which is still better compared to the nearly 1 GB the full Python image required.
$ docker build -t pyokera-docker:0.5 .
...
$ docker images
REPOSITORY       TAG   IMAGE ID       CREATED         SIZE
pyokera-docker   0.5   8184f048a714   3 minutes ago   291MB
Running this image in a container completes as expected now:
$ docker run --rm -e TOKEN=$TOKEN -e PLANNER_HOST=$PLANNER_HOST -e PLANNER_PORT=$PLANNER_PORT pyokera-docker:0.5
                                           uid  ...                     ccn
0      b'0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D'  ...  b'3771-2680-8616-9487'
1      b'00071AA7-86D2-4EB9-871A-A786D27EB9BA'  ...  b'4539-9934-1924-5730'
2      b'00071B7D-31AF-4D85-871B-7D31AFFD852E'  ...  b'5580-7529-3663-6698'
3      b'0007967E-F188-4598-9C7C-E64390482CFB'  ...  b'6011-0440-7310-9221'
4      b'000B90B2-92DC-4A7A-8B90-B292DC9A7A71'  ...  b'4532-7129-7160-3161'
...                                        ...  ...                     ...
38450  b'FFF69D73-CE85-4BD6-B10F-9F9F25CD7A74'  ...  b'5231-1531-1716-3779'
38451  b'FFF9E6CB-D3A2-455F-B5CF-6B8EC4E80ABE'  ...  b'3488-6066-8349-1302'
38452  b'FFFB1C5E-37B6-453A-83FB-86C580D18AE8'  ...  b'4539-9265-2937-2473'
38453                                  b'NULL'  ...  b'5137-0001-0716-5201'
38454                                  b'null'  ...  b'3447-3737-8526-2455'

[38455 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38455 entries, 0 to 38454
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   uid     38455 non-null  object
 1   dob     38455 non-null  object
 2   gender  38455 non-null  object
 3   ccn     38455 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB
None
b'M'    15660
b'F'    14463
b'U'     8238
b' '       94
Name: gender, dtype: int64
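Note that the values arrive as Python bytes objects (b'M', b'F', and so on), and that one group consists of a single blank. Before reporting on such a column you would typically decode and normalize the values. The following sketch does this in plain Python, using the counts from the output above as a stand-in for the real column so that it runs without a cluster; the normalize helper and the "(blank)" label are illustrative choices.

```python
from collections import Counter

# Counts copied from the value_counts() output above.
raw_counts = {b"M": 15660, b"F": 14463, b"U": 8238, b" ": 94}


def normalize(value):
    """Decode a bytes value and give blank entries a readable label."""
    text = value.decode("utf-8").strip()
    return text or "(blank)"


counts = Counter()
for key, number in raw_counts.items():
    counts[normalize(key)] += number

total = sum(counts.values())
for label, number in counts.most_common():
    print(f"{label:8s} {number:6d} {number / total:6.1%}")
print("total rows:", total)
```

The total adds up to the 38455 rows reported for the data frame, which confirms that no rows are lost by the normalization.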
But as you may have expected, the slim image is not free of vulnerabilities:
$ docker scan pyokera-docker:0.5
...
Package manager:   deb
Project name:      docker-image|pyokera-docker
Docker image:      pyokera-docker:0.5
Platform:          linux/amd64
Base image:        python:3.10.5-slim-bullseye

Tested 106 dependencies for known vulnerabilities, found 46 vulnerabilities.

Base Image                   Vulnerabilities  Severity
python:3.10.5-slim-bullseye  46               2 critical, 0 high, 0 medium, 44 low

Recommendations for base image upgrade:

Alternative image types
Base Image                   Vulnerabilities  Severity
python:3.11.0b1-slim-buster  81               0 critical, 1 high, 0 medium, 80 low
python:3.11-rc-slim-buster   81               0 critical, 1 high, 0 medium, 80 low

For more free scans that keep your images secure, sign up to Snyk at https://dockr.ly/3ePqVcp
These vulnerabilities would need to be discussed with your company's information security team. In fact, often the engineering team is already providing sanctioned base images that the scanners know to accept without (too much) complaining. Another option is to start with another bare minimum base Linux image and then add Python and all its required dependencies on your own - although there is no easy quick fix with this solution.
This concludes this tutorial, which showed you how to employ multi-stage container image builds that can mitigate complexity and allow you to containerize PyOkera. Using cron-like scheduling in Kubernetes, or a container-runtime-based scheduling engine, allows you to run your script fully automated, without compromising on the ability to control the script's dependencies in a clean way.