Reducing Docker Image Sizes with Multi-Stage Builds and Distroless

Imagine you are a Data Engineer at a large company with multiple deployments per day. You’re using Docker images to containerize your ETL jobs, which consume data from an external API and load it into your data warehouse.

You’ve noticed that your CI/CD pipeline takes around 15 minutes to complete, as each deployment requires building, testing, and deploying these Docker images. Since these images are deployed frequently, you’re looking to improve your pipeline’s efficiency and reduce execution time.

One solution is to optimize your Docker images by reducing their size, which is the focus of this article. Let’s dive into how to reduce image sizes using Docker multi-stage builds and Distroless images.

The Docker Image

Let’s take this Docker image as an example. Imagine using it to run your ETL process: it connects to an external API, transforms the data using Pandas, and stores the resulting DataFrame in a table in your data warehouse, in this case DuckDB. While this example is an ETL job, many Python projects share a similar structure.
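For reference, the dependencies for such a job might look like the requirements.txt below. The packages follow from the description above, but the exact version pins are illustrative, not taken from a real project:

```
pandas==1.5.3
duckdb==0.9.2
requests==2.31.0
```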

Dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Copy project
COPY src/ ./src

# Set the entrypoint
ENTRYPOINT ["python", "src/main.py"]

In this Dockerfile, we have straightforward code that installs all required libraries for the project and runs the main script. In the first line, the FROM statement defines python:3.9-slim as the base distribution, which includes the resources needed to run Python and install the project’s required libraries. We then install all dependencies, copy the project files, and finally define the ENTRYPOINT command, which will execute when we run the Docker image.

Although we’re using the slim version of the Python image, there are additional optimizations we can apply to reduce the Docker image size. Currently, this Docker image uses a single-stage build, meaning both the build and runtime environments are defined in the same stage. Using Docker’s multi-stage builds, we can separate the build environment from the runtime environment.

Multi-Stage build

A standard Docker image uses a single base distribution for the entire process. For example, FROM python:3.9 provides all the resources needed to build an application, including tools like pip and a shell. However, these resources aren’t all required to run the application, and they add extra weight to the final image.

Docker multi-stage builds allow you to use multiple FROM statements in your Dockerfile. This means you can create intermediate images used to build your application and then copy only the necessary artifacts into a final image (a smaller image, as we’ll see later).

The Dockerfile below applies this concept by dividing the image into two stages. Stage 1 builds the image, and Stage 2 is used to run the application.

Dockerfile
# Stage 1: Builder
FROM python:3.9-slim AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt

COPY src/ ./src

# Stage 2: Final Image
FROM python:3.9-slim

COPY --from=builder /install /usr/local/lib/python3.9/site-packages/
COPY --from=builder /app/src /app/src

WORKDIR /app

ENTRYPOINT ["python3", "src/main.py"]

In Stage 1, from lines 1 to 9, the code is similar to the previous image. In line 2, we use the FROM statement to define the python:3.9-slim image and add an alias, builder, which will serve as a reference for the second stage.

The second stage begins at line 12, where we use the FROM statement again to define the base distribution. In line 14, we use the COPY statement with the --from=builder argument, which allows us to copy files from the previous stage. Here, we copy the libraries stored in the /install folder to the Python packages folder /usr/local/lib/python3.9/site-packages/ of the current stage. Similarly, in line 15, we copy the Python application files from the builder stage to the current stage.
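This copy works because a pip --target directory has the same flat layout as site-packages: Python can import anything from a directory on its import path. Here is a minimal sketch of that mechanism, using a made-up package (mypkg is hypothetical, created on the fly purely for illustration):

```python
import os
import sys
import tempfile

# Simulate a "pip install --target" style directory containing one
# hypothetical package (mypkg does not exist on PyPI; it is illustrative).
target = tempfile.mkdtemp()
pkg_dir = os.path.join(target, "mypkg")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VERSION = '1.0'\n")

# Putting that directory on sys.path makes its packages importable,
# which is what copying /install into site-packages achieves in the image.
sys.path.insert(0, target)
import mypkg

print(mypkg.VERSION)  # 1.0
```

In the image itself, no sys.path manipulation is needed, because /install is copied directly into the interpreter’s default site-packages directory.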

Finally, from lines 17 to 19, as in the previous image, we define the working directory and the ENTRYPOINT command.

By using a multi-stage build, we can separate the build environment from the runtime environment. This approach allows us to use a larger image with all necessary build tools for compiling the application, and then a second minimal image for running it.

This setup also enables the next technique, distroless images, for an even more optimized final image.

Distroless images

A typical base image, python:3.9-slim in our case, includes a package manager, a shell, and other extra resources that we don’t need to run our application.

Distroless images include only your application and the essential files it needs to run, nothing extra like package managers or shells that you’d find in a typical Linux distribution. This minimal setup makes the container more secure and efficient. So the two main advantages of Distroless images are:

  • Image size: Distroless images are minimal and often much smaller than traditional base images like Debian or Alpine. This reduces storage requirements and speeds up deployments, especially in CI/CD pipelines where images are frequently downloaded.
  • Security: Distroless images reduce the attack surface, because they include only essential files and dependencies.

Combining both concepts

Let’s combine both concepts to reduce the size of our Docker images. With a multi-stage build, we split our application into two stages; with distroless, the final image is much smaller than python:3.9-slim and contains nothing beyond the minimum necessary to run. The image below applies both concepts:

Dockerfile
# Stage 1: Builder
FROM python:3.9-slim AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt

COPY src/ ./src

# Stage 2: Final Image
FROM gcr.io/distroless/python3-debian11

COPY --from=builder /install /usr/local/lib/python3.9/site-packages/
COPY --from=builder /app/src /app/src

WORKDIR /app

ENV PYTHONPATH=/usr/local/lib/python3.9/site-packages

ENTRYPOINT ["python3", "src/main.py"]

In the second stage, at line 12, we use the Distroless image gcr.io/distroless/python3-debian11, which is much smaller than python:3.9-slim and does not contain any extra libraries. This image is based on Debian and includes only the Python 3 interpreter.

In lines 14 and 15, we copy the files needed to run our application. Then, in line 19, we define the PYTHONPATH to point to the folder where our libraries are installed.
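The effect of that ENV instruction can be verified with plain Python: any directory listed in PYTHONPATH is added to the sys.path of an interpreter started with that variable set. A small sketch, run outside the container (the directory does not need to exist on the host for it to appear on the path):

```python
import os
import subprocess
import sys

# Start a child interpreter with PYTHONPATH set, as the ENV instruction
# does in the final image, and print its import search path.
env = dict(os.environ, PYTHONPATH="/usr/local/lib/python3.9/site-packages")
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env,
    capture_output=True,
    text=True,
).stdout

# The directory from PYTHONPATH now appears on the child's sys.path.
print("/usr/local/lib/python3.9/site-packages" in out)  # True
```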

This image takes advantage of both concepts and can significantly reduce our image size. Let’s run some experiments and evaluate the results of this optimization.

Experimentation

Instead of applying this optimization approach to just one Docker image, we experimented with five different Docker images across various programming languages: Go, JavaScript (Node.js), Python, Rust, and Java. We compared the image sizes before and after the optimizations, using a standard setup versus a setup that combines multi-stage builds with distroless images.

The Python image runs code for an ETL process, connecting to an external API, transforming data with Pandas, and loading it into DuckDB. The non-optimized version uses python:3.9-slim, while the optimized version uses gcr.io/distroless/python3-debian11.

The Node.js Docker image provides an API with a single endpoint built with Express. The non-optimized image uses the node:14-slim distribution, while the optimized image uses gcr.io/distroless/nodejs.

The Rust, Java, and Go images also each provide an API. The Rust image uses gcr.io/distroless/cc for the optimized version, while the non-optimized one uses rust:1.56-slim. For Java, we use maven:3.8.4-openjdk-11-slim as the non-optimized base and gcr.io/distroless/java for the optimized version. Finally, Go uses golang:1.17-buster and gcr.io/distroless/base.

Evaluating the results in different Docker images

After applying these optimizations, we can see that every Docker image got smaller when using distroless images. The graph below shows, in numbers, the size reduction for each Docker image.

Also, the table below ranks the Docker images from best to worst in terms of size reduction percentage.

| Docker Image | Size Before Optimization | Size After Optimization | Size Reduction (%) |
|---|---|---|---|
| Rust Dockerfile | 883.62 MB | 37.04 MB | 95.81% |
| Go Dockerfile | 764.66 MB | 37.68 MB | 95.07% |
| Java Dockerfile | 501.39 MB | 227.47 MB | 54.63% |
| Python Dockerfile | 411.65 MB | 245.84 MB | 40.28% |
| NodeJS Dockerfile | 174.45 MB | 160.74 MB | 7.86% |
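As a sanity check, the reduction percentages in the table follow directly from the before and after sizes:

```python
# Sizes in MB, taken from the table above: (before, after) for each image.
sizes = {
    "Rust": (883.62, 37.04),
    "Go": (764.66, 37.68),
    "Java": (501.39, 227.47),
    "Python": (411.65, 245.84),
    "NodeJS": (174.45, 160.74),
}

for name, (before, after) in sizes.items():
    reduction = (before - after) / before * 100
    print(f"{name}: {reduction:.2f}% smaller")  # e.g. "Rust: 95.81% smaller"
```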

Looking at the results above, compiled languages like Rust and Go showed the most significant size reductions. This can be explained by the fact that they don’t need an interpreter: unlike interpreted languages, which require a larger runtime environment (such as the Python interpreter, the Node.js runtime, or the JVM for Java), Rust and Go applications only need the compiled binary to run. When combined with distroless images, which contain only the essentials to run that binary, the overall image size is greatly reduced.

Drawbacks to Consider

The results show we can improve our Docker images, reducing the size by up to 95%. However, using Distroless images has some downsides we need to consider before using them. These images include only what’s needed to run your application, so some tools or libraries that certain applications expect might be missing. Also, Distroless images don’t have a shell, making it hard to debug issues directly in a running container. Additionally, using multi-stage builds is essential when working with Distroless images.

Multi-stage builds can make your Dockerfile more complex. Adding stages can lengthen the file, and you’ll need to manage multiple stages. Keeping dependencies compatible and consistent across stages adds extra work.

Conclusions

Our experiments show that we can reduce Docker image sizes by using distroless images combined with multi-stage builds. The improvements are even greater for compiled programming languages like Rust and Go: in our experiments, we reduced image size by up to 95.81%.

For engineers looking to improve CI/CD pipeline performance, these techniques provide a practical solution. Smaller images mean faster pipeline execution, saving time and resources on frequent deployments. This approach can make development workflows more efficient and scalable, leading to quicker deployments and better use of resources.
