PySpark - building a local Docker image with Jupyter Notebook
We can easily use Docker to build ourselves a PySpark image with JupyterLab installed. This is what our Dockerfile looks like:
# 1. Use a modern, slim Python image
FROM python:3.12-slim-bookworm
# 2. Install a JRE using Debian's default meta-package
# On Bookworm this pulls in OpenJDK 17, which satisfies PySpark 4.x's Java 17+ requirement.
RUN apt-get update && \
    apt-get install -y default-jre-headless && \
    rm -rf /var/lib/apt/lists/*
# 3. Install PySpark (which includes the necessary Spark binaries)
# Using a modern PySpark version (~=4.0.0)
RUN pip install pyspark~=4.0.0 jupyterlab
# 4. Set environment variables
# Note: JAVA_HOME is often automatically detected with the default package.
# Setting it explicitly here for robustness. We'll use the default symlink path.
ENV JAVA_HOME="/usr/lib/jvm/default-java"
ENV SPARK_HOME="/usr/local/lib/python3.12/site-packages/pyspark"
ENV PATH="$PATH:$SPARK_HOME/bin"
# 5. Set the working directory (where we will mount our code) and the default command for interactive notebook use
WORKDIR /app
EXPOSE 8888
CMD ["jupyter-lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
Then we build and run our image:
docker build -t my-modern-pyspark-dev .
docker run -it -p 8888:8888 -v "$(pwd)":/app --name modern_spark_session my-modern-pyspark-dev
(Use $(pwd) in bash/zsh; on PowerShell, use ${PWD} instead.) Once the container starts, look at its console output: Jupyter prints a URL containing an access token, and you need that token (or the full URL) to open JupyterLab in your browser. If you lose it, docker logs modern_spark_session will print it again. Then create a new notebook and paste the following code into it.
from pyspark.sql import SparkSession
# Get or create a SparkSession
spark = SparkSession.builder \
    .appName("PySparkVerification") \
    .getOrCreate()
# Create a small sample DataFrame to test functionality
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
columns = ["Name", "ID"]
df = spark.createDataFrame(data, columns)
# Display the data and confirm the Spark UI link
print("Successfully created Spark DataFrame:")
df.show()
print(f"Spark application name: {spark.sparkContext.appName}")
# If you're running Spark locally, the UI should be available at port 4040 by default.
print(f"Spark Web UI: {spark.sparkContext.uiWebUrl}")
# Keep the session running if you plan to execute the next cell,
# or stop it if you're finished.
# spark.stop()
Next, run the cell; you should see the sample DataFrame printed, along with the application name and the Spark Web UI URL.
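If you want a slightly stronger check than df.show(), the follow-up cell below is a small sketch that runs an aggregation and writes a CSV into the mounted /app volume (the output folder name verification_output is just an example), so the result also shows up in the host directory you mounted with -v:
from pyspark.sql import SparkSession, functions as F

# Reuse (or recreate) the same session and sample data as above
spark = SparkSession.builder.appName("PySparkVerification").getOrCreate()
df = spark.createDataFrame([("Alice", 1), ("Bob", 2), ("Charlie", 3)], ["Name", "ID"])

# A simple aggregation forces Spark to actually schedule and run a job
df.groupBy().agg(F.count("*").alias("row_count"), F.max("ID").alias("max_id")).show()

# Write into the mounted volume; the folder appears in the directory you mounted as /app
df.write.mode("overwrite").csv("/app/verification_output", header=True)

# Stop the session once you are done experimenting
spark.stop()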
The code is available here:
https://github.com/kepungnzai/pyspark-jupyter-dockerfile