pyspark - building a local docker image with jupyter notebook

We can easily use Docker to build an image with PySpark and JupyterLab installed. This is what our Dockerfile looks like:

# 1. Use a modern, slim Python image
FROM python:3.12-slim-bookworm

# 2. Install latest JRE using the default meta-package
# This ensures we get the latest Java version available in the standard Bookworm repos (often OpenJDK 17 or 21).
RUN apt-get update && \
    apt-get install -y default-jre-headless && \
    rm -rf /var/lib/apt/lists/*

# 3. Install PySpark (which includes the necessary Spark binaries)
# Using a modern PySpark version (~=4.0.0)
RUN pip install pyspark~=4.0.0 jupyterlab

# 4. Set environment variables
# Note: JAVA_HOME is often automatically detected with the default package.
# Setting it explicitly here for robustness. We'll use the default symlink path.
ENV JAVA_HOME="/usr/lib/jvm/default-java"
ENV SPARK_HOME="/usr/local/lib/python3.12/site-packages/pyspark"
ENV PATH="$PATH:$SPARK_HOME/bin"

# 5. Set the default command (for interactive notebook use)
# Work out of /app, which we will bind-mount from the host at run time,
# so notebooks created in JupyterLab end up in the mounted directory.
WORKDIR /app
EXPOSE 8888
CMD ["jupyter-lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

Then we need to build and run the image:

docker build -t my-modern-pyspark-dev .

docker run -it -p 8888:8888 -v ${PWD}:/app --name modern_spark_session my-modern-pyspark-dev

Here ${PWD} (the current directory) is mounted into the container at /app; this form works in both bash and PowerShell.
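
Because the container is created with a fixed name, running the same docker run command a second time will fail with a name conflict. To get back into the existing container, start and reattach to it, or remove it if you want to recreate it from scratch:

# reattach to the existing container
docker start -ai modern_spark_session

# or throw it away so the docker run command can be reused
docker rm -f modern_spark_session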


Now, check the container's console output: JupyterLab prints a login URL that includes an access token, and you will need that token to open the server in your browser.
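
If the console output has already scrolled past, you can print the login URL again with either of these commands (using the container name from the run command above):

docker logs modern_spark_session
docker exec modern_spark_session jupyter server list

Then create a new notebook and paste the following code into it.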

from pyspark.sql import SparkSession

# Get or create a SparkSession
spark = SparkSession.builder \
    .appName("PySparkVerification") \
    .getOrCreate()

# Create a small sample DataFrame to test functionality
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
columns = ["Name", "ID"]
df = spark.createDataFrame(data, columns)

# Display the data and confirm the Spark UI link
print("Successfully created Spark DataFrame:")
df.show()

print(f"Spark application name: {spark.sparkContext.appName}")
# If you're running Spark locally, the UI should be available at port 4040 by default.
print(f"Spark Web UI: {spark.sparkContext.uiWebUrl}")

# Keep the session running if you plan to execute the next cell,
# or stop it if you're finished.
# spark.stop()

Next, run the cell and you should see output similar to the following.
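
The output should look roughly like this (the host name in the Spark Web UI URL will be your container's ID):

Successfully created Spark DataFrame:
+-------+---+
|   Name| ID|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+

Spark application name: PySparkVerification
Spark Web UI: http://<container-id>:4040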



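If you also want to confirm that the bind mount works, a minimal follow-up cell (assuming the container was started with -v ${PWD}:/app as above, so /app maps to your host's current directory) could be:

# Write a small CSV under /app, which is bind-mounted from the host,
# then read it back; the files should also show up on your machine.
df.write.mode("overwrite").option("header", True).csv("/app/sample_output")
spark.read.option("header", True).csv("/app/sample_output").show()
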
The code is available here:

https://github.com/kepungnzai/pyspark-jupyter-dockerfile


