Databricks troubleshooting: finding the path of a file downloaded with dbutils.fs.cp
When downloading a file with the dbutils.fs.cp command, we sometimes do not have a clear idea of where the file is being placed. For example:
import os
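# NOTE: the value below ("databrick-my-store") is a relative path, not an absolute
# /Volumes/<catalog>/<schema>/<volume>/ path, so dbutils.fs resolves it against
# the DBFS root (dbfs:/).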
os.environ["UNITY_CATALOG_VOLUME_PATH"] = "databrick-my-store"
os.environ["DATASET_DOWNLOAD_URL"] = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
os.environ["DATASET_DOWNLOAD_FILENAME"] = "rows.csv"
dbutils.fs.cp(f"{os.environ.get('DATASET_DOWNLOAD_URL')}", f"{os.environ.get('UNITY_CATALOG_VOLUME_PATH')}/{os.environ.get('DATASET_DOWNLOAD_FILENAME')}")
Once the file is downloaded, we often try to read it with spark.read.csv, but for that we need to know its path. In this case:
df = spark.read.csv("dbfs:/databrick-my-store/rows.csv", header=True, inferSchema=True)
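A quick sanity check that the CSV was actually read (a minimal sketch, assuming this runs in a Databricks notebook where display() is available):
df.printSchema()
display(df.limit(5))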
We can quickly figure out what to pass to spark.read.csv by running the following command. Note that DBFS is not tied to any catalog in the Databricks web UI, so the file will not show up under a Unity Catalog volume, even though the environment variable here is named UNITY_CATALOG_VOLUME_PATH.
dbutils.fs.ls(f"{os.environ.get('UNITY_CATALOG_VOLUME_PATH')}")
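For comparison, if the intent had been to land the file in an actual Unity Catalog volume, the destination would normally be an absolute path of the form /Volumes/<catalog>/<schema>/<volume>/. A minimal sketch, assuming such a volume exists; the catalog, schema, and volume names below are placeholders, not objects from this workspace:
# Placeholder names: replace "main", "default", and "my_volume" with a real
# catalog, schema, and volume in your workspace.
os.environ["UNITY_CATALOG_VOLUME_PATH"] = "/Volumes/main/default/my_volume/"

dbutils.fs.cp(
    os.environ.get("DATASET_DOWNLOAD_URL"),
    f"{os.environ.get('UNITY_CATALOG_VOLUME_PATH')}{os.environ.get('DATASET_DOWNLOAD_FILENAME')}",
)

# A file copied this way does appear under its catalog in the web UI, and
# spark.read.csv can read it directly from the /Volumes/... path.
df = spark.read.csv(
    f"{os.environ.get('UNITY_CATALOG_VOLUME_PATH')}{os.environ.get('DATASET_DOWNLOAD_FILENAME')}",
    header=True,
    inferSchema=True,
)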