Databricks troubleshooting: finding the download file path when using dbutils.fs.cp

When downloading a file with the dbutils.fs.cp command, it is not always obvious where the file actually ends up. For example:


import os

# Destination directory, source URL, and target file name for the download.
os.environ["UNITY_CATALOG_VOLUME_PATH"] = "databrick-my-store"
os.environ["DATASET_DOWNLOAD_URL"] = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
os.environ["DATASET_DOWNLOAD_FILENAME"] = "rows.csv"

# Copy the remote CSV to the destination path. Note the destination has no
# scheme prefix, so it is not obvious where the file lands.
dbutils.fs.cp(
    os.environ["DATASET_DOWNLOAD_URL"],
    f"{os.environ['UNITY_CATALOG_VOLUME_PATH']}/{os.environ['DATASET_DOWNLOAD_FILENAME']}",
)
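Because the destination above carries no scheme prefix, it resolves relative to the DBFS root rather than to a Unity Catalog volume. A minimal plain-Python sketch of that resolution, runnable outside Databricks (the helper name and the resolution rule as written here are my own illustration, not a documented dbutils API):

```python
def resolve_dbfs_destination(dest: str) -> str:
    # If the first path segment carries a scheme (dbfs:/, file:/, s3://, etc.),
    # assume dbutils.fs.cp uses the path as-is; otherwise treat it as
    # relative to the DBFS root.
    if ":" in dest.split("/")[0]:
        return dest
    return "dbfs:/" + dest.lstrip("/")

print(resolve_dbfs_destination("databrick-my-store/rows.csv"))
# dbfs:/databrick-my-store/rows.csv
```

This is why the file from the example above turns up under dbfs:/databrick-my-store/ even though the variable is named UNITY_CATALOG_VOLUME_PATH.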


Once the file is downloaded, we typically try to read it with spark.read.csv, but for that we need to know the path to the file.

  df = spark.read.csv("dbfs:/databrick-my-store/rows.csv", header=True, inferSchema=True)


We can quickly figure out what to pass to spark.read.csv by listing the directory with the following command. Note that the DBFS root is not tied to any catalog shown in the Databricks web UI, which is why a scheme-less destination like the one above does not appear under your Unity Catalog volumes.


dbutils.fs.ls(os.environ["UNITY_CATALOG_VOLUME_PATH"])
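dbutils.fs.ls returns a list of FileInfo objects, and each object's path field is exactly the string spark.read.csv accepts. A small sketch of picking out that path, simulated outside Databricks (the namedtuple here is a stand-in for the real FileInfo, which also exposes fields such as size and modificationTime):

```python
from collections import namedtuple

# Stand-in for the FileInfo objects returned by dbutils.fs.ls.
FileInfo = namedtuple("FileInfo", ["path", "name"])

# Simulated listing of the directory used in the example above.
listing = [FileInfo("dbfs:/databrick-my-store/rows.csv", "rows.csv")]

# The full path to hand to spark.read.csv.
csv_path = next(f.path for f in listing if f.name == "rows.csv")
print(csv_path)
# dbfs:/databrick-my-store/rows.csv
```

On a real cluster the same expression works directly on the result of dbutils.fs.ls, so you never have to guess the dbfs:/ prefix by hand.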

