Databricks troubleshooting: finding the download file path when using dbutils.fs.cp

When downloading a file with the dbutils.fs.cp command, it is not always obvious where the file actually ends up. For example:


import os

# Destination directory, source URL, and target file name for the download.
os.environ["UNITY_CATALOG_VOLUME_PATH"] = "databrick-my-store"
os.environ["DATASET_DOWNLOAD_URL"] = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
os.environ["DATASET_DOWNLOAD_FILENAME"] = "rows.csv"

# Copy the remote CSV to the destination path. Note the destination has no
# scheme prefix, so it is not obvious where the file lands.
dbutils.fs.cp(
    os.environ["DATASET_DOWNLOAD_URL"],
    f"{os.environ['UNITY_CATALOG_VOLUME_PATH']}/{os.environ['DATASET_DOWNLOAD_FILENAME']}",
)
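Because the destination above carries no scheme prefix, it resolves relative to the DBFS root rather than to a Unity Catalog volume. A minimal plain-Python sketch of that resolution, runnable outside Databricks (the helper name and the resolution rule as written here are my own illustration, not a documented dbutils API):

```python
def resolve_dbfs_destination(dest: str) -> str:
    # If the first path segment carries a scheme (dbfs:/, file:/, s3://, etc.),
    # assume dbutils.fs.cp uses the path as-is; otherwise treat it as
    # relative to the DBFS root.
    if ":" in dest.split("/")[0]:
        return dest
    return "dbfs:/" + dest.lstrip("/")

print(resolve_dbfs_destination("databrick-my-store/rows.csv"))
# dbfs:/databrick-my-store/rows.csv
```

This is why the file from the example above turns up under dbfs:/databrick-my-store/ even though the variable is named UNITY_CATALOG_VOLUME_PATH.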


Once the file is downloaded, we typically try to read it with spark.read.csv, but for that we need to know the path to the file.

  df = spark.read.csv("dbfs:/databrick-my-store/rows.csv", header=True, inferSchema=True)


We can quickly figure out what to pass to spark.read.csv by listing the directory with the following command. Note that the DBFS root is not tied to any catalog shown in the Databricks web UI, which is why a scheme-less destination like the one above does not appear under your Unity Catalog volumes.


dbutils.fs.ls(os.environ["UNITY_CATALOG_VOLUME_PATH"])
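dbutils.fs.ls returns a list of FileInfo objects, and each object's path field is exactly the string spark.read.csv accepts. A small sketch of picking out that path, simulated outside Databricks (the namedtuple here is a stand-in for the real FileInfo, which also exposes fields such as size and modificationTime):

```python
from collections import namedtuple

# Stand-in for the FileInfo objects returned by dbutils.fs.ls.
FileInfo = namedtuple("FileInfo", ["path", "name"])

# Simulated listing of the directory used in the example above.
listing = [FileInfo("dbfs:/databrick-my-store/rows.csv", "rows.csv")]

# The full path to hand to spark.read.csv.
csv_path = next(f.path for f in listing if f.name == "rows.csv")
print(csv_path)
# dbfs:/databrick-my-store/rows.csv
```

On a real cluster the same expression works directly on the result of dbutils.fs.ls, so you never have to guess the dbfs:/ prefix by hand.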

