spark - pyspark reading from excel files


A common mistake is loading the wrong jar when reading Excel files: the Scala version in the package name must match the Scala version your Spark build uses. For a Spark build on Scala 2.11, use the `_2.11` artifact, not `_2.12`.



You can start pyspark with the package from the command line:


pyspark --packages com.crealytics:spark-excel_2.11:0.11.1



Then use the following code to load an Excel file from a `data` folder. If that folder does not exist yet, create it and place an Excel file named `excel.xlsx` in it.


## No Python import is needed for spark-excel; it is a JVM package,
## and passing "com.crealytics.spark.excel" to spark.read.format is enough.

## To run this script with spark-submit instead of the pyspark shell:
## spark-submit --packages com.crealytics:spark-excel_2.11:0.11.1 excel_email_datapipeline.py

## Or start an interactive shell with the package loaded:
## pyspark --packages com.crealytics:spark-excel_2.11:0.11.1

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("excel-email-pipeline").getOrCreate()

df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "true")    # treat the first row as column names
    .option("inferSchema", "true")  # infer column types from the data
    .load("data/excel.xlsx")
)

df.show()





