
Spark DataFrame: Writing into Files


This blog post explores how to write a Spark DataFrame to various file formats, saving data to external storage for further analysis or sharing.

Before diving into this blog, have a look at my other blog posts that discuss creating and manipulating DataFrames, along with writing a DataFrame into tables and views.

Dataset:

Below is the dataset that we will use to explore writing a DataFrame into a file.

[Screenshot: sample dataset]

Writing Spark DataFrame to File:

CSV Format:

Below is the syntax to write a Spark DataFrame into a CSV file.

df.write.csv("output_path")

Let's go over writing the DataFrame to a file using examples and scenarios.

Example:

The snapshot below shows a sample of writing a DataFrame into a file.

[Screenshot: DataFrame written to file, output path listing]

After writing the DataFrame into the path, the files in the path are displayed. The part files shown are the ones that hold the data. Spark writes one part file per partition of the DataFrame, so here each row ended up in its own file. We can repartition the DataFrame to produce a single file instead.

DataFrame Repartition:

[Screenshot: DataFrame written to CSV after repartitioning, output path listing]
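As a minimal sketch (the dbfs:/FileStore/df_write/ path is illustrative, matching the examples that follow), the repartition step looks like this:

// Collapse the DataFrame into a single partition so only one part file is written
df.repartition(1).write.mode("overwrite").csv("dbfs:/FileStore/df_write/")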

After repartitioning, all the part files are combined into a single file. We also notice other files besides the part files; we can prevent them from being created by using the Spark configurations below. These files are created regardless of the output file format, not just for CSV.

Removing _committed and _started Files:

We can use the below Spark configuration, which prevents creation of the files starting with _committed and _started.

spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

[Screenshot: output path listing after setting the commit protocol configuration]

Removing _SUCCESS File:

We can use the below Spark configuration to stop the _SUCCESS file from being generated.

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

[Screenshot: output path listing after setting marksuccessfuljobs to false]

Data in the File:

With all the additional files removed, we can see the data that was loaded into the file. Notice that by default Spark does not write a header into the files; we can change this behavior using option/options. Let's look at the options available when writing a DataFrame into a file.

[Screenshot: output from the file]

Header Option:

[Screenshot: DataFrame write with the header option]
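A minimal sketch of the header option (output path illustrative):

// Write the column names as the first row of each CSV file
df.write.option("header", "true").mode("overwrite").csv("dbfs:/FileStore/df_write/")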

By adding the header option, we observe that the header is populated in the file. Similarly, we have an option to change the delimiter.

Delimiter Option:

[Screenshot: DataFrame write with the delimiter option]

We can change the delimiter to our desired format by adding the delimiter option, or alternatively the sep option (syntax provided below).

df.write.option("header","true").option("sep","|").mode("overwrite").csv("dbfs:/FileStore/df_write/")

nullValue Option:

From the previous output, we can see that the capital for Tonga is null in the DataFrame, though in the CSV it was written as an empty value. We can have it retained as null by using the nullValue option.

[Screenshot: DataFrame write with the nullValue option]
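A minimal sketch, assuming we want null values written out as the literal string null:

// nullValue controls the string written for nulls (empty string by default)
df.write.option("header", "true").option("nullValue", "null").mode("overwrite").csv("dbfs:/FileStore/df_write/")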

With this option, we observe that null is retained.

emptyValue Option:

In some scenarios we may need to populate null for empty values; in that case, we can use the below option.

[Screenshot: DataFrame write with the emptyValue option]
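A minimal sketch, assuming empty strings should also be written out as the literal string null:

// emptyValue controls the string written for empty values
df.write.option("header", "true").option("emptyValue", "null").mode("overwrite").csv("dbfs:/FileStore/df_write/")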

From the output above, we observe that Denmark previously had an empty value populated for its capital, but it is now being populated with null.

ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace Options:

If we need to retain the spaces before or after the value in a column, we can use the below options.

[Screenshot: DataFrame write with the ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options]
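A minimal sketch; both options default to true on write, so setting them to false retains the surrounding whitespace:

// Keep leading and trailing spaces in column values instead of trimming them
df.write.option("header", "true").option("ignoreLeadingWhiteSpace", "false").option("ignoreTrailingWhiteSpace", "false").mode("overwrite").csv("dbfs:/FileStore/df_write/")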

Different Way to Use Multiple Options:

If we have to use the same set of options for multiple files, we can keep all the options for the file format in a common variable and reuse it whenever needed.

[Screenshot: DataFrame write with multiple options]

We have created a variable writeOptions of Map type, which stores the options, and we can use it whenever we need that set of output options.
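A minimal sketch in Scala (the option values shown are illustrative):

// Common set of write options kept in a Map for reuse across multiple writes
val writeOptions = Map("header" -> "true", "sep" -> "|", "nullValue" -> "null")
df.write.options(writeOptions).mode("overwrite").csv("dbfs:/FileStore/df_write/")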

JSON Format:

We can use the below syntax and format to write into a JSON file from the DataFrame.

[Screenshot: DataFrame write to JSON]
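In code form, a minimal sketch (path illustrative):

df.write.mode("overwrite").json("dbfs:/FileStore/df_write/")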

Other Formats:

ORC Format:

Below is the syntax for writing the DataFrame in ORC Format:

df.write.mode("overwrite").orc("dbfs:/FileStore/df_write/")

Parquet Format:

Below is the syntax for writing the DataFrame in Parquet Format:

df.write.mode("overwrite").parquet("dbfs:/FileStore/df_write/")

Similar to the above, there are several more formats and examples, along with their syntaxes, which you can reference from the links below.

In this blog post, we covered the basics of writing Spark DataFrame into different file formats. Depending on your specific requirements and use cases, you can choose the appropriate file format and configuration options to optimize performance and compatibility.

