Uncle Data

A curated newsletter bridging the worlds of data science and leadership. Delve into topics like data modeling, the journey of data analysts, and the intricacies of data engineering. Subscribe for in-depth insights and expert discussions.


PySpark: Avoiding the Explode Method

In this post, I’ll share my experience with the Spark function explode and one case where I’m glad I avoided it and built a faster approach for a particular use case.

Disclaimer: I’m not saying there is always a way to avoid explode and the in-memory blow-up of the dataset that comes with it. But I have a feeling that something like 99% of use cases can be figured out and done properly without the explode method (I know, I know… going with a gut feeling is not that professional, but I haven’t yet encountered a case where I couldn’t think of an alternative to explode).

So what is this Spark function explode?

The screenshot is taken from https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

Basically, explode creates multiple rows of almost identical information, with one array column split so that each row gets a single value.
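As a minimal sketch of that (the toy data and column names are my own, not from the docs):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

# One row per id, with an array column holding several values.
df = spark.createDataFrame(
    [(1, ["a", "b", "c"]), (2, ["d"])],
    ["id", "letters"],
)

# explode repeats every other column for each element of the array.
df.select("id", F.explode("letters").alias("letter")).show()
# +---+------+
# | id|letter|
# +---+------+
# |  1|     a|
# |  1|     b|
# |  1|     c|
# |  2|     d|
# +---+------+
```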

If you’re working with mostly structured data, you probably won’t think of using this method. But some time ago I was doing some processing on NoSQL-type documents. My task was to extract particular IDs from a doubly nested document (i.e., one column was an array of JSON objects, each with further nested information inside in a similar manner…). So that this doesn’t become a rant about how much I hate dealing with these kinds of structures, I’ll just move along and get to the point of my task.

Basically, I needed to filter the first array on a particular column and do additional filtering on the underlying JSON there… Not having worked with such a nested structure before, I thought: it’s JSON, I’ll just call the field directly… and then it hit me that I had no clue which array member would hold the information I needed. I googled a bit and found explode. The performance was poor (the dataset was huge), so I had to fine-tune it. The next part is exactly what I did and how I got rid of the explode method.
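To give a rough idea of that explode-based first pass on a doubly nested column (the events, payload and ref_id names and the tiny inline sample are illustrative stand-ins, not my real schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("double-explode-sketch").getOrCreate()

# Illustrative doubly nested input: an outer array of JSON objects,
# each carrying its own inner array.
df = spark.read.json(spark.sparkContext.parallelize([
    '{"id": 1, "events": [{"type": "order", "payload": [{"ref_id": "A-1"}, {"ref_id": "A-2"}]}]}',
    '{"id": 2, "events": [{"type": "refund", "payload": [{"ref_id": "B-9"}]}]}',
]))

ids = (
    df.select("id", F.explode("events").alias("event"))           # one row per outer array member
      .where(F.col("event.type") == "order")                      # filter on the outer level
      .select("id", F.explode("event.payload").alias("payload"))  # one row per inner array member
      .select("id", F.col("payload.ref_id").alias("ref_id"))      # finally, the IDs we need
)
```

Each explode multiplies the row count before any filtering can shrink it, which is exactly where the poor performance on a huge dataset comes from.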

Let’s do a hypothetical example. For this, I’ve created a mock JSON file using this link. The structure I defined is a very simplified variation of my actual task (BTW: the id field is the unique identifier):
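Here is a hypothetical stand-in with the same overall shape (an id as the unique identifier plus a doubly nested array-of-JSON column; the events/payload/ref_id names are made up), together with a sketch of one common explode-free pattern, the higher-order SQL functions filter, transform and flatten available since Spark 2.4. Take it as the general idea rather than a line-for-line copy of my final solution:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-json-no-explode").getOrCreate()

# Hypothetical mock data: "id" is the unique identifier, "events" is an array
# of JSON objects, each holding another nested array ("payload").
mock_rows = [
    '{"id": 1, "events": [{"type": "order", "payload": [{"ref_id": "A-1"}, {"ref_id": "A-2"}]},'
    ' {"type": "refund", "payload": [{"ref_id": "B-9"}]}]}',
    '{"id": 2, "events": [{"type": "order", "payload": [{"ref_id": "C-3"}]}]}',
]
df = spark.read.json(spark.sparkContext.parallelize(mock_rows))
df.printSchema()
# schema: events: array<struct<payload: array<struct<ref_id: string>>, type: string>>, id: long

# Explode-free extraction with higher-order SQL functions (Spark 2.4+):
# filter the outer array, map each inner array to its IDs, flatten the result.
# Everything stays within a single row per id, so the row count never blows up.
result = df.select(
    "id",
    F.expr("""
        flatten(
            transform(
                filter(events, e -> e.type = 'order'),
                e -> transform(e.payload, p -> p.ref_id)
            )
        )
    """).alias("order_ref_ids"),
)
result.show(truncate=False)
# +---+-------------+
# |id |order_ref_ids|
# +---+-------------+
# |1  |[A-1, A-2]   |
# |2  |[C-3]        |
# +---+-------------+
```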



Written by Tomas Peluritis

Professional Data Wizard— Data Engineering/DWH/ETL/BI/Data Science.
