Keeping Pandas DataFrames clean when importing JSON

Photo by Samuel Zeller on Unsplash

This post originally appeared on my personal website. You can view an updated version with code formatting on my blog.

At my job, I do a lot of data analysis to find the most promising young startups. As a first step, you always have to import the desired data into a Pandas DataFrame and do some preprocessing, for example by importing JSON data from some API.

When importing JSON data, you usually have a lot of temporary columns in your DataFrame which can get messy quickly. To deal with temporary columns, I built a custom Context Manager that keeps track of all imported columns and deletes them when you’re done. This way, your code stays lean and you don’t have to remove temporary columns yourself.

In this short article, I will show you how. You can find the code on GitHub.

As an example, I will use the actual code I use for importing data from the API of our CRM named Hubspot. What I retrieve is a list of companies stored as a list of Python dictionaries. To import a list of dictionaries (data) into a Pandas DataFrame you basically do:

The json_normalize function generates a clean DataFrame based on the given list of dictionaries, the data parameter, and normalizes the hierarchy so you get clean column names. This is especially useful for nested dictionaries.

Ugly: Keeping imported columns

The problem with json_normalize is that you usually only want a subset of the imported columns, mostly with different names or some kind of pre-processing, too. So you might be tempted to do something like this:

This works, but keeps all the imported columns inplace and might take a lot of storage. So what can you do?

Ugly: Dropping columns manually

So after importing, you want to get rid of all temporary columns from the import. To do this, you have to either select the columns you want or drop all columns you don’t want. In both cases, you have to somehow keep track of the temporary columns or the ones you want to keep. To deal with this, one solution would be to prefix temporary columns and delete them afterwards:

Afterwards, you would then select all desired columns or drop all undesired columns.

While this works, it feels bloated and inefficient. You have to prefix all the value names in the code which results in bloated column names. You also have to keep track of column names you want in the end or the used prefix in different places. Just imagine you have to change the prefix temp_ one day or make the code work with a different prefix.

Clean and easy: using a Context Manager

After having used the above methods for some time, it struck me that Python Context Managers might be a cleaner solution. You might know them from their most popular application with open() as file:. If not, please take a few minutes to read more about them. To make things short: They basically ensure that something, usually a cleanup, is executed in each exit scenario, whether it is a usual exit like a return or an exception. I thought I might use this to build a clean solution that keeps track and gets rid of temporary columns. So I built a Context Manager that deals with temporary columns when importing JSON data so I don't have to. You can basically use it like this:

The benefit: You don’t have to keep track anymore and the context manager handles the deletion of all temporary columns.

How it works

You can just copy and paste the following snippet to get going, I’ll explain how it works below:

When opening the context, __init__ and __enter__ get called. They create the DataFrame and remember all imported and thus temporary column names. When the context is exited, __exit__ makes sure to drop all previously created columns and leaves only the newly created columns behind.

Hope this helps you to create a clean pre-processing pipeline. Let me know what you think.

This post originally appeared on my blog.

Techie, Hacker, Millenial, founding a VC, @_firstmomentum. #venturecapital #deeptech #socialnetworkanalysis #automation #crawlers #ML #HR

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store