
Explaining the pandas data model and its advantages


Introduction

pandas enables you to choose between different types of arrays to represent the data of your dataframe. Historically, most dataframes were backed by NumPy arrays; pandas 2.0 introduced the option to use PyArrow arrays as the storage format. Additionally, an intermediate layer exists between these arrays and your dataframe: the Block and the BlockManager. We will take a look at how this layer orchestrates the different arrays, basically, what’s behind pd.DataFrame(), and try to answer the questions you might have about pandas internals.
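
As a minimal sketch of what that choice looks like in practice (assuming pandas 2.0 or later and an installed pyarrow), you can request PyArrow-backed columns via a dtype string:

import pandas as pd

# Default construction: the column is backed by a NumPy array.
df_numpy = pd.DataFrame({"a": [1, 2, 3]})
print(df_numpy.dtypes)  # a    int64

# The same data backed by a PyArrow array instead.
df_arrow = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")
print(df_arrow.dtypes)  # a    int64[pyarrow]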

The post will introduce some terminology necessary to understand how Copy-on-Write works, which I’ll write about next.

I am a member of the pandas core team working on the internals, among other things. I am currently working at Coiled where I am focusing on Dask.

pandas data structure

A dataframe is usually backed by some array, e.g., a NumPy array or pandas ExtensionArray. These arrays store the data of the dataframe. pandas adds an intermediate layer called Block and BlockManager that orchestrate these arrays to make operations as efficient as possible. This is one reason why methods operating on multiple columns can be very fast in pandas. Let’s look more into the details of these layers.

Arrays

The actual data of a dataframe can be stored in a set of NumPy arrays or pandas ExtensionArrays. This layer generally dispatches to the underlying implementation, e.g., it will utilize the NumPy API if the data is stored in NumPy arrays. pandas stores the data in these arrays and calls their methods without enriching the interface. You can read up on pandas ExtensionArrays in the pandas documentation.

The NumPy arrays backing a dataframe are normally two-dimensional, which offers a bunch of performance advantages that we will look at later. pandas ExtensionArrays are mostly one-dimensional data structures as of right now. This makes things more predictable but has some performance drawbacks in a specific set of operations.

ExtensionArrays enable dataframes that are backed by PyArrow arrays, among other dtypes.
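
As a small, hedged illustration of this layer (the exact wrapper class names differ between pandas versions):

import pandas as pd

ser = pd.Series([1, 2, 3])
# .array exposes the array backing the Series; for NumPy-backed data this
# is a thin ExtensionArray wrapper around the underlying ndarray.
print(type(ser.array))
print(type(ser.to_numpy()))  # the raw numpy.ndarray

# With a PyArrow dtype, the backing array is an ArrowExtensionArray.
ser_arrow = pd.Series([1, 2, 3], dtype="int64[pyarrow]")
print(type(ser_arrow.array))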

Blocks

A dataframe’s columns are represented by at least one array; normally, you’ll have a collection of arrays, since one array can only store one specific dtype. These arrays store your data but don’t have any information about which columns they represent. Every array in your dataframe is wrapped by a corresponding Block. Blocks add some additional information to these arrays, like the column locations represented by this Block. Blocks serve as a layer around the actual arrays that can be enriched with utility methods necessary for pandas operations.

When an actual operation is executed on a dataframe, the Block ensures that the method dispatches to the underlying array, e.g., if you call astype, it will make sure that this operation is called on the array.

This layer has no information about the other columns in the dataframe. It is a stand-alone object.
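
You can peek at these Blocks through the private _mgr attribute. This is internal API that can change between releases, so treat the following purely as an illustration:

import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

for block in demo._mgr.blocks:
    # Each Block knows its dtype, its shape, and which column
    # positions of the dataframe it represents (mgr_locs).
    print(block.dtype, block.shape, block.mgr_locs)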

BlockManager

As the name suggests, the BlockManager orchestrates all Blocks that are connected to one dataframe. It holds the Blocks themselves and information about your dataframe’s axes, e.g., column names and Index labels. Most importantly, it dispatches most operations to the actual Blocks:

df.replace(...)

The BlockManager ensures that replace is executed on every Block.
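
The BlockManager can be inspected in the same hedged way through the private _mgr attribute, reusing the demo frame from above:

print(type(demo._mgr))  # BlockManager
# The manager stores the axes; internally the column Index comes first.
print(demo._mgr.axes)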

What is a consolidated dataframe?

We are assuming here that the dataframes are backed by NumPy dtypes, i.e., that their data can be stored in two-dimensional arrays.

When a dataframe is constructed, pandas mostly ensures that there is only one Block per dtype:

import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.5, 2.5, 3.5],
        "c": [10, 11, 12],
        "d": [10.5, 11.5, 12.5],
    }
)

This dataframe has four columns but is represented by only two arrays: one array stores the two integer columns while the other stores the two float columns. This is a consolidated dataframe.
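
We can verify this with the same private-API peek as before (again, _mgr is internal and subject to change):

print(len(df._mgr.blocks))  # 2
# e.g. ['int64', 'float64'] (the order of the Blocks may vary)
print([str(block.dtype) for block in df._mgr.blocks])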

Now, let’s add a new column to this dataframe:

df["new"] = 100

This will have the same dtype as our existing columns “a” and “c”. There are now two potential ways of moving forward:

  1. Add the new column to the existing array that holds the integer columns.
  2. Create a new array that only stores the new column.

The first option would append the new column to the array that holds the integer columns. Since NumPy does not support growing an array in place, this means copying all of the existing data, which is a steep cost for adding a single column.

The second option simply adds a third array to our collection of arrays; no other work is necessary, so it is very cheap. We now have two Blocks that store integer data: the dataframe is no longer consolidated.
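
Peeking at the internals once more (same private-API caveat) confirms that the new column got its own Block:

print(len(df._mgr.blocks))  # 3: two integer Blocks and one float Block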

These differences don’t matter much as long as you only operate on a per-column basis, but they will impact performance as soon as an operation spans multiple columns. For example, performing any axis=1 operation will transpose the data of your dataframe. Transposing is generally zero-copy if performed on a dataframe backed by a single NumPy array; this is no longer true if every column is backed by a different array, which incurs a performance penalty.
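
A quick way to check the zero-copy claim is np.shares_memory; the exact behavior can vary with the pandas version and Copy-on-Write mode, so treat this as a sketch:

import numpy as np
import pandas as pd

homog = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})  # one int64 Block
# The transpose of a single-Block frame typically still references the
# same underlying buffer, i.e., no data was copied.
print(np.shares_memory(homog.to_numpy(), homog.T.to_numpy()))  # typically True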

Getting all integer columns of your dataframe as a single NumPy array will also require a copy:

df[["a", "c", "new"]].to_numpy()

This will create a copy since the integer columns live in two different Blocks and the result has to be stored in a single NumPy array. On a consolidated dataframe, the same operation returns a view, which makes it very cheap.
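
The same np.shares_memory check, applied to our running example, shows the copy (again a sketch; details vary across versions):

import numpy as np

# "a", "c", and "new" live in two different integer Blocks, so pandas
# has to allocate a fresh array to hold them contiguously.
subset = df[["a", "c", "new"]].to_numpy()
print(np.shares_memory(subset, df["a"].to_numpy()))  # False: a copy was made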

Previous pandas versions internally consolidated dataframes in certain methods, which caused unpredictable performance behavior; methods like reset_index triggered a completely unnecessary consolidation. These consolidations were mostly removed over the last couple of releases.

To summarize, a consolidated dataframe is generally better than an unconsolidated one, but the difference depends heavily on the type of operation you want to execute.

Conclusion

We took a brief look behind the scenes of a pandas dataframe. We learned what Blocks and BlockManagers are and how they orchestrate your dataframe. These terms will prove valuable when we look behind the scenes of Copy-on-Write.

Thank you for reading. Feel free to reach out to share your thoughts and feedback. Follow me on Medium to learn more about pandas and Dask.


Pandas Internals Explained was originally published in Better Programming on Medium.
