Explaining the pandas data model and its advantages
Introduction
pandas enables you to choose between different types of arrays to represent the data of your dataframe. Historically, most dataframes have been backed by NumPy arrays. pandas 2.0 introduced the option to use PyArrow arrays as a storage format. Additionally, an intermediate layer exists between these arrays and your dataframe: the Block and the BlockManager. We will take a look at how this layer orchestrates the different arrays, that is, what is behind pd.DataFrame(). Along the way, we will try to answer the most common questions about pandas internals.
The post will introduce some terminology necessary to understand how Copy-on-Write works, which I’ll write about next.
I am a member of the pandas core team working on the internals, among other things. I am currently working at Coiled where I am focusing on Dask.
pandas data structure
A dataframe is usually backed by some array, e.g., a NumPy array or pandas ExtensionArray. These arrays store the data of the dataframe. pandas adds an intermediate layer called Block and BlockManager that orchestrate these arrays to make operations as efficient as possible. This is one reason why methods operating on multiple columns can be very fast in pandas. Let’s look more into the details of these layers.
Arrays
The actual data of a dataframe can be stored in a set of NumPy arrays or pandas ExtensionArrays. This layer generally dispatches to the underlying implementation, e.g., it will utilize the NumPy API if the data is stored in NumPy arrays. pandas stores the data in these arrays and calls their methods without enriching the interface. The pandas documentation covers ExtensionArrays in more detail.
NumPy arrays are normally two-dimensional, offering a bunch of performance advantages that we will look at later. pandas ExtensionArrays are mostly one-dimensional data structures as of right now. This makes things more predictable but has some drawbacks when looking at performance in a specific set of operations.
ExtensionArrays enable dataframes that are backed by PyArrow arrays, among other dtypes.
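As a minimal sketch of the two storage options, the snippet below builds one Series backed by a NumPy array and one backed by a pandas ExtensionArray. It uses the nullable "Int64" dtype as the ExtensionArray example rather than PyArrow, purely so that it runs without an optional dependency; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Default construction from Python ints is backed by a NumPy array.
numpy_backed = pd.Series([1, 2, 3])
print(numpy_backed.dtype)
print(type(numpy_backed.to_numpy()))

# The nullable "Int64" dtype is backed by a pandas ExtensionArray,
# which lets an integer column hold missing values.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(isinstance(nullable.array, pd.api.extensions.ExtensionArray))
```

A PyArrow-backed column plugs into the same ExtensionArray interface, assuming PyArrow is installed.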
Blocks
A dataframe normally consists of columns represented by at least one array. Normally, you’ll have a collection of arrays since one array can only store one specific dtype. These arrays store your data but don’t have any information about which columns they represent. Every array from your dataframe is wrapped by one corresponding Block. Blocks add some additional information to these arrays, like the column locations represented by this Block. Blocks serve as a layer around the actual arrays that can be enriched with utility methods necessary for pandas operations.
When an actual operation is executed on a dataframe, the Block ensures that the method dispatches to the underlying array, e.g., if you call astype, it will make sure that this operation is called on the array.
This layer has no information about the other columns in the dataframe. It is a stand-alone object.
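To make this concrete, here is a hedged sketch that peeks at the Blocks of a small dataframe. It relies on df._mgr, .blocks, and .mgr_locs, which are private pandas internals and may change between versions; the output described assumes a NumPy-backed dataframe in pandas 2.x.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.5, 2.5, 3.5],
        "c": [10, 11, 12],
        "d": [10.5, 11.5, 12.5],
    }
)

# NOTE: _mgr, blocks and mgr_locs are private internals, not public API.
block_info = []
for block in df._mgr.blocks:
    # mgr_locs holds the column positions that this Block represents.
    locs = block.mgr_locs.as_array.tolist()
    block_info.append((str(block.dtype), locs, block.values.shape))
    print(block.dtype, locs, block.values.shape)
```

Each Block reports its dtype, the column positions it covers, and the shape of the wrapped array. Note that the two-dimensional array is stored as columns by rows, so two columns of three rows show up as shape (2, 3).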
BlockManager
As the name suggests, the BlockManager orchestrates all Blocks that are connected to one dataframe. It holds the Blocks itself and information about your dataframe’s axes, e.g., column names and Index labels. Most importantly, it dispatches most operations to the actual Blocks.
df.replace(...)
The BlockManager ensures that replace is executed on every Block.
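A small sketch of this dispatch, assuming pandas 2.x: the manager is reached through the private df._mgr attribute (an internal, not public API), and the replace call is the ordinary public method. The values being replaced are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

# NOTE: _mgr is a private attribute; inspect it for learning only.
manager = df._mgr
print(type(manager).__name__)
print(manager.nblocks)  # number of Blocks held by the manager
print(manager.axes)     # column labels first, then the row Index

# The public call is routed through the manager to every Block.
# Only the integer Block contains the value 1, so only "a" changes.
replaced = df.replace(1, 100)
print(replaced)
```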
What is a consolidated dataframe?
We are assuming that the dataframes are backed by NumPy dtypes, i.e., that their data can be stored as two-dimensional arrays.
When a dataframe is constructed, pandas mostly ensures there is only one Block per dtype.
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.5, 2.5, 3.5],
        "c": [10, 11, 12],
        "d": [10.5, 11.5, 12.5],
    }
)
This dataframe has four columns represented by two arrays: one of the arrays stores the integer columns while the other stores the float columns. This is a consolidated dataframe.
Now, let’s add a new column to this dataframe:
df["new"] = 100
The new column will have the same dtype as our existing columns “a” and “c”. There are now two potential ways of moving forward:
- Add the new column to the existing array that holds the integer columns
- Create a new array that only stores the new column.
The first option would require adding a new column to the existing array. This would require copying the data since NumPy does not support this operation without a copy. This is a steep cost for adding one column.
The second option adds a third array to our collection of arrays. Apart from this, no additional operation is necessary. This is very cheap. We now have two Blocks that store integer data. This is a dataframe that is not consolidated.
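The effect of the second option can be observed with a short sketch. It counts Blocks through the private df._mgr attribute, so the nblocks and is_consolidated names are internals that may differ in other pandas versions; the numbers assume pandas 2.x.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.5, 2.5, 3.5],
        "c": [10, 11, 12],
        "d": [10.5, 11.5, 12.5],
    }
)

# NOTE: _mgr, nblocks and is_consolidated are private internals.
blocks_before = df._mgr.nblocks

# Adding a column appends a new Block instead of copying the
# existing integer array.
df["new"] = 100

blocks_after = df._mgr.nblocks
consolidated_after = df._mgr.is_consolidated()
print(blocks_before, blocks_after, consolidated_after)
```

The dataframe starts with one integer Block and one float Block. After the assignment there are two integer Blocks, which is exactly what makes it unconsolidated.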
These differences don’t matter much as long as you only operate on a per-column basis. They will impact the performance of your operations as soon as those operations span multiple columns. For example, performing any axis=1 operation will transpose the data of your dataframe. Transposing is generally zero-copy if performed on a dataframe backed by a single NumPy array. This is no longer true if every column is backed by a different array, so the operation incurs a performance penalty.
It will also require a copy to get all integer columns from your dataframe as a NumPy array.
df[["a", "c", "new"]].to_numpy()
On our unconsolidated dataframe, this will create a copy since the results have to be stored in a single NumPy array. On a consolidated dataframe, the same conversion can return a view, which makes it very cheap.
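The view-versus-copy difference can be checked with np.shares_memory. The sketch below uses two all-integer dataframes (illustrative names, not from the example above) so that the column selection step does not get in the way; the behavior shown is what pandas 2.x does for NumPy-backed data and is not a documented guarantee.

```python
import numpy as np
import pandas as pd

# One integer Block: to_numpy() can hand out a view of the same buffer.
consolidated = pd.DataFrame({"a": [1, 2, 3], "c": [10, 11, 12]})
shares_when_consolidated = np.shares_memory(
    consolidated.to_numpy(), consolidated.to_numpy()
)

# Adding a column creates a second integer Block, so every call to
# to_numpy() has to assemble a fresh array from both Blocks.
fragmented = consolidated.copy()
fragmented["new"] = 100
shares_when_fragmented = np.shares_memory(
    fragmented.to_numpy(), fragmented.to_numpy()
)

print(shares_when_consolidated, shares_when_fragmented)
```

Two calls on the consolidated dataframe return arrays that share memory, while two calls on the fragmented one return independent copies.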
Previous versions often triggered an internal consolidation in certain methods, which in turn caused unpredictable performance behavior. Methods like reset_index were triggering a completely unnecessary consolidation. Most of these were removed over the last couple of releases.
To summarize, a consolidated dataframe is generally better than an unconsolidated one, but the difference depends heavily on the type of operation you want to execute.
Conclusion
We took a brief look behind the scenes of a pandas dataframe. We learned what Blocks and BlockManagers are and how they orchestrate your dataframe. These terms will prove valuable when we look behind the scenes of Copy-on-Write.
Thank you for reading. Feel free to reach out to share your thoughts and feedback. Follow me on Medium to learn more about pandas and Dask.
Pandas Internals Explained was originally published in Better Programming on Medium.