Jacob Danner

Intro

“Pandas: a Foundational Python Library for Data Analysis and Statistics” is a 2011 paper by Wes McKinney, the creator of pandas. The author begins by outlining the problem that pandas seeks to address: Python's use in scientific computing had grown rapidly, but its adoption for applied statistical modeling lagged behind. He argued that no cohesive framework existed to fill that gap, so he built one. Python lacked a suitable framework, but other tools already occupied the same problem space: the R language, SAS, and SQL. What makes pandas different? In the author's words, the goal of pandas is to "close the gap in the richness of available data analysis tools between Python, a general purpose systems and scientific computing language, and the numerous domain-specific statistical computing platforms and database languages." In my own interpretation, pandas exists to be a better, more comprehensive tool for the job than any competing technology. The paper serves as an overview of pandas functionality.

DataFrames

At the core of pandas is the task of working with labeled data sets. The name pandas comes from PANel DAta, a common term for multidimensional data sets. A layperson can think of such data as an Excel spreadsheet, or as a 2D list of observations and labels: rows and columns. The pandas implementation uses a DataFrame object to represent such data. The DataFrame borrows many ideas from R's data.frame class, but it has additional features and enhancements built in, which allow for more functionality.

Index Object

The pandas Index object underlies every pandas data structure and gives pandas much of its functionality. An Index stores metadata about a data structure: an array of labels, along with the data type of those labels. For example, a table with 3 rows could have an index like:

Index(['row1', 'row2', 'row3'], dtype='object')

If you wanted the rows labelled differently, say with integers, this would work too:

Index([1, 2, 3], dtype='int64')

Under the hood, the Index object is what's responsible for all the powerful data selection methods pandas offers!
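
To make this concrete, here is a minimal sketch of how an Index attaches to a DataFrame. The table, its column name "value", and its numbers are my own invention for illustration; none of them come from the paper.

```python
import pandas as pd

# Hypothetical three-row table; the column name and values are made up.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["row1", "row2", "row3"])

# Every DataFrame carries one Index for its rows and another for its columns.
row_labels = df.index
col_labels = df.columns

# Relabel the rows with integers instead of strings (on a copy).
df_int = df.copy()
df_int.index = [1, 2, 3]

# Label-based lookup goes through the row and column Index objects.
picked = df.loc["row2", "value"]
print(row_labels)
print(df_int.index)
print(picked)
```

Printing the two indexes shows the labels together with their dtype, which is the metadata described above.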

Another powerful feature of pandas is the MultiIndex, which allows rows to have nested (hierarchical) indexes. In practice, this can be extremely useful. Imagine you have a dataset of cars with many different models from different makers. Each row could be indexed by the model name. But what if you wanted to group models by category (truck, SUV, sedan, etc.)? The selection logic could get messy. With a MultiIndex, you could give each row 2 labels: category and model name. That way you could select data.loc['suv'] to grab all the SUVs while keeping each individual car's model name. This allows for much cleaner operations when the data naturally falls into groups.
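
Here is a rough sketch of the car example using a two-level MultiIndex. The model names and prices are placeholders I made up, and set_index is just one of several ways to build a MultiIndex.

```python
import pandas as pd

# Hypothetical car data; model names and prices are invented for illustration.
cars = pd.DataFrame(
    {
        "category": ["suv", "suv", "truck", "sedan"],
        "model": ["Model A", "Model B", "Model C", "Model D"],
        "price": [35000, 42000, 50000, 28000],
    }
)

# Promote two columns to a two-level row index: category first, then model.
data = cars.set_index(["category", "model"]).sort_index()

# Selecting on the outer level returns every SUV, still labelled by model name.
suvs = data.loc["suv"]

# A (category, model) tuple addresses one specific row.
truck_price = data.loc[("truck", "Model C"), "price"]

print(suvs)
print(truck_price)
```

Sorting the index after building it is a habit worth keeping, since pandas can slice a sorted MultiIndex more efficiently.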

Putting the Index to use

Because DataFrames have Index objects, you are able to get or set data in a "matrix like way using labels". This means you can use "A list or array of labels or integers, a slice, either with integers (e.g. 1:5) or labels (e.g. lab1:lab2), a boolean vector, or a single label". The paper describes the .ix indexer, which accepts the previously mentioned selectors to access the requested rows and columns. Since then, .ix has been deprecated and later removed from pandas. Today, .loc[] (label-based) and .iloc[] (integer-position-based) are used to perform the task. They are more explicit ways of doing essentially the same thing.
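
The selectors listed above can be sketched with the modern indexers. The frame below is hypothetical: the labels lab1 through lab4 echo the quoted example, while the columns "a" and "b" are my own.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]},
    index=["lab1", "lab2", "lab3", "lab4"],
)

single = df.loc["lab2", "b"]                 # a single label
label_slice = df.loc["lab1":"lab3"]          # label slice, both ends included
position_slice = df.iloc[1:3]                # integer slice, end excluded
label_list = df.loc[["lab1", "lab4"], ["a"]] # lists of labels
masked = df.loc[df["a"] > 2]                 # boolean vector

# Setting works through the same indexers.
df.loc["lab1", "a"] = 100
```

Note the difference between the two slices: .loc includes the end label, while .iloc follows the usual Python convention of excluding the end position.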

Data Alignment

The next section discusses how data alignment works in pandas. Say, for example, you have 2 datasets. Both use dates as row indexes. Some dates appear in both sets, but each set also has dates that the other lacks. Maybe the dates aren't in the same order. What happens if you want to combine them? Pandas takes care of this.

index(Dataset1 + Dataset2) = union(index(Dataset1), index(Dataset2))

The result is indexed by the union of the two indexes: pandas combines the values whose labels match and keeps the labels that don't. This gives you lots of freedom to combine datasets whose indexes don't match up perfectly, and you don't need to do extra work to line the datasets up beforehand. This works on both rows and columns, and is a feature that (at least in 2011) is unique to pandas. The price of this dynamic, flexible way of joining is that some fields will be null if the datasets don't overlap perfectly. Pandas uses NaN to fill these fields. The API provides many functions for common operations on missing data points.
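
A small sketch of alignment in action, assuming two date-indexed Series whose dates only partly overlap. The dates and values are invented for illustration.

```python
import pandas as pd

# Hypothetical date-indexed series; s2 is deliberately out of order.
s1 = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
)
s2 = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2011-01-04", "2011-01-03", "2011-01-02"]),
)

# Values are matched by label, not by position.
# Dates present in only one series come out as NaN.
total = s1 + s2

# Two common ways to handle the resulting gaps:
overlap_only = total.dropna()          # keep only dates present in both
filled = s1.add(s2, fill_value=0)      # treat a missing side as 0

print(total)
print(filled)
```

Here total has four dates, two of them NaN, while filled keeps all four dates with a real number in each.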

...And Beyond

Beyond its unique features, pandas also implements good ideas borrowed from other tools, such as creating pivot tables much like those in Excel. Getting a summarized version of a larger table can be very useful for data analysis. The pandas .join() method operates like an SQL join: you specify what to join on (the index, by default), and the tables are joined.
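
As a rough illustration of both ideas, the sketch below summarizes a made-up sales table with pivot_table and then attaches a second table with .join(). All table names, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format sales records.
sales = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west", "east"],
        "product": ["apple", "pear", "apple", "pear", "apple"],
        "units": [3, 5, 2, 7, 4],
    }
)

# Pivot: one row per region, one column per product, summed units in the cells.
summary = sales.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

# A second hypothetical table, indexed by region.
managers = pd.DataFrame({"manager": ["Ana", "Ben"]}, index=["east", "north"])

# Join on the row index, like a SQL LEFT JOIN: every row of summary is kept,
# and regions without a match in managers get a missing value.
joined = summary.join(managers, how="left")

print(summary)
print(joined)
```

Changing how to "inner" or "outer" gives the other familiar SQL join types.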

If I were to summarize the philosophy behind pandas, I would say pandas is designed to be extremely flexible and intuitive, and to provide useful tools that make the programmer's life easier. All those nice features come with some technical overhead, so it won't be as fast as NumPy, but it isn't trying to be. It is built on top of NumPy to provide a coherent framework that is easy to use and powerfully flexible. The author ends the article by writing, "By designing robust, easy-to-use data structures that cohere with the rest of the scientific Python stack, we can make Python a compelling choice for data analysis applications. In our opinion, pandas provides a solid foundation upon which a very powerful data analysis ecosystem can be established." Being from the future, I can tell 2011 Wes McKinney he was right!

Source Paper