Pandas was developed by Wes McKinney at the hedge fund AQR to enable quick analysis of financial data. It extends NumPy and supports vectorized operations, enabling fast manipulation of financial information.
I’ll be using company data provided by an Intrinio developer sandbox. If you want to follow along, you can find the code and the data on the Analyzing Alpha GitHub repo.
import pandas as pd
import numpy as np
import urllib.request

url = 'https://raw.githubusercontent.com/leosmigel ... /analyzingalpha/master/2019-09-30-data-manipulation-with-python/companies.csv'
with urllib.request.urlopen(url) as f:
    companies = pd.read_csv(f, index_col='id')

companies[['name', 'sector']].head()
                   name            sector
id
1             Apple Inc    Consumer Goods
2   American Express Co         Financial
3             Boeing Co  Industrial Goods
4       Caterpillar Inc  Industrial Goods
5     Cisco Systems Inc        Technology
Making Your Data Tidy
While it’s not needed for these simple examples, I want to introduce Tidy Data. When working with large datasets, and especially for factor analysis, you’ll want to make your life easier and tidy your dataset using pandas.melt. Tidy data is easier to analyze and usually more performant. For those who are interested, there’s a great article on Medium on Tidying Dataset in Python.
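To make the idea concrete, here’s a minimal sketch of tidying with pandas.melt. The wide-format price frame below is made up for illustration; it isn’t part of the Intrinio dataset:

```python
import pandas as pd

# Hypothetical wide-format data: one column per ticker
wide = pd.DataFrame({
    'date': ['2019-09-27', '2019-09-30'],
    'AAPL': [218.82, 223.97],
    'CSCO': [48.52, 49.42],
})

# Melt into tidy long format: one row per (date, ticker) observation
tidy = pd.melt(wide, id_vars='date', var_name='ticker', value_name='close')
print(tidy)
```

Each row of the tidy frame now holds exactly one observation, which is the shape groupby and plotting functions work with most naturally.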
String Operations in Pandas
Often, we’ll need to manipulate string data that is formatted incorrectly. Pandas provides [vectorized string operations](/python-for-finance#vectorization). For most string operations, you’ll need to be familiar with regular expressions.
Method | Description |
---|---|
pandas.Series.str.split | Splits string on specified delimiter |
pandas.Series.str.replace | Replaces string on match of string or regex |
pandas.Series.str.extract | Extracts string on regex group match |
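The first two methods in the table can be sketched on a toy series; the ticker strings below are made up for illustration:

```python
import pandas as pd

tickers = pd.Series(['AAPL US', 'CSCO US', 'BA US'])

# str.split on a delimiter; expand=True returns one column per piece
parts = tickers.str.split(' ', expand=True)
print(parts)

# str.replace with a regex strips the country suffix
symbols = tickers.str.replace(r' US$', '', regex=True)
print(symbols.tolist())
```

Both operations are vectorized across the whole series, so there’s no need to loop over the rows yourself.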
Let’s perform an example extract operation by smushing some of our existing data together.
companies_smushed = pd.DataFrame()
companies_smushed['sector_employees'] = companies['sector'] + \
    companies['employees'].astype(str)
print(companies[['sector', 'employees']].head())
print(companies_smushed.head())
companies_smushed = companies_smushed['sector_employees'].str.extract(r'(\D+)(\d+)')
companies_smushed.head()
             sector  employees
0    Consumer Goods     132000
1         Financial      59000
2  Industrial Goods     153027
3  Industrial Goods     104000
4        Technology      74200

         sector_employees
0    Consumer Goods132000
1          Financial59000
2  Industrial Goods153027
3  Industrial Goods104000
4         Technology74200

                  0       1
0    Consumer Goods  132000
1         Financial   59000
2  Industrial Goods  153027
3  Industrial Goods  104000
4        Technology   74200
Grouping Operations: Split-Apply-Combine
Grouping operations can be broken down into three steps:
- Split the data into groups
- Apply a function to each group
- Combine the groups into a result
Before we use pandas to group and modify our data, let’s look at how we could count the number of companies in each sector using plain Python.
sector_count = {}
for sector in companies['sector']:
    if sector_count.get(sector):
        sector_count.update({sector: sector_count[sector] + 1})
    else:
        sector_count.update({sector: 1})
print(sector_count)
{'Consumer Goods': 4, 'Financial': 5, 'Industrial Goods': 5, 'Technology': 5, 'Basic Materials': 2, 'Services': 3, 'Healthcare': 4}
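As an aside, the loop above is essentially what the standard library’s collections.Counter does for us. A minimal sketch, using a made-up list of sectors rather than the full dataset:

```python
from collections import Counter

# Hypothetical sector labels for illustration
sectors = ['Consumer Goods', 'Financial', 'Financial', 'Technology']

# Counter builds the same {label: count} mapping in one call
sector_count = Counter(sectors)
print(sector_count)
```

This is handy for quick checks, but as we’ll see next, pandas’ groupby machinery scales the same idea to arbitrary aggregations.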
Grouping
pandas.DataFrame.groupby gives us a better way to group data. groupby returns a DataFrameGroupBy or a SeriesGroupBy object, which you can think of as a blueprint for the groups. You can see below that sector_group.groups returns a dictionary mapping each sector to its associated rows; for example, the Financial group contains rows [1, 6, 8, 14, 23].
sector_group = companies.groupby(by='sector')
print(sector_group)
print(sector_group.groups)
print(sector_group.get_group('Financial'))
{'Basic Materials': Int64Index([5, 27], dtype='int64'),
 'Consumer Goods': Int64Index([0, 15, 20, 22], dtype='int64'),
 'Financial': Int64Index([1, 6, 8, 14, 23], dtype='int64'),
 'Healthcare': Int64Index([13, 18, 21, 24], dtype='int64'),
 'Industrial Goods': Int64Index([2, 3, 7, 12, 17], dtype='int64'),
 'Services': Int64Index([9, 16, 26], dtype='int64'),
 'Technology': Int64Index([4, 10, 11, 19, 25], dtype='int64')}

    id                     name  ...                   industry_group  employees
1    2      American Express Co  ...                  Credit Services      59000
6   25                 Visa Inc  ...                  Credit Services      17000
8    8  Goldman Sachs Group Inc  ...  Investment Brokerage - National      36600
14  13      JPMorgan Chase & Co  ...               Money Center Banks     256105
23  22  Travelers Companies Inc  ...    Property & Casualty Insurance      30400

[5 rows x 7 columns]
To show that the groupby object is just a blueprint, we can replicate get_group by using iloc.
print(companies.iloc[[1, 6, 8, 14, 23], [1, 5, 6]])
                       name                   industry_group  employees
1       American Express Co                  Credit Services      59000
6                  Visa Inc                  Credit Services      17000
8   Goldman Sachs Group Inc  Investment Brokerage - National      36600
14      JPMorgan Chase & Co               Money Center Banks     256105
23  Travelers Companies Inc    Property & Casualty Insurance      30400
In summary, groupby creates a blueprint that enables us to run many useful operations on the groups. Using a groupby object is efficient, as it lets us compute a value once per group rather than once per row, a one-to-many relationship.
Alternatively, we can lean on pandas and combine boolean indexing with an aggregation method to return the number of companies in each sector.
Aggregation
Aggregation takes a set of values and returns a value of lower dimension; for example, a function that takes a series and returns a single scalar value. Every GroupBy aggregation has, implicitly or explicitly, a grouping column, an aggregating column, and an aggregating function. Common aggregation functions are shown below:
Function | Description |
---|---|
mean() | Compute mean of groups |
sum() | Compute sum of group values |
size() | Compute group sizes |
count() | Compute count of group |
std() | Standard deviation of groups |
var() | Compute variance of groups |
sem() | Standard error of the mean of groups |
describe() | Generates descriptive statistics |
first() | Compute first of group values |
last() | Compute last of group values |
nth() | Take nth value, or a subset if n is a list |
min() | Compute min of group values |
max() | Compute max of group values |
print(sector_group.size())
print(sector_group.size().sum())
sector
Basic Materials     2
Consumer Goods      4
Financial           5
Healthcare          4
Industrial Goods    5
Services            3
Technology          5
dtype: int64
28
We can also group by multiple columns. Notice that the aggregate function is applied to the employees column automatically, as it’s the only numeric column.
companies.groupby(['sector', 'industry_group'])['employees'].sum()
sector            industry_group
Basic Materials   Major Integrated Oil & Gas                  119600
Consumer Goods    Beverages - Soft Drinks                      62600
                  Electronic Equipment                        132000
                  Personal Products                            92000
                  Textile - Apparel Footwear & Accessories     73100
Financial         Credit Services                              76000
                  Investment Brokerage - National              36600
                  Money Center Banks                          256105
                  Property & Casualty Insurance                30400
Healthcare        Drug Manufacturers - Major                  296500
                  Health Care Plans                           300000
Industrial Goods  Aerospace/Defense Products & Services       393027
                  Diversified Machinery                       376516
                  Farm & Construction Machinery               104000
Services          Discount, Variety Stores                   2200000
                  Home Improvement Stores                     413000
                  Restaurants                                 210000
Technology        Business Software & Services                131000
                  Information Technology Services             350600
                  Networking & Communication Devices           74200
                  Semiconductor - Broad Line                  107400
                  Telecom Services - Domestic                 144500
Name: employees, dtype: int64
pandas.core.groupby.DataFrameGroupBy.agg allows us to perform multiple aggregations at once, including user-defined aggregations.
def half(column):
    return column.sum() / 2

def total(column):
    return column.sum()

companies.groupby(['sector']).agg({'employees': [np.min, half, total]})
                 employees
                      amin       half    total
sector
Basic Materials      48600    59800.0   119600
Consumer Goods       62600   179850.0   359700
Financial            17000   199552.5   399105
Healthcare           69000   298250.0   596500
Industrial Goods     93516   436771.5   873543
Services            210000  1411500.0  2823000
Technology           74200   403850.0   807700
Transformation
Transformation returns manipulated data in the same size as the passed data. There are many methods and functions at your disposal to transform data, as shown below.
Method / Function | Description |
---|---|
Series.map | Substitute each value for another |
Series.apply | Invoke a function elementwise on a series |
DataFrame.applymap | Apply a function elementwise to a dataframe |
DataFrame.apply | Invoke a function on each column or row |
Series.transform | Invoke a function returning series of transformed values |
DataFrame.transform | Invoke a function returning dataframe of transformed values |
pandas.melt | Reshape a dataframe. Useful for graphing, vectorized operations, and tidying data. |
pandas.pivot | Reshape a dataframe. Essentially an unmelt operation. |
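The last two rows of the table, melt and pivot, are inverses of each other. A minimal sketch of the “unmelt” direction, using a small made-up long-format frame rather than the companies data:

```python
import pandas as pd

# Hypothetical tidy (long) frame: one row per (date, ticker) observation
long = pd.DataFrame({
    'date': ['2019-09-27', '2019-09-27', '2019-09-30', '2019-09-30'],
    'ticker': ['AAPL', 'CSCO', 'AAPL', 'CSCO'],
    'close': [218.82, 48.52, 223.97, 49.42],
})

# pivot "unmelts" the long frame back into one column per ticker
wide = long.pivot(index='date', columns='ticker', values='close')
print(wide)
```

Melting this wide frame again would round-trip back to the long format, which is why the two reshapes pair so naturally.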
In the example below, we pass a user-defined function, find_percent, and get back the transformed values.
def find_percent(column):
    return column / float(column.sum())

companies.groupby('sector').agg({'employees': 'sum'}).transform(find_percent)
                  employees
sector
Basic Materials    0.020003
Consumer Goods     0.060159
Financial          0.066749
Healthcare         0.099763
Industrial Goods   0.146098
Services           0.472141
Technology         0.135086
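Note that transform can also be called directly on a groupby object, in which case the group result is broadcast back to every original row, keeping the output the same length as the input. A minimal sketch on a small made-up frame:

```python
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame({
    'sector': ['Tech', 'Tech', 'Financial'],
    'employees': [100, 300, 50],
})

# transform('sum') broadcasts each group's total back to its rows,
# so the result aligns row-for-row with the original frame
df['sector_total'] = df.groupby('sector')['employees'].transform('sum')
print(df)
```

This same-length behavior is what makes transform useful for adding per-group statistics as new columns.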
As with agg, we can pass multiple functions, including lambda functions.
companies.groupby('sector').agg({'employees': 'sum'}).transform([lambda x: x / x.sum()])
                  employees
sector
Basic Materials    0.020003
Consumer Goods     0.060159
Financial          0.066749
Healthcare         0.099763
Industrial Goods   0.146098
Services           0.472141
Technology         0.135086
While we can’t pass multiple functions to apply as we can with transform, apply can access other columns of the group, where transform cannot. Again, it’s best to know what’s available to you and use the best tool for the job.
companies.groupby('sector').apply(lambda x: x['employees'] * 2)
sector
Basic Materials   5       97200
                  27     142000
Consumer Goods    0      264000
                  15     125200
                  20     146200
                  22     184000
Financial         1      118000
                  6       34000
                  8       73200
                  14     512210
                  23      60800
Healthcare        13     270200
                  18     138000
                  21     184800
                  24     600000
Industrial Goods  2      306054
                  3      208000
                  7      566000
                  12     480000
                  17     187032
Services          9      826000
                  16     420000
                  26    4400000
Technology        4      148400
                  10     701200
                  11     214800
                  19     262000
                  25     289000
Name: employees, dtype: int64
Filtration
Filter returns a subset of the original data. It works similarly to boolean indexing, except that instead of operating on individual rows it operates on individual groups: filter must return a single True or False value for each group as a whole. The Services sector is the only sector with more than 1,000,000 total employees.
companies.groupby('sector').filter(
    lambda x: x['employees'].sum() > 1000000
)[['name', 'employees']]
              name  employees
id
9   Home Depot Inc     413000
15   McDonalds Corp     210000
27     Walmart Inc    2200000
Pivot Tables
We can produce the same data in a different format by using pandas.pivot_table.
companies.pivot_table(columns='sector', values='employees', aggfunc='sum')
sector     Basic Materials  Consumer Goods  ...
employees           119600          359700
Joining Data
We can quickly join two dataframes with join, merge, and concat. merge is the underlying function for all join/merge behavior; join has slightly different defaults and is provided as a convenience method.
Method / Function | Description |
---|---|
pandas.DataFrame.join | Join dataframe on index, or on a key column |
pandas.DataFrame.merge | Merge dataframe using database-style joins |
pandas.concat | Concatenate dataframes along a particular axis |
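Before joining the real securities data, here’s a minimal sketch of merge and concat on two small made-up frames, to show the difference between a database-style join and a simple stacking:

```python
import pandas as pd

# Hypothetical frames for illustration
left = pd.DataFrame({'id': [1, 2], 'name': ['Apple Inc', 'Boeing Co']})
right = pd.DataFrame({'id': [1, 2], 'ticker': ['AAPL', 'BA']})

# Database-style inner join on the shared id column
merged = left.merge(right, on='id', how='inner')
print(merged)

# concat stacks the frames along an axis instead of matching keys;
# columns that don't overlap are filled with NaN
stacked = pd.concat([left, right], axis=0, sort=False)
print(stacked.shape)
```

merge matches rows by key, while concat simply glues frames together, so choose based on whether your frames share keys or just shape.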
Let’s see an example in action by bringing in the securities data and joining on the id column.
import pandas as pd
import numpy as np
import urllib.request

url = 'https://raw.githubusercontent.com/leosmigel ... /analyzingalpha/master/data-manipulation-with-python/securities.csv'
with urllib.request.urlopen(url) as f:
    securities = pd.read_csv(f, index_col='id')

print(securities.head())
securities_companies = companies.join(securities)
securities_companies[['name', 'ticker', 'figi']].head(7)
   ticker currency          figi
id
1    AAPL      USD  BBG000B9Y5X2
2     AXP      USD  BBG000BCR153
3      BA      USD  BBG000BCSV38
4     CAT      USD  BBG000BF0LJ6
5    CSCO      USD  BBG000C3JBN9
                   name ticker          figi
id
1             Apple Inc   AAPL  BBG000B9Y5X2
2   American Express Co    AXP  BBG000BCR153
3             Boeing Co     BA  BBG000BCSV38
4       Caterpillar Inc    CAT  BBG000BF0LJ6
5     Cisco Systems Inc   CSCO  BBG000C3JBN9
6          Chevron Corp    CVX  BBG000K4NHJ5
25             Visa Inc      V  BBG000PSL0X0