
Pandas was developed by Wes McKinney at the hedge fund AQR to enable quick analysis of financial data. It extends NumPy with labeled data structures and supports vectorized operations, enabling fast manipulation of financial information.

I’ll be using company data provided by an Intrinio developer sandbox. If you want to follow along, you can find the code and the data in the Analyzing Alpha GitHub repo.

													
```python
import pandas as pd
import numpy as np
import urllib.request

url = 'https://raw.githubusercontent.com/leosmigel ... /analyzingalpha/master/2019-09-30-data-manipulation-with-python/companies.csv'
with urllib.request.urlopen(url) as f:
    companies = pd.read_csv(f, index_col='id')
companies[['name', 'sector']].head()
```

```
                   name            sector
id
1             Apple Inc    Consumer Goods
2   American Express Co         Financial
3             Boeing Co  Industrial Goods
4       Caterpillar Inc  Industrial Goods
5     Cisco Systems Inc        Technology
```

Making Your Data Tidy

While it’s not needed for these simple examples, I want to introduce Tidy Data. When working with large datasets, and especially for factor analysis, you’ll want to make your life easier and tidy your dataset using pandas.melt. It makes the data easier to analyze and usually more performant. For those who are interested, there’s a great article on Medium on Tidying Datasets in Python.
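To make the idea concrete, here's a minimal sketch of pandas.melt turning a made-up "wide" table into tidy "long" form (the revenue figures are invented for illustration):

```python
import pandas as pd

# A small wide table: one row per company, one column per year.
wide = pd.DataFrame({
    'name': ['Apple Inc', 'Boeing Co'],
    '2018': [265.6, 101.1],
    '2019': [260.2, 76.6],
})

# melt reshapes it into tidy form: one observation per row.
tidy = pd.melt(wide, id_vars='name', var_name='year', value_name='revenue')
print(tidy)
```

Each row of the tidy frame is now a single (company, year, revenue) observation, which is the shape most groupby and plotting operations expect.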

String Operations in Pandas

Often, we’ll need to manipulate string data that is formatted incorrectly. Pandas provides vectorized string operations for this. For most string operations, you’ll need to be familiar with regular expressions.

| Method | Description |
| --- | --- |
| pandas.Series.str.split | Splits string on specified delimiter |
| pandas.Series.str.replace | Replaces string on match of string or regex |
| pandas.Series.str.extract | Extracts string on regex group match |
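Before the extract example below, here's a hedged sketch of the first two methods on a made-up ticker series (the tickers and suffix are invented for illustration):

```python
import pandas as pd

tickers = pd.Series(['AAPL US', 'BA US', 'CAT US'])

# split: break each string on a delimiter; expand=True returns columns
parts = tickers.str.split(' ', expand=True)

# replace: strip the trailing country code with a regex instead
symbols = tickers.str.replace(r'\s+US$', '', regex=True)
print(parts)
print(symbols.tolist())
```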

Let’s perform an example extract operation by smushing some of our existing data together.

											
```python
companies_smushed = pd.DataFrame()
companies_smushed['sector_employees'] = companies['sector'] + \
    companies['employees'].astype(str)
print(companies[['sector', 'employees']].head())
print(companies_smushed.head())
companies_smushed = companies_smushed['sector_employees'].str.extract(r'(\D+)(\d+)')
companies_smushed.head()
```

```
             sector  employees
0    Consumer Goods     132000
1         Financial      59000
2  Industrial Goods     153027
3  Industrial Goods     104000
4        Technology      74200
         sector_employees
0    Consumer Goods132000
1          Financial59000
2  Industrial Goods153027
3  Industrial Goods104000
4         Technology74200
                  0       1
0    Consumer Goods  132000
1         Financial   59000
2  Industrial Goods  153027
3  Industrial Goods  104000
4        Technology   74200
```

Grouping Operations: Split-Apply-Combine

Grouping operations can be broken down into three steps:

  1. Split the data into groups
  2. Apply a function to each group
  3. Combine the groups into a result

Before we use pandas to group and modify our data, let’s look at how we could count the number of companies in each sector using plain Python.

									
```python
sector_count = {}
for sector in companies['sector']:
    if sector_count.get(sector):
        sector_count.update({sector: sector_count[sector] + 1})
    else:
        sector_count.update({sector: 1})
print(sector_count)
```

```
{'Consumer Goods': 4, 'Financial': 5, 'Industrial Goods': 5, 'Technology': 5, 'Basic Materials': 2, 'Services': 3, 'Healthcare': 4}
```
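As an aside, the hand-rolled dictionary above is essentially what the standard library's collections.Counter does for you; a minimal sketch with made-up sector values:

```python
from collections import Counter

# Counter builds the same tally in one call.
sectors = ['Consumer Goods', 'Financial', 'Financial', 'Technology']
print(Counter(sectors))
```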
							
						

Grouping

pandas.DataFrame.groupby gives us a better way to group data. groupby returns a DataFrameGroupBy or a SeriesGroupBy object, which can be thought of as a blueprint for the groups. You can see below that sector_group.groups returns a dictionary whose keys are the sectors and whose values are the associated rows. The Financial group, for example, contains rows [1, 6, 8, 14, 23].

							
```python
sector_group = companies.groupby(by='sector')
print(sector_group)
print(sector_group.groups)
print(sector_group.get_group('Financial'))
```

```
{'Basic Materials': Int64Index([5, 27], dtype='int64'), 'Consumer Goods': Int64Index([0, 15, 20, 22], dtype='int64'), 'Financial': Int64Index([1, 6, 8, 14, 23], dtype='int64'), 'Healthcare': Int64Index([13, 18, 21, 24], dtype='int64'), 'Industrial Goods': Int64Index([2, 3, 7, 12, 17], dtype='int64'), 'Services': Int64Index([9, 16, 26], dtype='int64'), 'Technology': Int64Index([4, 10, 11, 19, 25], dtype='int64')}
    id                     name  ...                   industry_group  employees
1    2      American Express Co  ...                  Credit Services      59000
6   25                 Visa Inc  ...                  Credit Services      17000
8    8  Goldman Sachs Group Inc  ...  Investment Brokerage - National      36600
14  13      JPMorgan Chase & Co  ...               Money Center Banks     256105
23  22  Travelers Companies Inc  ...    Property & Casualty Insurance      30400

[5 rows x 7 columns]
```

To show that the groupby object is just a blueprint, we can replicate get_group by using iloc.

						
```python
print(companies.iloc[[1, 6, 8, 14, 23], [1, 5, 6]])
```

```
                       name                   industry_group  employees
1       American Express Co                  Credit Services      59000
6                  Visa Inc                  Credit Services      17000
8   Goldman Sachs Group Inc  Investment Brokerage - National      36600
14      JPMorgan Chase & Co               Money Center Banks     256105
23  Travelers Companies Inc    Property & Casualty Insurance      30400
```

In summary, groupby creates a blueprint that enables us to run many useful operations on the group. A groupby object is efficient because it lets us compute values for many groups from one pass over the data instead of duplicating it.
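A groupby object is also iterable, which is another way to see the split step in action; a small sketch on an invented two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    'sector': ['Financial', 'Technology', 'Financial'],
    'employees': [59000, 74200, 17000],
})

# Iterating a groupby yields (key, sub-DataFrame) pairs -- the "split" step.
for sector, group in df.groupby('sector'):
    print(sector, len(group))
```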

Alternatively, we can lean on Pandas itself and use an aggregation method to return the number of companies in each sector.

Aggregation

Aggregation takes a set of values and returns a value of lesser dimension; for example, a function that takes a series and returns a single scalar. Every groupby aggregation has, implicitly or explicitly, a grouping column, an aggregation column, and a function. Common aggregation functions are shown below:

| Function | Description |
| --- | --- |
| mean() | Compute mean of groups |
| sum() | Compute sum of group values |
| size() | Compute group sizes |
| count() | Compute count of group |
| std() | Standard deviation of groups |
| var() | Compute variance of groups |
| sem() | Standard error of the mean of groups |
| describe() | Generates descriptive statistics |
| first() | Compute first of group values |
| last() | Compute last of group values |
| nth() | Take nth value, or a subset if n is a list |
| min() | Compute min of group values |
| max() | Compute max of group values |
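To see a few of these functions side by side, here's a minimal sketch on an invented frame (not the article's dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'sector': ['Financial', 'Financial', 'Technology'],
    'employees': [59000, 17000, 74200],
})

# Group once, then call several aggregations on the same groupby object.
grouped = df.groupby('sector')['employees']
print(grouped.mean())   # average employees per sector
print(grouped.size())   # number of rows per sector
print(grouped.first())  # first value in each group
```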
				
```python
print(sector_group.size())
print(sector_group.size().sum())
```

```
sector
Basic Materials     2
Consumer Goods      4
Financial           5
Healthcare          4
Industrial Goods    5
Services            3
Technology          5
dtype: int64
28
```

We can also group by multiple columns. Notice that the aggregation function was applied to the employees column automatically, as it’s the only numeric column.

		
```python
companies.groupby(['sector', 'industry_group'])['employees'].sum()
```

```
sector            industry_group
Basic Materials   Major Integrated Oil & Gas                   119600
Consumer Goods    Beverages - Soft Drinks                       62600
                  Electronic Equipment                         132000
                  Personal Products                             92000
                  Textile - Apparel Footwear & Accessories      73100
Financial         Credit Services                               76000
                  Investment Brokerage - National               36600
                  Money Center Banks                           256105
                  Property & Casualty Insurance                 30400
Healthcare        Drug Manufacturers - Major                   296500
                  Health Care Plans                            300000
Industrial Goods  Aerospace/Defense Products & Services        393027
                  Diversified Machinery                        376516
                  Farm & Construction Machinery                104000
Services          Discount, Variety Stores                    2200000
                  Home Improvement Stores                      413000
                  Restaurants                                  210000
Technology        Business Software & Services                 131000
                  Information Technology Services              350600
                  Networking & Communication Devices            74200
                  Semiconductor - Broad Line                   107400
                  Telecom Services - Domestic                  144500
Name: employees, dtype: int64
```

pandas.core.groupby.DataFrameGroupBy.agg allows us to perform multiple aggregations at once, including user-defined aggregations.

		
```python
def half(column):
    return column.sum() / 2

def total(column):
    return column.sum()

companies.groupby(['sector']).agg({'employees': [np.min, half, total]})
```

```
                 employees
                      amin       half    total
sector
Basic Materials      48600    59800.0   119600
Consumer Goods       62600   179850.0   359700
Financial            17000   199552.5   399105
Healthcare           69000   298250.0   596500
Industrial Goods     93516   436771.5   873543
Services            210000  1411500.0  2823000
Technology           74200   403850.0   807700
```

Transformation

Transformation returns manipulated data in the same size as the passed data. There are many methods and functions at your disposal to transform data, as shown below.

| Method / Function | Description |
| --- | --- |
| Series.map | Substitute each value for another |
| Series.apply | Invoke a function elementwise on a series |
| DataFrame.applymap | Apply a function elementwise to a dataframe |
| DataFrame.apply | Invoke a function on each column or row |
| Series.transform | Invoke a function returning series of transformed values |
| DataFrame.transform | Invoke a function returning dataframe of transformed values |
| pandas.melt | Reshape a dataframe; useful for graphing, vectorized operations, and tidying data |
| pandas.pivot | Reshape a dataframe; essentially an unmelt operation |
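As a quick illustration of the elementwise methods in the table, here's a hedged sketch using map and transform on a made-up employee-count series:

```python
import pandas as pd

s = pd.Series([132000, 59000, 74200])

# map: substitute each value for another, elementwise
thousands = s.map(lambda x: x / 1000)

# transform: same-shaped series in, same-shaped series out
doubled = s.transform(lambda x: x * 2)
print(thousands.tolist())
print(doubled.tolist())
```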

In the example below, we pass the user-defined function find_percent to transform and get back the transformed series.

		
```python
def find_percent(column):
    return column / float(column.sum())

companies.groupby('sector').agg({'employees': 'sum'}).transform(find_percent)
```

```
                  employees
sector
Basic Materials    0.020003
Consumer Goods     0.060159
Financial          0.066749
Healthcare         0.099763
Industrial Goods   0.146098
Services           0.472141
Technology         0.135086
```

As with agg, we can pass multiple functions, including lambda functions.

		
```python
companies.groupby('sector').agg({'employees': 'sum'}).transform([lambda x: x / x.sum()])
```

```
                  employees
sector
Basic Materials    0.020003
Consumer Goods     0.060159
Financial          0.066749
Healthcare         0.099763
Industrial Goods   0.146098
Services           0.472141
Technology         0.135086
```

While we can’t pass multiple functions to apply as we can with transform, apply lets us access other columns within the group, where transform limits us to one column at a time. Again, it’s best to know what’s available and pick the best tool for the job at hand.

		
```python
companies.groupby('sector').apply(lambda x: x['employees'] * 2)
```

```
sector
Basic Materials   5      97200
                  27    142000
Consumer Goods    0     264000
                  15    125200
                  20    146200
                  22    184000
Financial         1     118000
                  6      34000
                  8      73200
                  14    512210
                  23     60800
Healthcare        13    270200
                  18    138000
                  21    184800
                  24    600000
Industrial Goods  2     306054
                  3     208000
                  7     566000
                  12    480000
                  17    187032
Services          9     826000
                  16    420000
                  26   4400000
Technology        4     148400
                  10    701200
                  11    214800
                  19    262000
                  25    289000
Name: employees, dtype: int64
```
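Where apply really shines is combining columns within each group, which transform cannot do; a sketch on an invented frame with a hypothetical revenue column:

```python
import pandas as pd

df = pd.DataFrame({
    'sector': ['Financial', 'Financial', 'Technology'],
    'employees': [59000, 17000, 74200],
    'revenue': [43.6, 20.6, 51.9],
})

# apply receives each group as a DataFrame, so the lambda can
# combine two columns -- here, average revenue per employee by sector.
per_head = df.groupby('sector')[['employees', 'revenue']].apply(
    lambda g: (g['revenue'] / g['employees']).mean()
)
print(per_head)
```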
	

Filtration

filter returns a subset of the original data. It works similarly to boolean indexing, except that instead of operating on individual rows it operates on whole groups: the function passed to filter must return True or False for the group as a whole. The Services sector is the only sector with more than 1,000,000 total employees.

		
```python
companies.groupby('sector').filter(
    lambda x: x['employees'].sum() > 1000000
)[['name', 'employees']]
```

```
              name  employees
id
9   Home Depot Inc     413000
15  McDonalds Corp     210000
27     Walmart Inc    2200000
```

Pivot Tables

We can produce the same data in a different format by using pandas.pivot_table.

		
```python
companies.pivot_table(columns='sector', values='employees', aggfunc='sum')
```

```
sector     Basic Materials  Consumer Goods  ...
employees           119600          359700  ...
```

Joining Data

We can quickly join two dataframes with join, merge, or concat. merge is the underlying function for all join/merge behavior; join has slightly different defaults and is provided as a convenience method.

| Method / Function | Description |
| --- | --- |
| pandas.DataFrame.join | Join dataframes on index or key column |
| pandas.DataFrame.merge | Merge dataframes using database-style joins |
| pandas.concat | Concatenate dataframes along a particular axis |
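For comparison with the join example below, here's a minimal merge sketch on two tiny invented frames keyed on an id column:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2], 'name': ['Apple Inc', 'Boeing Co']})
right = pd.DataFrame({'id': [1, 2], 'ticker': ['AAPL', 'BA']})

# merge: a database-style inner join on the shared key column
merged = pd.merge(left, right, on='id', how='inner')
print(merged)
```

Unlike join, which aligns on the index by default, merge joins on columns unless told otherwise.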

Let’s see an example in action by bringing in the security data and joining on the id column.

		
```python
import pandas as pd
import numpy as np
import urllib.request

url = 'https://raw.githubusercontent.com/leosmigel ... /analyzingalpha/master/data-manipulation-with-python/securities.csv'
with urllib.request.urlopen(url) as f:
    securities = pd.read_csv(f, index_col='id')
print(securities.head())
securities_companies = companies.join(securities)
securities_companies[['name', 'ticker', 'figi']].head(7)
```

```
   ticker currency          figi
id
1    AAPL      USD  BBG000B9Y5X2
2     AXP      USD  BBG000BCR153
3      BA      USD  BBG000BCSV38
4     CAT      USD  BBG000BF0LJ6
5    CSCO      USD  BBG000C3JBN9
```

```
                   name ticker          figi
id
1             Apple Inc   AAPL  BBG000B9Y5X2
2   American Express Co    AXP  BBG000BCR153
3             Boeing Co     BA  BBG000BCSV38
4       Caterpillar Inc    CAT  BBG000BF0LJ6
5     Cisco Systems Inc   CSCO  BBG000C3JBN9
6          Chevron Corp    CVX  BBG000K4NHJ5
25             Visa Inc      V  BBG000PSL0X0
```
