Pandas cheatsheet

This cheatsheet lists the most important and commonly used pandas functionality.

%%html
<style>
.dataframe th {
    font-size: 10.5px;
    color: #3b3b3b;
}
.dataframe td {
    font-size: 10.5px;
    color: #3b3b3b;
}
</style>
import pandas as pd

# Setting optimal options for cheatsheet.
pd.set_option('display.max_rows', 10)

Input & Output

Pandas input and output functions.

| Function | Description |
| --- | --- |
| pd.read_csv() | reads data from a csv file |
| df.to_csv() | writes data to a csv file |
| pd.read_sql() | reads the result of a SQL query or table into a DataFrame |

Snippets

Reading from MySQL

from sqlalchemy import create_engine
import pymysql

db_connection_str = 'mysql+pymysql://user_name:password@mysqlhost/database_name'
db_connection = create_engine(db_connection_str)

df = pd.read_sql('SELECT * FROM customers', con=db_connection)
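
Reading and writing csv files

A minimal sketch; the file paths here are hypothetical.

import pandas as pd

# Read a csv file, using its first column as the index.
df = pd.read_csv('data/customers.csv', index_col=0)

# Write the DataFrame back out without the index column.
df.to_csv('data/customers_copy.csv', index=False)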

Selecting & Indexing

| Code | Description |
| --- | --- |
| df['col'] | Selects a single column |
| df[['col1', 'col2']] | Selects multiple columns. Pass in a list of columns. |
| df.loc[index] | Selects a single row by label |
| df.loc[[index1, index2], [col1, col2]] | Selects the specified rows and columns by label |
| df.loc[start_index:end_index, start_col:end_col] | Slices rows and columns (the end value is included, unlike Python slicing). Standard slicing is supported, e.g. df.loc['index':, ::-1] |
| df.sort_index() | Sorts rows by index |
| df.iloc[] | Selects by position rather than label |
| df.index[position] | Gives the index label at a position |
| df.index[[pos1, pos2, ..., posn]] | Gives the index labels at the requested positions |
| df.set_index([col, col]) | Sets a single- or multi-level index |
| df.sample() | Returns a random sample. You can pass in a count (n) or a fraction (frac). |
| df.columns.get_loc('col_name') | Returns the integer position of a particular column |
| df.columns.get_indexer(['col_name', 'col_name']) | Returns the integer positions of the specified columns |
| \|, &, ~ | Boolean operators (or, and, not) used for boolean indexing |
| series.isin(iterable) | Returns True where a value is in the iterable |
| series/df.where(filter_condition, default_value) | Returns a result with the same shape as the original (not a subset, unlike boolean indexing); values failing the condition are replaced by the default |
| df.query('expression') | Queries based on a string expression |
| df.duplicated([col, col]) | Returns a boolean Series marking duplicate rows |
| df.drop_duplicates([col, col], keep='first') | Drops duplicate rows; keep can be 'first', 'last', or False |
| series.get(index, default=___) | dict-like get method |

Snippets

Using a conditional on a single row to filter columns. The example below keeps only the columns where row 'a' has a positive value.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,4), index=list('abcdef'), columns=list('ABCD'))

df1
A B C D
a 0.786073 0.241026 1.836268 1.130302
b 0.440493 1.617552 -0.068282 -0.022670
c -1.559774 0.603044 -1.606989 0.814513
d -0.091140 -0.616634 -0.000725 0.047443
e 0.274607 0.033709 -0.676457 -0.480869
f 0.273721 -0.947775 0.773054 1.152777
# Creates a mask for a single row to filter out the cols ==> df1.loc['a'] > 0

df1.loc[:, df1.loc['a'] > 0]
A B C D
a 0.786073 0.241026 1.836268 1.130302
b 0.440493 1.617552 -0.068282 -0.022670
c -1.559774 0.603044 -1.606989 0.814513
d -0.091140 -0.616634 -0.000725 0.047443
e 0.274607 0.033709 -0.676457 -0.480869
f 0.273721 -0.947775 0.773054 1.152777

Using functions for selection

The function should take the DataFrame as its single argument.

Selecting rows from the DataFrame where the base_salary column is greater than 150K.

import pandas as pd

emp = pd.read_csv('data/employee.csv', index_col='unique_id')
emp
position_title department base_salary gender employment_status hire_date job_date
unique_id
10 ASSISTANT DIRECTOR (EX LVL) Municipal Courts Department 121862.0 Female Active 2006-06-12 2012-10-13
20 LIBRARY ASSISTANT Library 26125.0 Female Active 2000-07-19 2010-09-18
30 POLICE OFFICER Houston Police Department-HPD 45279.0 Male Active 2015-02-03 2015-02-03
40 ENGINEER/OPERATOR Houston Fire Department (HFD) 63166.0 Male Active 1982-02-08 1991-05-25
50 ELECTRICIAN General Services Department 56347.0 Male Active 1989-06-19 1994-10-22
... ... ... ... ... ... ... ...
19960 POLICE OFFICER Houston Police Department-HPD 43443.0 Male Active 2014-06-09 2015-06-09
19970 COMMUNICATIONS CAPTAIN Houston Fire Department (HFD) 66523.0 Male Active 2003-09-02 2013-10-06
19980 POLICE OFFICER Houston Police Department-HPD 43443.0 Male Active 2014-10-13 2015-10-13
19990 POLICE OFFICER Houston Police Department-HPD 55461.0 Male Active 2009-01-20 2011-07-02
20000 FIRE FIGHTER Houston Fire Department (HFD) 51194.0 Male Active 2009-01-12 2010-07-12

2000 rows × 7 columns

emp.loc[lambda x: x['base_salary'] > 150000]
position_title department base_salary gender employment_status hire_date job_date
unique_id
120 CHIEF PHYSICIAN,MD Health & Human Services 180416.0 Male Active 1987-05-22 1999-08-28
440 ASSOCIATE EMS PHYSICIAN DIRECTOR Houston Fire Department (HFD) 165216.0 Male Active 2013-08-31 2013-08-31
1790 ASSOCIATE EMS PHYSICIAN DIRECTOR,MD Houston Fire Department (HFD) 210588.0 Male Active 2001-09-05 2003-05-10
5940 CITY ATTORNEY Legal Department 275000.0 Male Active 2016-05-02 2016-05-02
6060 DEPUTY DIRECTOR (EXECUTIVE LEVEL) Houston Airport System (HAS) 150416.0 Female Active 2013-06-10 2013-06-10
7200 DEPUTY DIRECTOR (EXECUTIVE LEVEL) Houston Police Department-HPD 163228.0 Male Active 1977-07-20 1998-08-08
7430 SENIOR ASSISTANT CITY ATTORNEY II Legal Department 150416.0 Female Active 2012-06-11 2013-07-06
7680 DEPUTY DIRECTOR-FINANCE & ADMINISTRATION Houston Police Department-HPD 199596.0 Male Active 2007-02-01 2007-02-01
8560 DEPUTY DIRECTOR-AVIATION (EX LVL) Houston Airport System (HAS) 186192.0 Male Active 2013-08-22 2014-08-16
12330 DEPUTY DIRECTOR-PUBLIC WORKS (EXECUTIVE Public Works & Engineering-PWE 178331.0 Female Active 1991-02-12 2005-05-14

Combining positional and label-based selection

In the example below, df.index[0:10:4] gives the index label of every 4th row among positions 0 to 10.

emp.loc[emp.index[0:10:4], ['gender', 'base_salary']]
gender base_salary
unique_id
10 Female 121862.0
50 Male 56347.0
90 Male 107962.0

In the example below, df.columns.get_loc('base_salary') returns the integer position of the base_salary column.

 emp.iloc[[0,4], emp.columns.get_loc('base_salary') ]
unique_id
10    121862.0
50     56347.0
Name: base_salary, dtype: float64

In the example below, df.columns.get_indexer(['base_salary', 'gender']) returns the integer positions of all the specified columns.

emp.iloc[[0,4], emp.columns.get_indexer(['base_salary', 'gender' ])]
base_salary gender
unique_id
10 121862.0 Female
50 56347.0 Male

Boolean Indexing

Tip: Use parentheses () to separate the comparisons. Note that & binds more tightly than |, so group conditions explicitly when mixing them.

mask = (emp['base_salary'] > 100000) | (emp['gender'] == 'Female') & (emp['employment_status'] == 'Active')
emp[mask]
position_title department base_salary gender employment_status hire_date job_date
unique_id
10 ASSISTANT DIRECTOR (EX LVL) Municipal Courts Department 121862.0 Female Active 2006-06-12 2012-10-13
20 LIBRARY ASSISTANT Library 26125.0 Female Active 2000-07-19 2010-09-18
90 DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV Public Works & Engineering-PWE 107962.0 Male Active 1993-11-15 2013-01-05
120 CHIEF PHYSICIAN,MD Health & Human Services 180416.0 Male Active 1987-05-22 1999-08-28
360 DENTAL ASSISTANT Health & Human Services 34923.0 Female Active 1982-04-05 1990-11-03
... ... ... ... ... ... ... ...
19840 ADMINISTRATIVE SPECIALIST Admn. & Regulatory Affairs 55172.0 Female Active 2006-10-16 2006-10-16
19890 SENIOR ASSISTANT CITY ATTORNEY II Legal Department 124115.0 Male Active 2013-01-23 2013-03-02
19910 BUILDING MAINTENANCE SUPERVISOR Parks & Recreation 30347.0 Female Active 1995-10-14 2010-03-20
19920 MOBILITY SERVICE OFFICER Houston Police Department-HPD 44429.0 Female Active 2005-09-12 2007-06-02
19950 POLICE CAPTAIN Houston Police Department-HPD 104455.0 Male Active 1983-02-07 2004-07-08

639 rows × 7 columns

A slightly more complicated example.

occupations = ['POLICE OFFICER', 'SENIOR POLICE OFFICER']
mask = ((emp['position_title'].map(lambda x: x in occupations)) & (emp['base_salary'] > 50000))
emp[mask].head(2)
position_title department base_salary gender employment_status hire_date job_date
unique_id
60 SENIOR POLICE OFFICER Houston Police Department-HPD 66614.0 Male Active 1984-11-26 2005-03-26
200 SENIOR POLICE OFFICER Houston Police Department-HPD 66614.0 Male Active 1992-02-17 2007-07-28
# The same filter, written using isin

mask = ((emp['position_title'].isin(occupations)) & (emp['base_salary'] > 50000))
emp[mask].head(2)
position_title department base_salary gender employment_status hire_date job_date
unique_id
60 SENIOR POLICE OFFICER Houston Police Department-HPD 66614.0 Male Active 1984-11-26 2005-03-26
200 SENIOR POLICE OFFICER Houston Police Department-HPD 66614.0 Male Active 1992-02-17 2007-07-28

Where

emp.where(emp['gender'] == 'Female', False)
position_title department base_salary gender employment_status hire_date job_date
unique_id
10 ASSISTANT DIRECTOR (EX LVL) Municipal Courts Department 121862 Female Active 2006-06-12 2012-10-13
20 LIBRARY ASSISTANT Library 26125 Female Active 2000-07-19 2010-09-18
30 False False False False False False False
40 False False False False False False False
50 False False False False False False False
... ... ... ... ... ... ... ...
19960 False False False False False False False
19970 False False False False False False False
19980 False False False False False False False
19990 False False False False False False False
20000 False False False False False False False

2000 rows × 7 columns

Query

 emp.query('base_salary > 100000 and employment_status == "Active" and gender == "Female"')
position_title department base_salary gender employment_status hire_date job_date
unique_id
10 ASSISTANT DIRECTOR (EX LVL) Municipal Courts Department 121862.0 Female Active 2006-06-12 2012-10-13
670 PUBLIC HEALTH DENTIST,DDS Health & Human Services 100791.0 Female Active 2015-12-28 2015-12-28
2380 ASSISTANT DIRECTOR (EXECUTIVE LEVEL) Admn. & Regulatory Affairs 130416.0 Female Active 2002-05-24 2013-07-20
3670 DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV Mayor's Office 110000.0 Female Active 2014-05-13 2014-05-13
5230 DEPUTY ASSISTANT DIRECTOR (EX LVL) Houston Airport System (HAS) 110686.0 Female Active 2011-11-07 2011-11-07
... ... ... ... ... ... ... ...
15610 SENIOR ASSISTANT CITY ATTORNEY II Legal Department 122220.0 Female Active 2005-09-12 2012-07-21
17790 DIVISION MANAGER Human Resources Dept. 110547.0 Female Active 2000-01-29 2010-12-25
17850 JUDGE OF MUNICIPAL COURTS Municipal Courts Department 117176.0 Female Active 1999-09-25 2010-11-13
19440 ASSISTANT DIRECTOR (EXECUTIVE LEVEL) Houston Airport System (HAS) 128606.0 Female Active 1993-02-22 2004-09-04
19470 SENIOR ASSISTANT CITY ATTORNEY II Legal Department 117831.0 Female Active 2005-07-18 2012-07-21

21 rows × 7 columns
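
Duplicates and sampling

A minimal sketch using the emp DataFrame above; the sample output is random.

# Boolean Series marking rows that repeat an earlier department / position_title pair.
emp.duplicated(['department', 'position_title'])

# Keep only the first row for each department / position_title combination.
emp.drop_duplicates(['department', 'position_title'], keep='first')

# A random sample of 5 rows (pass frac= for a fraction instead of a count).
emp.sample(n=5)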

MultiIndexing and Advanced Indexing

| Code | Description |
| --- | --- |
| pd.MultiIndex.from_* | Set of methods that create a MultiIndex from various sources such as tuples, arrays ... |

Setting multiple index levels on a DataFrame.

import pandas as pd

emp = pd.read_csv('data/employee.csv')
emp = emp.set_index(['gender','unique_id'])
emp
position_title department base_salary employment_status hire_date job_date
gender unique_id
Female 10 ASSISTANT DIRECTOR (EX LVL) Municipal Courts Department 121862.0 Active 2006-06-12 2012-10-13
20 LIBRARY ASSISTANT Library 26125.0 Active 2000-07-19 2010-09-18
Male 30 POLICE OFFICER Houston Police Department-HPD 45279.0 Active 2015-02-03 2015-02-03
40 ENGINEER/OPERATOR Houston Fire Department (HFD) 63166.0 Active 1982-02-08 1991-05-25
50 ELECTRICIAN General Services Department 56347.0 Active 1989-06-19 1994-10-22
... ... ... ... ... ... ...
19960 POLICE OFFICER Houston Police Department-HPD 43443.0 Active 2014-06-09 2015-06-09
19970 COMMUNICATIONS CAPTAIN Houston Fire Department (HFD) 66523.0 Active 2003-09-02 2013-10-06
19980 POLICE OFFICER Houston Police Department-HPD 43443.0 Active 2014-10-13 2015-10-13
19990 POLICE OFFICER Houston Police Department-HPD 55461.0 Active 2009-01-20 2011-07-02
20000 FIRE FIGHTER Houston Fire Department (HFD) 51194.0 Active 2009-01-12 2010-07-12

2000 rows × 6 columns

emp.loc['Female']
position_title department base_salary employment_status hire_date job_date
unique_id
10 ASSISTANT DIRECTOR (EX LVL) Municipal Courts Department 121862.0 Active 2006-06-12 2012-10-13
20 LIBRARY ASSISTANT Library 26125.0 Active 2000-07-19 2010-09-18
360 DENTAL ASSISTANT Health & Human Services 34923.0 Active 1982-04-05 1990-11-03
370 SENIOR INSPECTOR Public Works & Engineering-PWE 60258.0 Active 2004-08-03 2007-03-10
390 SENIOR MICROCOMPUTER ANALYST Health & Human Services 67499.0 Active 1994-06-29 1997-08-30
... ... ... ... ... ... ...
19770 SENIOR SIDELOADER OPERATOR Solid Waste Management 34466.0 Active 2015-07-20 2016-01-30
19790 POLICE SERVICE OFFICER Houston Police Department-HPD 27914.0 Active 2011-10-14 2014-03-01
19840 ADMINISTRATIVE SPECIALIST Admn. & Regulatory Affairs 55172.0 Active 2006-10-16 2006-10-16
19910 BUILDING MAINTENANCE SUPERVISOR Parks & Recreation 30347.0 Active 1995-10-14 2010-03-20
19920 MOBILITY SERVICE OFFICER Houston Police Department-HPD 44429.0 Active 2005-09-12 2007-06-02

603 rows × 6 columns

emp.loc['Female'].loc[10]
position_title       ASSISTANT DIRECTOR (EX LVL)
department           Municipal Courts Department
base_salary                               121862
employment_status                         Active
hire_date                             2006-06-12
job_date                              2012-10-13
Name: 10, dtype: object
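
The same row can be selected in one step by passing a tuple of labels to .loc, and a MultiIndex can also be built directly. A small sketch based on the emp DataFrame above:

# Select one row by (outer level, inner level) labels.
emp.loc[('Female', 10)]

# Build a MultiIndex explicitly, e.g. from tuples.
pd.MultiIndex.from_tuples([('Female', 10), ('Male', 30)], names=['gender', 'unique_id'])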

Setting options

You can set pandas options using the set_option method, e.g. pd.set_option('display.width', None).

| Option | Description |
| --- | --- |
| display.width | Set to None to use the full width of the terminal |
| display.max_columns | Set to None to display all columns |
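
For example (a short sketch; reset_option restores the default):

import pandas as pd

# Use the full terminal width and show every column.
pd.set_option('display.width', None)
pd.set_option('display.max_columns', None)

# Restore a single option to its default value.
pd.reset_option('display.max_columns')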

Text

Refer to the following link for the various string-related methods and their uses.

String methods are accessed via the str accessor, i.e. Series.str.method(); to chain them, follow the pattern Series.str.method().str.method().

Snippets

emp[emp['department'].str.contains('Lib')]
position_title department base_salary employment_status hire_date job_date
gender unique_id
Female 20 LIBRARY ASSISTANT Library 26125.0 Active 2000-07-19 2010-09-18
1020 CUSTOMER SERVICE CLERK Library NaN Active 2016-02-29 2016-02-29
Male 2810 SENIOR CUSTOMER SERVICE CLERK Library 31034.0 Active 2012-11-13 2015-12-19
2990 SENIOR LIBRARY SERVICE SPECIALIST Library 38563.0 Active 2012-03-05 2015-02-28
Female 3090 LIBRARIAN II Library 49317.0 Active 2014-08-04 2015-10-10
... ... ... ... ... ... ...
18470 LIBRARY ASSISTANT Library 26125.0 Active 2008-05-05 2013-01-19
Male 18970 DATA BASE ANALYST Library 59371.0 Active 2005-08-06 2013-10-26
Female 19270 LIBRARIAN I Library 39874.0 Active 2009-02-11 2014-10-11
19640 SENIOR COMMUNITY LIAISON Library 51542.0 Active 2015-11-02 2015-11-02
19690 CUSTOMER SERVICE CLERK Library NaN Active 2013-08-05 2013-08-05

36 rows × 6 columns

 emp['department'].str.lower().str.split()
gender  unique_id
Female  10              [municipal, courts, department]
        20                                    [library]
Male    30            [houston, police, department-hpd]
        40           [houston, fire, department, (hfd)]
        50              [general, services, department]
                                    ...                
        19960         [houston, police, department-hpd]
        19970        [houston, fire, department, (hfd)]
        19980         [houston, police, department-hpd]
        19990         [houston, police, department-hpd]
        20000        [houston, fire, department, (hfd)]
Name: department, Length: 2000, dtype: object

Merge & Joins
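
A minimal sketch of a left join on a shared key column; the DataFrames here are hypothetical.

import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 2], 'amount': [250, 120, 80]})
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cho']})

# Left join: keep every order and attach the matching customer name.
orders.merge(customers, on='customer_id', how='left')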

Grouping & Aggregations

| Code | Description |
| --- | --- |
| df.groupby('colname') | Group by a single column |
| df.groupby(['colname', 'colname']) | Group by multiple columns, in order of listing |
| df.groupby(colname)[col].agg(agg_function) | Aggregate a column after grouping |
| df.groupby([col, col])[[col, col]].agg([func1, func2]) | Aggregate multiple functions on multiple columns after grouping by multiple columns |
| df.groupby([col, col])[[col, col]].agg({col: [func1, func2], col: [func3, func4]}) | Same as above, with a separate set of aggregations per column |
| df.groupby(colname)[col].agg(agg_function).rename(columns={}) | Rename the resulting columns |
| df.groupby(colname).agg(final_col_name=pd.NamedAgg(column=aggregate_on_col, aggfunc=func_name)) | Named aggregation |

Snippets

flights = pd.read_csv('data/flights.csv')
flights
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58487 12 31 4 AA SFO DFW 515 5.0 166.0 1464 1045 -19.0 0 0
58488 12 31 4 F9 LAS SFO 1910 13.0 71.0 414 2050 4.0 0 0
58489 12 31 4 OO SFO SBA 1846 -6.0 46.0 262 1956 -5.0 0 0
58490 12 31 4 WN MSP ATL 525 39.0 124.0 907 855 34.0 0 0
58491 12 31 4 OO SFO BOI 859 5.0 73.0 522 1146 -1.0 0 0

58492 rows × 14 columns

flights.groupby('AIRLINE').describe()
MONTH DAY ... DIVERTED CANCELLED
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
AIRLINE
AA 8900.0 6.661461 3.386332 1.0 4.00 7.0 9.0 12.0 8900.0 15.570562 ... 0.0 1.0 8900.0 0.017303 0.130407 0.0 0.0 0.0 0.0 1.0
AS 768.0 6.197917 3.343428 1.0 3.75 6.0 9.0 12.0 768.0 15.247396 ... 0.0 0.0 768.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
B6 543.0 6.368324 3.380085 1.0 4.00 6.0 9.0 12.0 543.0 15.788214 ... 0.0 1.0 543.0 0.001842 0.042914 0.0 0.0 0.0 0.0 1.0
DL 10601.0 6.277332 3.340809 1.0 3.00 6.0 9.0 12.0 10601.0 15.721158 ... 0.0 1.0 10601.0 0.003585 0.059767 0.0 0.0 0.0 0.0 1.0
EV 5858.0 6.003926 3.318475 1.0 3.00 6.0 8.0 12.0 5858.0 15.866166 ... 0.0 1.0 5858.0 0.024923 0.155904 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
OO 6588.0 6.134791 3.331999 1.0 3.00 6.0 9.0 12.0 6588.0 15.651943 ... 0.0 1.0 6588.0 0.021554 0.145234 0.0 0.0 0.0 0.0 1.0
UA 7792.0 6.322510 3.326315 1.0 4.00 6.0 9.0 12.0 7792.0 15.697510 ... 0.0 1.0 7792.0 0.011935 0.108602 0.0 0.0 0.0 0.0 1.0
US 1615.0 3.628483 1.714630 1.0 2.00 4.0 5.0 6.0 1615.0 16.079876 ... 0.0 1.0 1615.0 0.013003 0.113322 0.0 0.0 0.0 0.0 1.0
VX 993.0 6.336354 3.439813 1.0 4.00 6.0 9.0 12.0 993.0 15.518630 ... 0.0 1.0 993.0 0.006042 0.077536 0.0 0.0 0.0 0.0 1.0
WN 8418.0 6.298646 3.373491 1.0 3.00 6.0 9.0 12.0 8418.0 15.746971 ... 0.0 1.0 8418.0 0.011048 0.104532 0.0 0.0 0.0 0.0 1.0

14 rows × 88 columns

flights.groupby('AIRLINE')['DEP_DELAY'].mean()
AIRLINE
AA    11.274057
AS     1.808594
B6    14.287823
DL     7.242450
EV     9.092625
        ...    
OO     9.662279
UA    16.210485
US     4.361355
VX     8.918946
WN    12.639366
Name: DEP_DELAY, Length: 14, dtype: float64
flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg([np.sum, np.mean, np.std])
CANCELLED DIVERTED
sum mean std sum mean std
AIRLINE WEEKDAY
AA 1 41 0.032106 0.176352 6 0.004699 0.068411
2 9 0.007341 0.085399 2 0.001631 0.040373
3 16 0.011949 0.108698 2 0.001494 0.038633
4 20 0.015004 0.121613 5 0.003751 0.061153
5 18 0.014151 0.118160 1 0.000786 0.028039
... ... ... ... ... ... ... ...
WN 3 18 0.014118 0.118022 2 0.001569 0.039590
4 10 0.007911 0.088629 4 0.003165 0.056188
5 7 0.005828 0.076153 0 0.000000 0.000000
6 10 0.010132 0.100196 3 0.003040 0.055076
7 7 0.006066 0.077681 3 0.002600 0.050943

98 rows × 6 columns

flights.groupby(['AIRLINE'])[['AIRLINE','CANCELLED', 'DEP_DELAY']]\
    .agg({'AIRLINE':'count','CANCELLED':[np.sum], 'DEP_DELAY':[np.mean, np.std]})\
    .rename(columns={'count':'Total Flights', 'sum':'Total Cancelled','std':'Standard Deviation', 'mean':'Average Delay'})
AIRLINE CANCELLED DEP_DELAY
Total Flights Total Cancelled Average Delay Standard Deviation
AIRLINE
AA 8900 154 11.274057 40.787806
AS 768 0 1.808594 29.543702
B6 543 1 14.287823 36.863416
DL 10601 38 7.242450 29.759377
EV 5858 146 9.092625 34.582917
... ... ... ... ...
OO 6588 142 9.662279 33.413767
UA 7792 93 16.210485 44.834336
US 1615 21 4.361355 24.550930
VX 993 6 8.918946 31.782085
WN 8418 93 12.639366 30.506549

14 rows × 4 columns

Named Aggregation

flights.groupby(['AIRLINE'])\
    .agg(
         **{
             "Total Flights":pd.NamedAgg(column='AIRLINE',aggfunc='count'),
             "Total Cancelled":pd.NamedAgg(column='CANCELLED',aggfunc=np.sum),
             "Average_Delay":pd.NamedAgg(column='DEP_DELAY',aggfunc=np.mean),
             "Std Dev on Departure Delay":pd.NamedAgg(column='DEP_DELAY',aggfunc=np.std),
             "Funny Val":pd.NamedAgg(column='DEP_DELAY',aggfunc=lambda x: (np.std(x) + 2))  # Can supply lambda as well.
         }
)
Total Flights Total Cancelled Average_Delay Std Dev on Departure Delay Funny Val
AIRLINE
AA 8900 154 11.274057 40.787806 42.785475
AS 768 0 1.808594 29.543702 31.524462
B6 543 1 14.287823 36.863416 38.829394
DL 10601 38 7.242450 29.759377 31.757969
EV 5858 146 9.092625 34.582917 36.579895
... ... ... ... ... ...
OO 6588 142 9.662279 33.413767 35.411180
UA 7792 93 16.210485 44.834336 46.831427
US 1615 21 4.361355 24.550930 26.543228
VX 993 6 8.918946 31.782085 33.765981
WN 8418 93 12.639366 30.506549 32.504717

14 rows × 5 columns

TimeSeries

Working with dates and times is a complex topic; only a small subset of the functionality is covered here.

When working with pandas time series, you can use the following time-related concepts:

  • Date time - A specific date and time with timezone support

  • Time deltas - absolute time duration

  • Time Spans - A span of time defined by a point in time and its associated frequency.

  • Date Offsets - A relative time duration that respects calendar arithmetic.

For details on the methods refer to the following link

| Code | Description |
| --- | --- |
| pd.date_range('2020-01-01', periods=10, freq='H') | Creates a range of dates |
| pd.Timedelta | Used for date and time arithmetic |
| df['col'].astype('datetime64') | Converts to datetime. For a timezone use 'datetime64[ns, US/Eastern]'. Supports units from nanoseconds to weeks. |
| pd.to_datetime(df['col']) | Converts to datetime from various formats |
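
Creating a range of dates

# Ten consecutive hourly timestamps starting at midnight on 2020-01-01.
pd.date_range('2020-01-01', periods=10, freq='H')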

TimeDelta for calculations

day1 = pd.Timestamp('2020-01-01')
day1 + pd.Timedelta(1, 'D')
Timestamp('2020-01-02 00:00:00')
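
Time spans and date offsets

A minimal sketch of pd.Period and pd.DateOffset:

# A time span covering the whole of January 2020.
span = pd.Period('2020-01', freq='M')
span.start_time, span.end_time

# A calendar-aware offset: add one calendar month.
pd.Timestamp('2020-01-15') + pd.DateOffset(months=1)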

Convert objects to datetime

emp['hire_date'] = emp.hire_date.astype('datetime64')
emp['job_date'] = emp.job_date.astype('datetime64')
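
pd.to_datetime is an equivalent alternative; the explicit format string here assumes the dates look like those shown above.

# Parsing with a known format can be faster than letting pandas infer it.
emp['hire_date'] = pd.to_datetime(emp['hire_date'], format='%Y-%m-%d')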
