Python: Working with Pandas

Pandas is a Python module for working with tabular data (i.e., data in a table with rows and columns). It offers much of the same functionality as SQL or Excel, but adds the power of Python.

  • A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table.
  • You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.
  • DataFrames have rows and columns. Each column has a name, which is a string.
  • Each row has an index, which is an integer.
  • DataFrames can contain many different data types: strings, ints, floats, tuples, etc.

Please refer to the pandas installation guide to install it for Python 3.6.

Pandas has the following three data structures:

  1. Series
  2. DataFrames
  3. Panels (deprecated in recent pandas versions in favor of multi-index DataFrames)

We will mostly work with DataFrames. A Series can be thought of as a single column of a DataFrame.
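A quick way to see that relationship: selecting a single column of a DataFrame returns a Series. A minimal sketch with made-up data:

```python
import pandas as pd

# A small illustrative DataFrame (hypothetical data)
df = pd.DataFrame({'city': ['Pune', 'Delhi'], 'temp': [31, 35]})

col = df['temp']   # one column of a DataFrame is a Series
print(type(col))   # <class 'pandas.core.series.Series'>
print(type(df))    # <class 'pandas.core.frame.DataFrame'>
```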

Creating Series and DataFrames

import numpy as np
import pandas as pd

my_list = [10, 20, 30]  # any Python list

s1 = pd.Series(my_list)
s2 = pd.Series([1, 3, 5, 6, 8, 11, 18])
s3 = pd.Series(np.arange(100))

df2 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 4)),
                   columns=['a', 'b', 'c', 'd'])

>>> type(my_list)
<class 'list'>
>>> type(s1)
<class 'pandas.core.series.Series'>
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
# Creating a DataFrame from a list of dicts
sales = [{'account': 'Jones LLC', 'Jun': 150, 'Jul': 200, 'Aug': 140},
         {'account': 'Alpha Co', 'Jun': 200, 'Jul': 210, 'Aug': 215},
         {'account': 'Blue Inc', 'Jun': 50, 'Jul': 90, 'Aug': 95}]
df = pd.DataFrame(sales)

# Creating a DataFrame from a dict of lists
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'],
         'Jun': [111, 222, 55],
         'Jul': [222, 210, 90],
         'Aug': [140, 215, 95]}
df = pd.DataFrame.from_dict(sales)

>>> df.dtypes
account    object
Jun         int64
Jul         int64
Aug         int64
dtype: object


data = [[1, 'Pune', 40], [2, 'Mumbai', 60]]  # example rows
df2 = pd.DataFrame(data, columns=['Store ID', 'Location', 'Number of Employees'])
print(df2)
Reading from and writing into CSV files
df = pd.read_csv('my-csv-file.csv')
print(df.head())      # displaying top 5 rows
print(df.head(10))    # displaying top 10 rows
print(df.describe())  # statistical information about df
df.to_csv('new-csv-file.csv', index=False)  # writing back out to a CSV file

df columns can be accessed in two ways:

df['clinic_north']   # bracket notation works for any column name
df.clinic_north      # attribute notation works when the name is a valid identifier

Either way, the type of a single selected column is a Series:

<class 'pandas.core.series.Series'>

Accessing rows

print(df.loc[2]) #Python is zero indexed  
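For completeness, loc selects rows by index label while iloc selects by position; with the default integer index the two coincide. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'month': ['january', 'february', 'march'],
                   'sales': [100, 150, 120]})

row = df.loc[2]        # row whose index label is 2
same_row = df.iloc[2]  # third row by position
print(row['month'])    # march
```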

Creating a sub-DataFrame

# create using specific columns
df2 = df[['clinic_north', 'clinic_south']]
# create using a certain condition
df2 = df[df.month == 'january']
df2 = df[df.age > 27]
df2 = df[df.city != 'Mumbai']  # ('city' column assumed for illustration)
# using multiple conditions; in pandas, | means "or" and & means "and"
df2 = df[(df.age < 30) | (df.name == 'Martha Jones')]  # (parentheses around each condition are required)
df2 = df[df.month.isin(['January', 'February', 'March'])]
# Reset the index so row numbers are consecutive again
df2 = df2.reset_index(drop=True)
Modifying DataFrames
# adding a column with the same value in every row
df['In Stock?'] = True
df['In Stock?'] = 'Yes'
# new column derived from other columns
df['Revenue'] = df.Price - df['Cost to Manufacture']
# changing case
df['Lowercase Name'] = df.Name.apply(str.lower)

Using lambda function to modify

get_last_name = lambda x: x.split(' ')[-1]
df['last_name'] = df.name.apply(get_last_name)

Columns can be renamed using

df.columns = ['ID', 'Title','Category','Year Released','Rating']  
#selective renaming
df.rename(columns={'name': 'First Name','age': 'Age'},inplace=True)  

Using rename with only the columns keyword will create a new DataFrame, leaving your original DataFrame unchanged. That’s why we also passed in the keyword argument inplace=True. Using inplace=True lets us edit the original DataFrame.
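The copy-versus-in-place behavior can be demonstrated with a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [25, 31]})

# Without inplace=True, rename returns a new DataFrame
renamed = df.rename(columns={'name': 'First Name', 'age': 'Age'})
print(list(df.columns))       # ['name', 'age']  (original unchanged)
print(list(renamed.columns))  # ['First Name', 'Age']

# With inplace=True, the original DataFrame itself is modified
df.rename(columns={'name': 'First Name', 'age': 'Age'}, inplace=True)
print(list(df.columns))       # ['First Name', 'Age']
```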

Working with rows and columns

import pandas as pd

orders = pd.read_csv('shoefly.csv')

print(orders.head(5))

orders['shoe_source'] = orders.shoe_material.apply(lambda x:
                            'animal' if x == 'leather' else 'vegan')

orders['salutation'] = orders.apply(lambda row:
                                    'Dear Mr. ' + row['last_name']
                                    if row['gender'] == 'male'
                                    else 'Dear Ms. ' + row['last_name'],
                                    axis=1)

  • mean: average of all values in the column
  • std: standard deviation
  • median: median
  • max: maximum value in the column
  • min: minimum value in the column
  • count: number of values in the column
  • nunique: number of unique values in the column
  • unique: list of unique values in the column
most_expensive = orders.price.max()
num_colors = orders.shoe_color.nunique()
grades = df.groupby('student').grade.mean()
pricey_shoes = orders.groupby('shoe_type').price.max()
# groupby with multiple columns
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()

Sometimes, the operation that you want to perform is more complicated than mean or count. In those cases, you can use the apply method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will always be a Series of values.

cheap_shoes = orders.groupby('shoe_color').price.apply(lambda x: np.percentile(x,25)).reset_index()  
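The percentile-per-group idea can be checked on inline data (hypothetical prices, not the real orders file):

```python
import numpy as np
import pandas as pd

# Made-up order data: four red shoes and one blue shoe
orders = pd.DataFrame({'shoe_color': ['red', 'red', 'red', 'red', 'blue'],
                       'price': [10, 20, 30, 40, 15]})

# 25th percentile of price within each color group
cheap_shoes = (orders.groupby('shoe_color').price
                     .apply(lambda x: np.percentile(x, 25))
                     .reset_index())
print(cheap_shoes)
```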

Pivots using a DataFrame

import pandas as pd

user_visits = pd.read_csv('page_visits.csv')

# count visits per source per month before pivoting
click_source_by_month = user_visits.groupby(['utm_source', 'month']).id.count().reset_index()

click_source_by_month_pivot = click_source_by_month.pivot(index='utm_source',
                                                          columns='month',
                                                          values='id').reset_index()
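The reshaping that pivot performs is easiest to see on inline data (hypothetical visit counts, already aggregated per source and month):

```python
import pandas as pd

# Hypothetical page-visit counts by traffic source and month
clicks = pd.DataFrame({'utm_source': ['email', 'email', 'ad', 'ad'],
                       'month': ['jun', 'jul', 'jun', 'jul'],
                       'id': [10, 12, 5, 7]})

# month values become columns; utm_source becomes one row per source
pivoted = clicks.pivot(index='utm_source', columns='month', values='id').reset_index()
print(pivoted)
```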

Merge two DataFrames

# this works automatically if the two DataFrames share exactly one column name
sales_vs_targets = pd.merge(sales, targets)
# another way to merge
new_df = orders.merge(customers)
# multiple merges can be chained
new_df = orders.merge(customers).merge(products)
# merge when column names do not match: rename first
orders_products = pd.merge(orders, products.rename(columns={'id': 'product_id'}))
# or use left_on/right_on; suffixes disambiguate any remaining duplicate column names
orders_products = pd.merge(orders, products, left_on='product_id', right_on='id',
                           suffixes=['_orders', '_products'])
# outer, left and right joins
pd.merge(company_a, company_b, how='outer')
store_a_b_left = pd.merge(store_a, store_b, how='left')
store_a_b_right = pd.merge(store_a, store_b, how='right')
# stacking DataFrames with the same columns on top of each other
pd.concat([df1, df2])
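How the join types differ is easiest to see in miniature (made-up store inventories; price_a and price_b are illustrative column names):

```python
import pandas as pd

store_a = pd.DataFrame({'item': ['pen', 'book'], 'price_a': [5, 20]})
store_b = pd.DataFrame({'item': ['book', 'lamp'], 'price_b': [18, 40]})

inner = pd.merge(store_a, store_b)               # only 'book' appears in both
outer = pd.merge(store_a, store_b, how='outer')  # all three items, NaN where missing
left  = pd.merge(store_a, store_b, how='left')   # 'pen' and 'book' from store_a

print(len(inner), len(outer), len(left))  # 1 3 2
```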
Query DataFrame

# simple condition (right-hand side assumed for illustration: a 'target' column from the merge above)
crushing_it = sales_vs_targets[sales_vs_targets.revenue > sales_vs_targets.target]
# multiple conditions (comparison columns assumed for illustration)
results = all_data[(all_data.revenue > all_data.target) & (all_data.women > all_data.men)]
