Pandas | Dataframes

Posted by Sharonda Pettiett on September 3, 2019
Published on Medium Article
Pandas - DataFrames
Pandas is one of the most popular Python libraries for Data Science and Analytics.

Why? Because pandas helps you to manage two-dimensional data tables in Python. In this pandas tutorial series about dataframes, I’ll show you the some important things that you have to know as a Data Scientist.

We will start from the basics!

Import Libraries
# Import Pandas Library
import pandas as pd

# Import Other Libraries
import numpy as np
import matplotlib as plt
%matplotlib inline
How to open data files in pandas
You might have your data in .csv files, excel files, from a URL or SQL tables. In all cases, if you want to analyze that data using pandas, the first step will be to read it into a data structure that’s compatible with pandas like dataframe.

Read in data from .csv file into Dataframe Structure

# EXAMPLE_
# df = pd.read_csv("./filename.csv")
Read in data from URL into Dataframe Structure

data_url = 'http://bit.ly/2cLzoxH'
df = pd.read_csv(data_url)   # Read data contained in URL into dataframe (df)
Preview Data in the dataframe.
Sometimes, it makes sense not to print the whole dataframe and flood your screen with lots of data. When a few lines is enough, you can print only the first 5 rows or last 5 rows – by typing the following:

df.head() # Preview the first (5) rows in the dataframe = default
country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106
df.tail() # Preview the last (5) rows in the dataframe = default
country	year	pop	continent	lifeExp	gdpPercap
1699	Zimbabwe	1987	9216418.0	Africa	62.351	706.157306
1700	Zimbabwe	1992	10704340.0	Africa	60.377	693.420786
1701	Zimbabwe	1997	11404948.0	Africa	46.809	792.449960
1702	Zimbabwe	2002	11926563.0	Africa	39.989	672.038623
1703	Zimbabwe	2007	12311143.0	Africa	43.487	469.709298
df.head(10) # Preview the first (10 or specify any #) rows in the dataframe
country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106
5	Afghanistan	1977	14880372.0	Asia	38.438	786.113360
6	Afghanistan	1982	12881816.0	Asia	39.854	978.011439
7	Afghanistan	1987	13867957.0	Asia	40.822	852.395945
8	Afghanistan	1992	16317921.0	Asia	41.674	649.341395
9	Afghanistan	1997	22227415.0	Asia	41.763	635.341351
What is the shape of the data in the dataframe. (Number of Rows? Number of Columns?)

df.shape  # 1704 -rows and 6-columns
(1704, 6)
Select specific columns of your dataframe
Let’s say you want to print the "country", "year", and "pop" columns only. You should use this syntax:

df[["country", "year", "pop"]].head(10)
country	year	pop
0	Afghanistan	1952	8425333.0
1	Afghanistan	1957	9240934.0
2	Afghanistan	1962	10267083.0
3	Afghanistan	1967	11537966.0
4	Afghanistan	1972	13079460.0
5	Afghanistan	1977	14880372.0
6	Afghanistan	1982	12881816.0
7	Afghanistan	1987	13867957.0
8	Afghanistan	1992	16317921.0
9	Afghanistan	1997	22227415.0
NOTE: Any guesses why we have to use double bracket frames? It seems a bit over-complicated, I admit, but maybe this will help you remember: the outer bracket frames tell pandas that you want to select columns, and the inner brackets are for the list (remember? Python lists go between bracket frames) of the column names.

You can get a Series (Column) using this syntax (and selecting only one column):
df["country"].head(25)
0     Afghanistan
1     Afghanistan
2     Afghanistan
3     Afghanistan
4     Afghanistan
5     Afghanistan
6     Afghanistan
7     Afghanistan
8     Afghanistan
9     Afghanistan
10    Afghanistan
11    Afghanistan
12        Albania
13        Albania
14        Albania
15        Albania
16        Albania
17        Albania
18        Albania
19        Albania
20        Albania
21        Albania
22        Albania
23        Albania
24        Algeria
Name: country, dtype: object
# View unique values for Series (Column) values:
df["country"].unique()
array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo Dem. Rep.', 'Congo Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea Dem. Rep.',
       'Korea Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
       'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',
       'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',
       'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
       'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',
       'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',
       'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',
       'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',
       'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
       'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia',
       'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay',
       'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Yemen Rep.',
       'Zambia', 'Zimbabwe'], dtype=object)
Filter for specific values in your dataframe
Let’s say, you want to see a list of only the users who came from the ‘United Kingdom’. In this case you have to filter for the ‘United Kingdom’ value in the ‘country’ column:

df_uk_rows = df[df["country"]=="United Kingdom"]
df_uk_rows
country	year	pop	continent	lifeExp	gdpPercap
1596	United Kingdom	1952	50430000.0	Europe	69.180	9979.508487
1597	United Kingdom	1957	51430000.0	Europe	70.420	11283.177950
1598	United Kingdom	1962	53292000.0	Europe	70.760	12477.177070
1599	United Kingdom	1967	54959000.0	Europe	71.360	14142.850890
1600	United Kingdom	1972	56079000.0	Europe	72.010	15895.116410
1601	United Kingdom	1977	56179000.0	Europe	72.760	17428.748460
1602	United Kingdom	1982	56339704.0	Europe	74.040	18232.424520
1603	United Kingdom	1987	56981620.0	Europe	75.007	21664.787670
1604	United Kingdom	1992	57866349.0	Europe	76.420	22705.092540
1605	United Kingdom	1997	58808266.0	Europe	77.218	26074.531360
1606	United Kingdom	2002	59912431.0	Europe	78.471	29478.999190
1607	United Kingdom	2007	60776238.0	Europe	79.425	33203.261280
Functions can be used after each other
It’s very important to understand that pandas’s logic is very linear. So if you apply a function, you can always apply another one on it. In this case, the input of the latter function will always be the output of the previous function.

df.count().head()
country      1704
year         1704
pop          1704
continent    1704
lifeExp      1704
dtype: int64
df.mean().head()
year         1.979500e+03
pop          2.960121e+07
lifeExp      5.947444e+01
gdpPercap    7.215327e+03
dtype: float64
df["pop"].max()
1318683096.0
df["pop"].count()
1704
df[["lifeExp", "pop", "gdpPercap"]].max()
lifeExp      8.260300e+01
pop          1.318683e+09
gdpPercap    1.135231e+05
dtype: float64
df[["country","lifeExp", "pop", "gdpPercap"]].head()
country	lifeExp	pop	gdpPercap
0	Afghanistan	28.801	8425333.0	779.445314
1	Afghanistan	30.332	9240934.0	820.853030
2	Afghanistan	31.997	10267083.0	853.100710
3	Afghanistan	34.020	11537966.0	836.197138
4	Afghanistan	36.088	13079460.0	739.981106
Grouping in pandas
# Pandas .groupby in action
df.groupby("country").mean()
year	pop	lifeExp	gdpPercap
country				
Afghanistan	1979.5	1.582372e+07	37.478833	802.674598
Albania	1979.5	2.580249e+06	68.432917	3255.366633
Algeria	1979.5	1.987541e+07	59.030167	4426.025973
Angola	1979.5	7.309390e+06	37.883500	3607.100529
Argentina	1979.5	2.860224e+07	69.060417	8955.553783
Australia	1979.5	1.464931e+07	74.662917	19980.595634
Austria	1979.5	7.583298e+06	73.103250	20411.916279
Bahrain	1979.5	3.739132e+05	65.605667	18077.663945
Bangladesh	1979.5	9.075540e+07	49.834083	817.558818
Belgium	1979.5	9.725119e+06	73.641750	19900.758072
Benin	1979.5	4.017497e+06	48.779917	1155.395107
Bolivia	1979.5	5.610395e+06	52.504583	2961.228754
Bosnia and Herzegovina	1979.5	3.816525e+06	67.707833	3484.779069
Botswana	1979.5	9.711862e+05	54.597500	5031.503557
Brazil	1979.5	1.223121e+08	62.239500	5829.316653
Bulgaria	1979.5	8.182985e+06	69.743750	6384.055172
Burkina Faso	1979.5	7.548677e+06	44.694000	843.990665
Burundi	1979.5	4.651608e+06	44.817333	471.662990
Cambodia	1979.5	8.510431e+06	47.902750	675.367824
Cameroon	1979.5	9.816648e+06	48.128500	1774.634222
Canada	1979.5	2.446297e+07	74.902750	22410.746340
Central African Republic	1979.5	2.560963e+06	43.866917	958.784697
Chad	1979.5	5.329256e+06	46.773583	1165.453674
Chile	1979.5	1.120573e+07	67.430917	6703.289147
China	1979.5	9.581601e+08	61.785140	1488.307694
Colombia	1979.5	2.725610e+07	63.897750	4195.342920
Comoros	1979.5	3.616839e+05	52.381750	1314.380339
Congo Dem. Rep.	1979.5	3.268166e+07	44.543750	648.342646
Congo Rep.	1979.5	1.923209e+06	52.501917	3312.788215
Costa Rica	1979.5	2.400008e+06	70.181417	5448.610779
...	...	...	...	...
Sierra Leone	1979.5	3.605425e+06	36.769167	1072.819493
Singapore	1979.5	2.667817e+06	71.220250	17425.382267
Slovak Republic	1979.5	4.774507e+06	70.696083	10415.530689
Slovenia	1979.5	1.794381e+06	71.600750	14074.582109
Somalia	1979.5	5.197198e+06	40.988667	1140.793252
South Africa	1979.5	2.992835e+07	53.993167	7247.431074
Spain	1979.5	3.585180e+07	74.203417	14029.826479
Sri Lanka	1979.5	1.454583e+07	66.526083	1854.731119
Sudan	1979.5	2.156033e+07	48.400500	1835.010430
Swaziland	1979.5	6.790520e+05	49.002417	3163.352358
Sweden	1979.5	8.220029e+06	76.177000	19943.126104
Switzerland	1979.5	6.384293e+06	75.565083	27074.334405
Syria	1979.5	9.865379e+06	61.346167	3009.287981
Taiwan	1979.5	1.687472e+07	70.336667	10224.807181
Tanzania	1979.5	2.049950e+07	47.912333	849.281271
Thailand	1979.5	4.496163e+07	62.200250	3045.966474
Togo	1979.5	2.895964e+06	51.498750	1153.820116
Trinidad and Tobago	1979.5	1.006470e+06	66.828000	7866.871946
Tunisia	1979.5	6.686770e+06	60.721000	3477.210351
Turkey	1979.5	4.590901e+07	59.696417	4469.453380
Uganda	1979.5	1.436105e+07	47.618833	810.383788
United Kingdom	1979.5	5.608780e+07	73.922583	19380.472986
United States	1979.5	2.282112e+08	73.478500	26261.151347
Uruguay	1979.5	2.912487e+06	70.781583	7100.133176
Venezuela	1979.5	1.512980e+07	66.580667	10088.516252
Vietnam	1979.5	5.456857e+07	57.479500	1017.712615
West Bank and Gaza	1979.5	1.848606e+06	60.328667	3759.996781
Yemen Rep.	1979.5	1.084319e+07	46.780417	1569.274672
Zambia	1979.5	6.353805e+06	45.996333	1358.199409
Zimbabwe	1979.5	7.641966e+06	52.663167	635.858042
142 rows × 4 columns

df_uk_rows.groupby("country").mean()
year	pop	lifeExp	gdpPercap
country				
United Kingdom	1979.5	5.608780e+07	73.922583	19380.472986
Note: You can either ignore the year column, or you can remove it afterwards by using this syntax:

df_uk_rows.groupby("country").mean()[["pop","lifeExp","gdpPercap"]]
pop	lifeExp	gdpPercap
country			
United Kingdom	5.608780e+07	73.922583	19380.472986
df.groupby("continent").count()
country	year	pop	lifeExp	gdpPercap
continent					
Africa	624	624	624	624	624
Americas	300	300	300	300	300
Asia	396	396	396	396	396
Europe	360	360	360	360	360
Oceania	24	24	24	24	24
df[df["continent"] == "Oceania"].groupby("country").mean()
year	pop	lifeExp	gdpPercap
country				
Australia	1979.5	1.464931e+07	74.662917	19980.595634
New Zealand	1979.5	3.100032e+06	73.989500	17262.622813