Working With Messy Data Using Pandas in Python



Hello beautiful folks! ? I’m working on a project and needed to wrangle some messy data using Pandas. A new project means new ideas for articles, so here I am with an article about how to work with messy data using Pandas in Python. I will be covering only some of the basics that you will need to know for most of your data analysis projects. So without wasting any time, let’s begin!


My dataset is a CSV file containing a seasonal game schedule for a particular basketball team. The main goal is to process this data using Pandas, replace certain records, remove certain rows, and sum certain columns. In short, your typical data processing stuff. You can get the input file from here. This is what the file looks like:
Input Data

Reading the data into a Pandas DataFrame

This data is in a CSV file named input_file.csv. We can use the read_csv method of the Pandas library to load it in. Normally you’d read a CSV like this:

import pandas as pd
schedule_dataframe = pd.read_csv('input_file.csv')

This is the output when we try printing the DataFrame:

>>> print(schedule_dataframe)
                                   Tue Nov 5    234  ...    .1 Unnamed: 11
0                                  Sat Nov 9  235.0  ...               NaN
1                                 Wed Nov 13  240.0  ...               NaN
2                                 Sat Nov 16  232.0  ...               NaN
3                                 Sun Nov 24  224.0  ...               NaN
4                                 Wed Nov 27  251.0  ...               NaN
5                                 Fri Nov 29  244.0  ...               NaN
6                                 Sat Nov 30  258.0  ...               NaN
7                                  Wed Dec 4  259.0  ...               NaN
8                                  Sat Dec 7  257.0  ...               NaN
9                                 Wed Dec 11  261.0  ...               NaN
10                                Sat Dec 14  271.0  ...               NaN
11                                Sun Dec 22  284.0  ...               NaN
12                                Sat Dec 28  283.0  ...               NaN
13                                 Fri Jan 3  276.0  ...               NaN
14                                 Thu Jan 9  277.0  ...   0-1         NaN
15                                Sat Jan 11  278.0  ...   0-2         NaN
16                                Wed Jan 15  279.0  ...   0-3         NaN
17                                Sat Jan 18  287.0  ...   1-3         NaN
18                                Wed Jan 22  284.0  ...   2-3         NaN
19                                Sat Jan 25  273.0  ...   3-3         NaN
20                                Thu Jan 30  269.0  ...   3-4         NaN
21                                 Sat Feb 1  267.0  ...   3-5         NaN
22                                 Wed Feb 5  276.0  ...   4-5         NaN
23                                 Sat Feb 8  262.0  ...   4-6         NaN
24                                Sat Feb 15  265.0  ...   5-6         NaN
25                                Thu Feb 20  272.0  ...   5-7         NaN
26                                Thu Feb 27  275.0  ...   5-8         NaN
27                                Sat Feb 29  275.0  ...   5-9         NaN
28                                 Thu Mar 5  274.0  ...  5-10         NaN
29                                 Sat Mar 7  268.0  ...  6-10         NaN
30         Home games played at Titan Gym (4    0.0  ...   NaN         NaN
31  Home court advantage: 2.1 (337th in D-I)    NaN  ...   NaN         NaN

[32 rows x 12 columns]

There are a couple of important things to note here. Pandas doesn’t assign a header to this DataFrame. It tries to detect the header automatically, but it makes a judgment call that there is no header in the CSV. In such cases, we can supply a custom header.
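To see this header inference in action, here is a minimal sketch using a tiny in-memory CSV (hypothetical values, not the real schedule file):

```python
import io
import pandas as pd

# A small CSV with no header row, mimicking the schedule file (made-up data).
csv_text = "Tue Nov 5,234,W\nSat Nov 9,235,L\n"

# By default, read_csv treats the first row as the header...
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())    # ['Tue Nov 5', '234', 'W']

# ...whereas header=None tells Pandas the file has no header at all,
# so it falls back to numbered columns.
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(df_no_header.columns.tolist())  # [0, 1, 2]
```

This is why the first game, "Tue Nov 5", ended up as a column name in the output above: Pandas promoted the first data row to a header.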

We can assign custom column names (a header) as part of the read_csv method. There are 12 columns in my CSV, so I’ll provide a list of 12 names:

schedule_dataframe = pd.read_csv('input_file.csv',
                                names=['date', 'v', 'w', 'competitor_name', 
                                        'result', 'score', 'x', 'y', 'location', 
                                        'record', 'conf_record', 'z'])

The one thing you need to keep in mind is that the names have to be unique. I don’t care about the v, w, x, y, and z columns. I don’t know exactly what data they contain, which is why their names aren’t meaningful.

Now if we try to print this DataFrame, we can see a header in the output:

Unprocessed Data

Replacing data

For my project, I needed to remove the rows whose w column contained “NR”. Pandas provides a method to drop rows in which certain columns contain NaN values. However, NR is not equal to NaN, so first we need to replace every occurrence of NR with NaN. We can easily do that by using the replace method of the DataFrame:

import numpy as np
schedule_dataframe['w'] = schedule_dataframe['w'].replace('NR', np.nan)

Now we can use the dropna method of the DataFrame to easily remove the rows whose w column contains NaN:

schedule_dataframe = schedule_dataframe.dropna(axis=0, subset=['w'])

The axis=0 tells Pandas to drop the row instead of the column. If we try printing the DataFrame now, the output will look something like this:

Processed Data
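The replace-then-dropna pattern is easy to check on a tiny stand-in DataFrame (made-up values, just three rows):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one "NR" entry in the w column.
df = pd.DataFrame({'w': ['13', 'NR', '246'], 'result': ['W', 'L', 'W']})

# Replace NR with NaN, then drop the rows where w is NaN.
df['w'] = df['w'].replace('NR', np.nan)
cleaned = df.dropna(axis=0, subset=['w'])

print(cleaned.index.tolist())  # [0, 2] -- row 1 is gone, index not renumbered
```

Note that the surviving rows keep their original index labels, which is exactly the gap that reset_index fills in the next step.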

While removing the rows, Pandas doesn’t update the index column. I don’t like that, so while we’re at it, let’s reset the index column as well so that we have a continuous range of index values:

schedule_dataframe = schedule_dataframe.reset_index(drop=True)

The output should now contain updated index values:

Processed Data

Removing a column from the DataFrame

I don’t want the z column in my DataFrame since it contains only NaN values, so let’s drop that column from my DataFrame:

schedule_dataframe = schedule_dataframe.drop(['z'], axis=1)

Sweet! Now the output looks much cleaner:

Processed Data

During this step, I went ahead and removed the last row as well. I didn’t need it.
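The exact call used to drop that last row isn’t shown, but one way to do it is to slice everything except the final position with iloc (sketched here on a hypothetical two-row DataFrame):

```python
import pandas as pd

# Hypothetical stand-in: one real game plus a trailing note row.
df = pd.DataFrame({'date': ['Sat Mar 7', 'Home games played at Titan Gym'],
                   'w': ['268', None]})

# iloc[:-1] keeps every row except the last; reset_index renumbers afterwards.
df = df.iloc[:-1].reset_index(drop=True)
print(df['date'].tolist())  # ['Sat Mar 7']
```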

Converting a column into a NumPy array

Now our w column only contains numeric values (stored as strings). Let’s say I wanted to compare that whole sorted column with some other list. How can we do that? As it turns out, Pandas makes it incredibly easy for us to do exactly that. We just need to use the to_numpy method.

Here is how you do it:

scores = schedule_dataframe['w'].to_numpy()

# array(['13', '43', '246', '108', '340', '305', '151', '120', '183', '230',
#       '74', '209', '78', '208', '211', '106', '297', '225', '233', '315',
#       '106', '211', '225', '315', '160', '208', '233', '160', '297'],
#      dtype=object)

Now you can sort them and do whatever you want with them.
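One caveat when sorting: to_numpy returned strings (note the dtype=object above), so a plain sort would be lexicographic ('108' would come before '13'). A quick sketch, using an abbreviated sample of the values:

```python
import numpy as np

# Abbreviated sample of the w column; the values are strings, not ints.
scores = np.array(['13', '43', '246', '108'], dtype=object)

# Cast to integers first, then sort numerically.
sorted_scores = np.sort(scores.astype(int))
print(sorted_scores.tolist())  # [13, 43, 108, 246]
```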


This is where this short tutorial comes to an end. Pandas is really powerful and this tutorial doesn’t do it justice. Try it on any data processing or data analysis tasks and let me know how it goes. There’s a reason it’s an integral part of any data analyst’s tool chest. I hope you learned something new today. I’ll see you in the next article. Stay safe and take care! ? ❤️