As a Python developer, you will often have to work with large datasets. Python is known for being a language that is well-suited to this task.
With that said, Python itself does not have much in the way of built-in capabilities for data analysis. Instead, data analysts make use of a Python library called pandas.
In this article, you will learn how to import and manipulate large datasets in Python using pandas.
How to Import Pandas
To use pandas in a Python script, you will first need to import it. It is conventional to import pandas under the alias pd, like this:
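The import statement looks like this:

```python
# Import the pandas library under its conventional alias
import pandas as pd
```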
If pandas is not already installed on your machine, you will encounter an error.
Here is how you can install pandas at the command line using the pip package manager:
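The usual install command looks like this (depending on your setup, you may need to invoke pip3 instead of pip):

```shell
# Install pandas using the pip package manager
pip install pandas
```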
How To Find Data Sets
Finding data sets to practice on is an important step in growing your skills as a data scientist. I recommend the UCI Machine Learning Repository, a collection of freely available datasets for practicing machine learning.
We will be using the wine-quality dataset from the UCI Machine Learning Repository in this tutorial. The URL for this data set is listed below:
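The red wine-quality file is hosted at the address below (worth verifying against the UCI repository, since dataset URLs occasionally move):

```
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
```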
How To Import .csv Data Sets
The pandas library includes built-in functions for importing (and saving) data. In the case of the wine-quality dataset, the original file path leads to a .csv file. I will explain how to deal with other file types later on in this article.
We can import .csv files into a Python application using pandas with the read_csv method, which stores the data in the spreadsheet-like DataFrame object.
As an example, here’s how you would import the wine-quality data set using the URL that I introduced earlier:
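A sketch of the import. To keep it runnable offline, this example reads a small semicolon-separated file written locally; with the real dataset you would pass the UCI URL (shown in the comment, and worth verifying) to read_csv instead.

```python
import pandas as pd

# With the real dataset you would pass the URL directly, e.g.:
#   data = pd.read_csv(
#       'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',
#       sep=';',
#   )
# To keep this sketch self-contained, we write and read a tiny local file
# with the same semicolon-separated layout instead.
csv_text = 'fixed acidity;chlorides;quality\n7.4;0.076;5\n7.8;0.098;5\n'
with open('winequality-sample.csv', 'w') as f:
    f.write(csv_text)

data = pd.read_csv('winequality-sample.csv', sep=';')
print(data.shape)  # (2, 3)
```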
Note that the sep=';' argument is required because the values within this .csv file are separated by semicolons rather than the commas that read_csv expects by default.
Now that the .csv file has been imported as a pandas DataFrame, it is very easy to export the data back to a .csv file on your local machine using the to_csv method. This method's argument is the name of the file that you're trying to save.
As an example, here is how you would save the DataFrame as a .csv file:
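A minimal sketch, using a small stand-in DataFrame and a hypothetical filename (wine_data.csv), since the original article's filename was not preserved here:

```python
import pandas as pd

# A small stand-in; in the article this would be the imported wine-quality data
data = pd.DataFrame({'chlorides': [0.076, 0.098], 'quality': [5, 5]})

# The argument to to_csv is the filename to save under;
# index=False skips writing the row index as an extra column
data.to_csv('wine_data.csv', index=False)
```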
If you look in the directory where you ran this Python script, you should now see the newly created .csv file.
How To Import .xlsx and .json Data Sets
You will often run into data sets that are not saved as .csv files. Because of this, I wanted to spend a moment explaining how to import files saved in the .xlsx and .json file types.
First, let’s start with .xlsx files. These are typically files that were generated and saved using Microsoft Excel. Just as we used read_csv and to_csv to read and write .csv files, we use the read_excel and to_excel methods to read and write .xlsx files:
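A sketch of the round trip, assuming an Excel engine such as openpyxl is installed (pandas needs one to handle .xlsx files); the filename is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write the DataFrame out to an .xlsx file...
df.to_excel('my_data.xlsx', index=False)

# ...and read it back into a new DataFrame
df2 = pd.read_excel('my_data.xlsx')
```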
Files with the .json file extension are handled similarly, using the read_json and to_json methods:
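A sketch of the same round trip for JSON (no extra dependency needed; the filename is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write the DataFrame out as JSON...
df.to_json('my_data.json')

# ...and read it back in
df2 = pd.read_json('my_data.json')
```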
In the next several sections of this tutorial, you will learn how to manipulate the data imported from the .csv file earlier on.
How To Generate a Pandas Series From A Pandas DataFrame
We’ve already referenced the pandas DataFrame object several times; it is one of the cornerstones of the pandas library. Pandas includes another object called a Series, which is quite similar to a NumPy array.
Pandas includes some very simple syntax that allows you to easily turn a column from a pandas DataFrame into a standalone pandas Series. You simply pass in the name of the column in square brackets at the end of a DataFrame’s variable name.
As an example, here is how you would create a Series called my_new_series that is equal to the chlorides column of the wine-quality dataset:
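A minimal sketch, using a small stand-in for the imported wine-quality data:

```python
import pandas as pd

# Stand-in for the imported wine-quality DataFrame
data = pd.DataFrame({'chlorides': [0.076, 0.098, 0.092], 'quality': [5, 5, 6]})

# Square-bracket indexing with a column name returns that column as a Series
my_new_series = data['chlorides']
```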
Now that this new pandas Series has been created, you can manipulate the series using the various pandas Series methods included in the library.
How To Work With Missing Data in Pandas
One of the prime frustrations of data scientists is the problem of missing data. Fortunately, pandas makes it very easy to handle missing data when they are contained in pandas DataFrames.
The dropna method can be used to drop every row that contains missing values. Here’s how you would do this with the data DataFrame we created earlier in this lesson:
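Since the imported data may not actually contain gaps, this sketch uses a small stand-in DataFrame with one missing value:

```python
import pandas as pd
import numpy as np

# Stand-in DataFrame with a missing value in the second row
data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092], 'quality': [5, 5, 6]})

# dropna (with its default axis=0) removes every row containing a missing value
cleaned = data.dropna()
```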
If you want to drop columns (instead of rows), you can pass in axis=1 as an argument of the dropna method, like this:
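A sketch on the same kind of stand-in DataFrame (only the chlorides column contains a gap, so it is the one dropped):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092], 'quality': [5, 5, 6]})

# axis=1 drops columns (rather than rows) that contain missing values
cleaned = data.dropna(axis=1)
```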
You can also use the fillna method to replace missing values with a predetermined value. There are two main approaches to using the fillna method.
The first approach is to replace missing values with a static value, like 0. Here’s how you would do this with our data DataFrame:
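A sketch with a stand-in DataFrame:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092], 'quality': [5, 5, 6]})

# Replace every missing value with the static value 0
filled = data.fillna(0)
```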
The second approach is more complex. It involves replacing missing data with the average value of either:
- The entire DataFrame
- A specific column of the DataFrame
Here’s how you would replace data's missing values with the average value of the entire DataFrame:
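One way to sketch this: stack() flattens the DataFrame into a single Series (dropping missing values along the way), so its mean() is the average of every value in the DataFrame.

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092],
                     'quality': [5.0, 5.0, 6.0]})

# Average over every value in the DataFrame, then fill the gaps with it
overall_mean = data.stack().mean()
filled = data.fillna(overall_mean)
```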
If you wanted to replace the missing values in data with the average values of a specific column (say, volatile acidity), then you could do so with the following code:
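A sketch with a stand-in DataFrame, filling only the volatile acidity column with that column's own mean:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'volatile acidity': [0.70, np.nan, 0.76],
                     'quality': [5, 5, 6]})

# Fill missing entries in one column with that column's average
col_mean = data['volatile acidity'].mean()
data['volatile acidity'] = data['volatile acidity'].fillna(col_mean)
```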
Common Operations on Pandas DataFrames
Now that you have imported data and dealt with its missing data, let’s discuss a few ways to manipulate data stored in pandas DataFrames.
First, you may want to sort a pandas DataFrame based on its values in a particular column. The sort_values method is ideal for this. For example, we can sort our data DataFrame based on the column density as follows:
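A sketch with a stand-in DataFrame:

```python
import pandas as pd

data = pd.DataFrame({'density': [0.9978, 0.9968, 0.9970], 'quality': [5, 5, 6]})

# Sort rows in ascending order of the 'density' column
sorted_data = data.sort_values('density')
```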
There are also several methods that apply specifically to pandas Series (which are just columns of a DataFrame, remember):
unique: returns the distinct entries of a pandas Series, with duplicates removed
nunique: counts the number of unique entries in a pandas Series
value_counts: counts the number of times each value occurs in a pandas Series
apply: allows you to apply an outside function to every element of a pandas Series
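The four methods above can be sketched on a small stand-in Series:

```python
import pandas as pd

s = pd.Series([5, 5, 6, 7, 6])

unique_values = s.unique()          # array of distinct entries: [5, 6, 7]
n_unique = s.nunique()              # number of distinct entries: 3
counts = s.value_counts()           # how often each entry occurs
doubled = s.apply(lambda x: x * 2)  # apply a function to every element
```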
In this tutorial, you learned how to import and manipulate large datasets in Python using pandas. Please feel free to refer back to this tutorial if you ever get stuck on large datasets in the future.
This article was written by Nick McCullum, who teaches people how to code on his website.