As a Python developer, you will often have to work with large datasets. Python is known for being a language that is well-suited to this task.
With that said, Python itself does not have much in the way of built-in capabilities for data analysis. Instead, data analysts make use of a Python library called pandas.
In this article, you will learn how to import and manipulate large datasets in Python using pandas.
How to Import Pandas
To use pandas in a Python script, you will first need to import it. It is conventional to import pandas under the alias pd, like this:
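The import statement looks like this:

```python
# Import the pandas library under its conventional alias
import pandas as pd
```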
If pandas is not already installed on your machine, you will encounter an error.
Here is how you can install pandas at the command line using the pip package manager:
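The usual install command looks like this (depending on your setup, you may need to invoke pip3 instead of pip):

```shell
# Install pandas using the pip package manager
pip install pandas
```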
How To Find Data Sets
Finding data sets to practice on is an important step in growing your skills as a data scientist. I recommend the UCI Machine Learning Repository, a collection of freely available datasets for practicing machine learning.
We will be using the wine-quality dataset from the UCI Machine Learning Repository in this tutorial. The URL for this data set is listed below:
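The red wine-quality file is hosted at the address below (worth verifying against the UCI repository, since dataset URLs occasionally move):

```
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
```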
How To Import .csv Data Sets
The pandas library includes built-in functions for importing (and saving) data. In the case of the wine-quality dataset, the original file path leads to a .csv file. I will explain how to deal with other file types later on in this article.
We can import .csv files into a Python application using pandas with the read_csv method, which stores the data in the spreadsheet-like DataFrame object.
As an example, here’s how you would import the wine-quality data set using the URL that I introduced earlier:
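A sketch of the import. To keep it runnable offline, this example reads a small semicolon-separated file written locally; with the real dataset you would pass the UCI URL (shown in the comment, and worth verifying) to read_csv instead.

```python
import pandas as pd

# With the real dataset you would pass the URL directly, e.g.:
#   data = pd.read_csv(
#       'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',
#       sep=';',
#   )
# To keep this sketch self-contained, we write and read a tiny local file
# with the same semicolon-separated layout instead.
csv_text = 'fixed acidity;chlorides;quality\n7.4;0.076;5\n7.8;0.098;5\n'
with open('winequality-sample.csv', 'w') as f:
    f.write(csv_text)

data = pd.read_csv('winequality-sample.csv', sep=';')
print(data.shape)  # (2, 3)
```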
Note that the sep=';' argument is required because the values within this .csv file are separated by semicolons rather than the commas that read_csv expects by default.
Now that the .csv file has been imported as a pandas DataFrame, it is very easy to export the data back to a .csv file on your local machine using the to_csv method. This method's argument is the name of the file that you're trying to save.
As an example, here is how you would save the DataFrame as a .csv file:
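A minimal sketch, using a small stand-in DataFrame and a hypothetical filename (wine_data.csv), since the original article's filename was not preserved here:

```python
import pandas as pd

# A small stand-in; in the article this would be the imported wine-quality data
data = pd.DataFrame({'chlorides': [0.076, 0.098], 'quality': [5, 5]})

# The argument to to_csv is the filename to save under;
# index=False skips writing the row index as an extra column
data.to_csv('wine_data.csv', index=False)
```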
If you look in the directory where you ran this Python script, you should now see the newly created .csv file.
How To Import .xlsx and .json Data Sets
You will often run into data sets that are not saved as .csv files. Because of this, I wanted to spend a moment explaining how to import files saved in the .xlsx and .json file types.
First, let’s start with .xlsx files. These are typically files that were generated and saved using Microsoft Excel. Just as we used read_csv and to_csv to read and write .csv files, we use the read_excel and to_excel methods to read and write .xlsx files:
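A sketch of the round trip, assuming an Excel engine such as openpyxl is installed (pandas needs one to handle .xlsx files); the filename is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write the DataFrame out to an .xlsx file...
df.to_excel('my_data.xlsx', index=False)

# ...and read it back into a new DataFrame
df2 = pd.read_excel('my_data.xlsx')
```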
Files with the .json file extension are handled similarly, using the read_json and to_json methods:
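A sketch of the same round trip for JSON (no extra dependency needed; the filename is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write the DataFrame out as JSON...
df.to_json('my_data.json')

# ...and read it back in
df2 = pd.read_json('my_data.json')
```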
In the next several sections of this tutorial, you will learn how to manipulate the data imported from the .csv file earlier on.
How To Generate a Pandas Series From A Pandas DataFrame
We’ve already referenced the pandas DataFrame object several times; it is one of the cornerstones of the pandas library. Pandas includes another object called a Series, which is quite similar to a NumPy array.
Pandas includes some very simple syntax that allows you to easily turn a column from a pandas DataFrame into a standalone pandas Series. You simply pass in the name of the column in square brackets at the end of a DataFrame’s variable name.
As an example, here is how you would create a Series called my_new_series that is equal to the chlorides column of the wine-quality dataset:
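A minimal sketch, using a small stand-in for the imported wine-quality data:

```python
import pandas as pd

# Stand-in for the imported wine-quality DataFrame
data = pd.DataFrame({'chlorides': [0.076, 0.098, 0.092], 'quality': [5, 5, 6]})

# Square-bracket indexing with a column name returns that column as a Series
my_new_series = data['chlorides']
```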
Now that this new pandas Series has been created, you can manipulate the series using the various pandas Series methods included in the library.
How To Work With Missing Data in Pandas
One of the prime frustrations of data scientists is the problem of missing data. Fortunately, pandas makes it very easy to handle missing data when they are contained in pandas DataFrames.
The dropna method can be used to drop every row that contains missing values. Here’s how you would do this with the data DataFrame we created earlier in this lesson:
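Since the imported data may not actually contain gaps, this sketch uses a small stand-in DataFrame with one missing value:

```python
import pandas as pd
import numpy as np

# Stand-in DataFrame with a missing value in the second row
data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092], 'quality': [5, 5, 6]})

# dropna (with its default axis=0) removes every row containing a missing value
cleaned = data.dropna()
```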
If you want to drop columns (instead of rows), you can pass in axis=1 as an argument of the dropna method, like this:
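A sketch on the same kind of stand-in DataFrame (only the chlorides column contains a gap, so it is the one dropped):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092], 'quality': [5, 5, 6]})

# axis=1 drops columns (rather than rows) that contain missing values
cleaned = data.dropna(axis=1)
```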
You can also use the fillna method to replace missing values with a predetermined value. There are two main approaches to using the fillna method.
The first approach is to replace missing values with a static value, like 0. Here’s how you would do this with our data DataFrame:
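A sketch with a stand-in DataFrame:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092], 'quality': [5, 5, 6]})

# Replace every missing value with the static value 0
filled = data.fillna(0)
```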
The second approach is more complex. It involves replacing missing data with the average value of either:
- The entire DataFrame
- A specific column of the DataFrame
Here’s how you would replace data's missing values with the average value of the entire DataFrame:
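One way to sketch this: stack() flattens the DataFrame into a single Series (dropping missing values along the way), so its mean() is the average of every value in the DataFrame.

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092],
                     'quality': [5.0, 5.0, 6.0]})

# Average over every value in the DataFrame, then fill the gaps with it
overall_mean = data.stack().mean()
filled = data.fillna(overall_mean)
```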
If you wanted to replace the missing values in data with the average values of a specific column (say, volatile acidity), then you could do so with the following code:
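A sketch with a stand-in DataFrame, filling only the volatile acidity column with that column's own mean:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'volatile acidity': [0.70, np.nan, 0.76],
                     'quality': [5, 5, 6]})

# Fill missing entries in one column with that column's average
col_mean = data['volatile acidity'].mean()
data['volatile acidity'] = data['volatile acidity'].fillna(col_mean)
```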
Common Operations on Pandas DataFrames
Now that you have imported data and dealt with its missing data, let’s discuss a few ways to manipulate data stored in pandas DataFrames.
First, you may want to sort a pandas DataFrame based on its values in a particular column. The sort_values method is ideal for this. For example, we can sort our data DataFrame based on the column density as follows:
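A sketch with a stand-in DataFrame:

```python
import pandas as pd

data = pd.DataFrame({'density': [0.9978, 0.9968, 0.9970], 'quality': [5, 5, 6]})

# Sort rows in ascending order of the 'density' column
sorted_data = data.sort_values('density')
```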
There are also several methods that apply specifically to pandas Series (which are just columns of a DataFrame, remember):
unique: returns the distinct entries of a pandas Series, with duplicates removed
nunique: counts the number of unique entries in a pandas Series
value_counts: counts the number of times each value occurs in a pandas Series
apply: allows you to apply an outside function to every element of a pandas Series
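The four methods above can be sketched on a small stand-in Series:

```python
import pandas as pd

s = pd.Series([5, 5, 6, 7, 6])

unique_values = s.unique()          # array of distinct entries: [5, 6, 7]
n_unique = s.nunique()              # number of distinct entries: 3
counts = s.value_counts()           # how often each entry occurs
doubled = s.apply(lambda x: x * 2)  # apply a function to every element
```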
In this tutorial, you learned how to import and manipulate large datasets in Python using pandas. Please feel free to refer back to this tutorial if you ever get stuck on large datasets in the future.
This article was written by Nick McCullum, who teaches people how to code on his website.