How to Take a Random Sample from a DataFrame in Python

Working with large datasets can be challenging and time-consuming, especially when you are trying to gain insights from the data. Random sampling is one of the most common techniques data scientists and analysts use to simplify this process.

It reduces the dataset to a manageable size while, for a large enough sample, keeping it broadly representative of the full data. In this tutorial, we’ll show you how to take a random sample from a DataFrame using Python, specifically the Python Data Analysis Library, pandas.

Prerequisites

Before we begin, you must have Python installed on your computer. You also need the pandas library. If you don’t have pandas, you can install it by running pip install pandas in your terminal.

Step 1: Importing the necessary library

We will be using pandas to work with DataFrames, so we need to import it first.
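
A minimal version of the import looks like this. The alias pd is only a widely used convention, not a requirement; any name would work.

```python
# Import pandas under its conventional short alias
import pandas as pd

# Optional: confirm which version is installed
print(pd.__version__)
```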

Step 2: Creating a DataFrame

Next, let’s create a simple DataFrame for the demonstration. A DataFrame is a two-dimensional labeled data structure with columns potentially of different types. DataFrames are generally the most commonly used pandas object.
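
The sketch below builds one plausible demonstration DataFrame. The column names A, B, and C match the sample output shown later in this tutorial; the exact values (1 to 100, 101 to 200, and 201 to 300) are an assumption inferred from that output, and the variable name df is illustrative.

```python
import pandas as pd

# Assumed demo data: 100 rows and three integer columns
df = pd.DataFrame({
    "A": range(1, 101),    # 1..100
    "B": range(101, 201),  # 101..200
    "C": range(201, 301),  # 201..300
})

# Peek at the first few rows and the overall size
print(df.head())
print(df.shape)
```

With the default integer index (0 to 99), each row label is one less than its value in column A.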

Step 3: Taking a Random Sample

To take a random sample from a DataFrame, pandas provides the sample() method. It returns a random sample of items along one axis of the DataFrame: rows by default (axis=0), or columns if you pass axis=1. By default it samples without replacement, so no row is picked twice. Let’s take a random sample of 5 rows.
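
A minimal sketch of the sampling call follows. The DataFrame contents are the same assumed demo data as before; the frac and random_state arguments are standard sample() parameters shown here as optional variations, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Assumed demo data (100 rows, columns A, B, C)
df = pd.DataFrame({
    "A": range(1, 101),
    "B": range(101, 201),
    "C": range(201, 301),
})

# Draw 5 random rows (without replacement by default)
sample = df.sample(n=5)

# Variation: sample a fraction of the rows instead of a fixed count
tenth = df.sample(frac=0.1)

# Variation: fix the seed so the same rows come back on every run
repeatable = df.sample(n=5, random_state=42)
```

Without random_state, each run returns different rows, which is usually what you want for exploration but not for results that others need to reproduce.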

Step 4: Outputting the Sample Data

Finally, let’s output what we’ve sampled from the DataFrame to see what it looks like.
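
Printing the sample is a single call to print(). The sketch below repeats the assumed demo data so that it runs on its own.

```python
import pandas as pd

# Assumed demo data (100 rows, columns A, B, C)
df = pd.DataFrame({
    "A": range(1, 101),
    "B": range(101, 201),
    "C": range(201, 301),
})

sample = df.sample(n=5)

# Show the sampled rows; the left-hand column is the original row index
print(sample)
```

The index labels in the output are kept from the original DataFrame, so you can trace each sampled row back to its source.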

The Full Code
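
Putting the steps together gives a short script along these lines. It is a sketch under the same assumption as above: the DataFrame values are inferred from the sample output rather than taken from an original listing.

```python
import pandas as pd

# Step 2: build the demo DataFrame (assumed values, 100 rows)
df = pd.DataFrame({
    "A": range(1, 101),
    "B": range(101, 201),
    "C": range(201, 301),
})

# Step 3: take a random sample of 5 rows
sample = df.sample(n=5)

# Step 4: output the sampled rows
print(sample)
```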

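Running the code prints five randomly chosen rows. Because the sample is random, the rows you see will differ from run to run. For example:

     A    B    C
61  62  162  262
9   10  110  210
37  38  138  238
46  47  147  247
50  51  151  251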

Conclusion

This tutorial walked you through how to take a random sample from a DataFrame using Python’s pandas library, whose built-in sample() method makes the task painless. You should now be able to draw a random subset of rows from a large dataset to make it easier to explore.