Working with large datasets can be challenging and time-consuming, especially when you are trying to extract insights from the data. Random sampling is one of the most effective techniques that data scientists and analysts use to simplify this process.
It reduces the dataset to a manageable size while, for a sufficiently large sample, remaining broadly representative of the full data. In this tutorial, we’ll show you how to take a random sample from a DataFrame in Python using pandas, the Python Data Analysis Library.
Prerequisites
Before we begin, you must have Python installed on your computer. You also need to have the pandas library. If you don’t have pandas, you can install it by running this command in your terminal:
pip install pandas
Step 1: Importing the necessary library
We will use pandas to work with DataFrames, so we need to import it first.
import pandas as pd
Step 2: Creating a DataFrame
Next, let’s create a simple DataFrame for the demonstration. A DataFrame is a two-dimensional labeled data structure with columns potentially of different types. DataFrames are generally the most commonly used pandas object.
df = pd.DataFrame({
    'A': range(1, 101),
    'B': range(101, 201),
    'C': range(201, 301)
})
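Before sampling, it can help to confirm that the DataFrame has the shape you expect. The following is a minimal sketch that rebuilds the same df so it runs on its own; the choice to preview three rows is arbitrary.

```python
import pandas as pd

# Rebuild the demonstration DataFrame from this step
df = pd.DataFrame({
    'A': range(1, 101),
    'B': range(101, 201),
    'C': range(201, 301)
})

# (rows, columns): this DataFrame has 100 rows and 3 columns
print(df.shape)

# Preview the first three rows to eyeball the contents
print(df.head(3))
```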
Step 3: Taking a Random Sample
To take a random sample from a DataFrame, pandas provides the sample() method. It returns a random sample of items from one axis of the DataFrame: the rows (the default) or the columns. Because the selection is random, the result differs on each run unless you pass a fixed random_state. Let’s take a random sample of 5 rows.
sample_df = df.sample(n=5)
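Beyond n, sample() accepts a few other parameters that are often useful. The sketch below is illustrative rather than exhaustive: it rebuilds the demonstration df so it runs on its own, and the seed value 42 and the variable names are arbitrary choices made for this example.

```python
import pandas as pd

# Rebuild the demonstration DataFrame so this example is self-contained
df = pd.DataFrame({
    'A': range(1, 101),
    'B': range(101, 201),
    'C': range(201, 301)
})

# Sample a fraction of the rows instead of a fixed count (10% of 100 rows)
frac_sample = df.sample(frac=0.1)
print(len(frac_sample))

# Fix the seed so the same rows come back on every run (42 is arbitrary)
repeatable_a = df.sample(n=5, random_state=42)
repeatable_b = df.sample(n=5, random_state=42)
print(repeatable_a.equals(repeatable_b))  # True: same seed, same rows

# Sample with replacement: the same row may be picked more than once
with_replacement = df.sample(n=5, replace=True)

# Sample columns instead of rows by changing the axis
col_sample = df.sample(n=2, axis='columns')
print(col_sample.columns.tolist())
```

Note that n and frac are alternatives: pass one or the other, not both.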
Step 4: Outputting the Sample Data
Finally, let’s output what we’ve sampled from the DataFrame to see what it looks like.
print(sample_df)
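The printed sample keeps the row labels from the original DataFrame, so its index will appear out of order. If you would rather have a clean index starting at 0, one option is reset_index(). The sketch below assumes the original labels are not needed afterwards; clean_df is a name chosen for this example.

```python
import pandas as pd

# Rebuild the demonstration DataFrame and draw a sample of 5 rows
df = pd.DataFrame({
    'A': range(1, 101),
    'B': range(101, 201),
    'C': range(201, 301)
})
sample_df = df.sample(n=5)

# The sample keeps the row labels it had in the original DataFrame
print(sample_df.index.tolist())

# Discard the old labels and renumber the rows from 0 to 4
clean_df = sample_df.reset_index(drop=True)
print(clean_df)
```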
The Full Code
import pandas as pd

df = pd.DataFrame({
    'A': range(1, 101),
    'B': range(101, 201),
    'C': range(201, 301)
})

sample_df = df.sample(n=5)

print(sample_df)
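Running the code prints five randomly chosen rows. Your rows will differ, but the output will look similar to this:

     A    B    C
61  62  162  262
9   10  110  210
37  38  138  238
46  47  147  247
50  51  151  251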
Conclusion
This tutorial walked you through how to take a random sample from a DataFrame using Python’s pandas library, whose built-in sample() method makes the task straightforward. You should now be able to draw a random sample from your own DataFrames, which gives you a smaller and more manageable subset to explore when working with large datasets.