Handling a large volume of data in Excel can be challenging, especially once you exceed its per-sheet limit of 1,048,576 rows. Fortunately, Python's data manipulation libraries, specifically pandas, can handle datasets of this size effectively.
In this tutorial, you will learn how to manage more than 1,048,576 rows in an Excel sheet with Python’s pandas library.
Step 1: Install Pandas Library
First, make sure you have Python 3 installed on your system. You can check this by running the following command in your terminal or command prompt:
```bash
python --version
```
Next, you need to install the pandas library. Run the following command:
```bash
pip install pandas
```
Step 2: Read the Large Dataset
In this step, we will read the large dataset in chunks, which helps manage memory usage. For this tutorial, let’s assume you have a CSV file named “large_dataset.csv” with more than 1,048,576 rows.
```python
import pandas as pd

chunksize = 500000
filename = 'large_dataset.csv'

# Read large csv file in chunks
reader = pd.read_csv(filename, chunksize=chunksize)
```
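The reader returned by read_csv here is an iterator: each pass of a loop yields one DataFrame of at most chunksize rows, so the entire file is never loaded at once. If memory is still tight, read_csv also accepts usecols and dtype arguments to load only the columns you need in compact types. Here is a minimal sketch of that variation; the column names 'column_name' and 'value' are placeholders, not part of the original example:

```python
# Optional memory optimization: read only the required columns and give
# pandas explicit dtypes. 'column_name' and 'value' are hypothetical
# headers; replace them with your file's actual columns.
reader = pd.read_csv(
    filename,
    chunksize=chunksize,
    usecols=['column_name', 'value'],
    dtype={'column_name': 'category', 'value': 'float32'},
)
```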
Step 3: Process the Data in Chunks
Once you have the dataset in chunks, you can process each chunk separately. For example, suppose you want to filter only the rows that contain a specific value. You could do this in the following way:
```python
def process_chunk(chunk):
    # Filter the rows according to your requirements
    filtered_rows = chunk[chunk['column_name'] == 'filter_value']
    return filtered_rows

filtered_dataframes = []

for chunk in reader:
    filtered_chunk = process_chunk(chunk)
    filtered_dataframes.append(filtered_chunk)

# Concatenate all the filtered chunks
filtered_data = pd.concat(filtered_dataframes)
```
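Keep in mind that this pattern holds every matching row in memory before concatenating, which is fine when the filtered result is small relative to the input. If even the filtered result is too large, you can aggregate per chunk instead of collecting rows. A minimal sketch, again using the hypothetical 'column_name' and 'filter_value', that counts matches without storing them:

```python
# Alternative: aggregate chunk by chunk instead of collecting rows.
# A fresh reader is created because the loop above exhausted the old one.
match_count = 0
for chunk in pd.read_csv(filename, chunksize=chunksize):
    match_count += (chunk['column_name'] == 'filter_value').sum()

print(f"Rows matching the filter: {match_count}")
```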
Step 4: Write the Result to a New Excel File
Finally, write the filtered data to a new Excel file. First, install the openpyxl library, which pandas uses to write data to .xlsx files:
```bash
pip install openpyxl
```
Now, write the filtered data to a new Excel file:
```python
output_filename = 'filtered_data.xlsx'
filtered_data.to_excel(output_filename, index=False)
```
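One caveat: a single .xlsx sheet still cannot hold more than 1,048,576 rows, so to_excel will fail if the filtered result itself exceeds that limit. In that case you can split the output across several sheets with pd.ExcelWriter. A minimal sketch, with a sheet naming scheme chosen just for illustration:

```python
# Split the output across multiple sheets if it exceeds the per-sheet
# limit. MAX_ROWS leaves one row of headroom for the header.
MAX_ROWS = 1_048_575

with pd.ExcelWriter(output_filename) as writer:
    for i, start in enumerate(range(0, len(filtered_data), MAX_ROWS)):
        part = filtered_data.iloc[start:start + MAX_ROWS]
        part.to_excel(writer, sheet_name=f'part_{i + 1}', index=False)
```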
Full Code:
```python
import pandas as pd

# Step 1: Install Pandas and Openpyxl libraries
# pip install pandas
# pip install openpyxl

# Step 2: Read the large dataset
chunksize = 500000
filename = 'large_dataset.csv'
reader = pd.read_csv(filename, chunksize=chunksize)

# Step 3: Process the data in chunks
def process_chunk(chunk):
    # Filter the rows according to your requirements
    filtered_rows = chunk[chunk['column_name'] == 'filter_value']
    return filtered_rows

filtered_dataframes = []

for chunk in reader:
    filtered_chunk = process_chunk(chunk)
    filtered_dataframes.append(filtered_chunk)

# Concatenate all the filtered chunks
filtered_data = pd.concat(filtered_dataframes)

# Step 4: Write the result to a new Excel file
output_filename = 'filtered_data.xlsx'
filtered_data.to_excel(output_filename, index=False)
```
Conclusion
In this tutorial, you learned how to work with datasets larger than Excel's 1,048,576-row limit using Python's pandas library. By reading the dataset in chunks and processing it chunk by chunk, you can manage large-scale datasets with bounded memory use and still perform complex data manipulations.