In this tutorial, we will learn how to remove stop words from a CSV file using Python. Stop words are the most common words in a language, such as 'is', 'the', and 'and', which carry little meaning on their own and are often removed from text before analysis.
We will use the popular Python data manipulation package pandas together with the nltk library for natural language processing.
Step 1: Importing Required Libraries
First, we import the necessary libraries: pandas to read and manipulate our CSV data, and nltk to supply the stop word list.
import pandas as pd
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
The script above imports the pandas and nltk libraries and downloads the NLTK stopwords corpus. If you have already downloaded the stopwords corpus, you can skip nltk.download('stopwords').
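If you prefer to download the corpus only when it is missing, one option (a small sketch, not required for the rest of the tutorial) is to catch the LookupError that NLTK raises when a resource cannot be found:

try:
    stopwords.words('english')  # raises LookupError if the corpus is not installed yet
except LookupError:
    nltk.download('stopwords')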
Step 2: Load the CSV Data
For this tutorial, suppose we have a CSV file named "text_data.csv" with two columns: ID and Text. Here's a snippet of what the file looks like:
ID,Text
1,This is the first document.
2,This document is the second document.
3,And this is the third one.
4,Is this the first document.
To load the CSV file, we use the pd.read_csv() function.
df = pd.read_csv('text_data.csv')
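A quick sanity check right after loading helps confirm that pandas parsed the columns as expected; for the sample file above, this should print four rows with the ID and Text columns:

print(df.head())
print(df.columns.tolist())  # expected: ['ID', 'Text']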
Step 3: Removing Stop Words
Next, we define a function to remove stop words. The call set(stopwords.words('english')) builds a set of English stop words, and the function keeps only the words that are not in that set.
def remove_stop_words(text):
    stop_words = set(stopwords.words('english'))
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)
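To see what the function does, we can call it on the first row of our sample data. Note that the check is case-sensitive, so the capitalized 'This' is kept while the lowercase 'is' and 'the' are removed:

print(remove_stop_words("This is the first document."))
# Output: This first document.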
Now we apply this function to every row of the Text column in our DataFrame.
df['Text'] = df['Text'].apply(lambda x: remove_stop_words(x))
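The sample file has no missing values, but if your own Text column might contain empty cells, pandas represents them as NaN floats and text.split() inside the function would raise an error. In that case you could replace the line above with a defensive variant (a sketch):

# Fill missing values with an empty string before removing stop words
df['Text'] = df['Text'].fillna('').apply(remove_stop_words)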
Step 4: Save the Processed Data
Finally, we save our processed data to a new CSV file.
df.to_csv('processed_text_data.csv', index=False)
Setting index=False means that pandas will not write the DataFrame index as an extra column in the output file, so the CSV contains only the ID and Text columns. This keeps the exported file cleaner and easier to reuse.
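If you want to double-check the result, you can simply read the new file back into pandas:

# Reload the exported file to verify its contents
print(pd.read_csv('processed_text_data.csv'))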
Complete Code
Here's the complete code in one place:
import pandas as pd
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

df = pd.read_csv('text_data.csv')

def remove_stop_words(text):
    stop_words = set(stopwords.words('english'))
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)

df['Text'] = df['Text'].apply(lambda x: remove_stop_words(x))

# Save the processed data without the index column
df.to_csv('processed_text_data.csv', index=False)
The resulting 'processed_text_data.csv' file looks like this:
ID,Text
1,This first document.
2,This document second document.
3,And third one.
4,Is first document.
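Notice that capitalized words such as 'This', 'And', and 'Is' survive because NLTK's stop word list is all lowercase, and punctuation stays attached to words because we split on whitespace. If that matters for your task, a common adjustment (shown here as a sketch, not part of the pipeline above) is to lowercase the text and strip surrounding punctuation before the membership check:

import string

def remove_stop_words_normalized(text):
    stop_words = set(stopwords.words('english'))
    # lowercase and strip surrounding punctuation so 'This' and 'document.' match the list
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    return ' '.join(w for w in words if w and w not in stop_words)

Applied to "This is the first document.", this version returns "first document" instead of "This first document.".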
Conclusion
In this article, you learned how to remove stop words from a CSV file using Python and save the cleaned data for future use. This is one of the basic preprocessing steps in most natural language processing tasks.
Remember that the stop word list from NLTK may not suit every task; for example, it removes negations like 'not', which can be important in sentiment analysis, so consider the context before filtering.
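Because stopwords.words('english') returns a plain Python list, you can turn it into a set and adjust it for your task. A minimal sketch, where the added words are purely illustrative:

stop_words = set(stopwords.words('english'))
stop_words.discard('not')          # keep negations for sentiment-style tasks
stop_words.update({'etc', 'ok'})   # add your own filler words (illustrative examples)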