In this tutorial, we’ll learn how to remove stop words from a text file in Python without using the Natural Language Toolkit (NLTK) library. Stop words are commonly used words such as “a”, “an”, and “the” that do not carry significant meaning and are often removed from the text during text preprocessing.
Step 1: Read the Text File
In this step, we’ll read the text file, store its content in a variable, and convert it to lowercase. Make sure the file is in the same directory as your Python script or provide the full path.
This is the content of the file:
Two roads diverged in a wood, and I- I took the one less traveled by, And that has made all the difference.
Create a new Python file and add the following code:
    with open('your_file.txt', 'r') as file:
        text = file.read().lower()
Replace your_file.txt with the name of your text file.
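As an optional variation, the read step can be wrapped in a small helper that states the file encoding explicitly, so the result does not depend on the platform default. This is only a sketch: the name read_lowercase is hypothetical, and UTF-8 is an assumption about how your file was saved.

```python
from pathlib import Path


def read_lowercase(path):
    """Return the contents of a text file as one lowercase string.

    read_lowercase is a hypothetical helper name used for illustration.
    """
    # Assumption: the file is UTF-8 encoded. Change the encoding if yours differs.
    return Path(path).read_text(encoding='utf-8').lower()
```

Calling text = read_lowercase('your_file.txt') would then give the same kind of lowercase string as the with open(...) version above.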
Step 2: Tokenize the Text
Now we’ll split the text into a list of words, a process known as tokenization. Calling split() with no arguments splits the text on runs of whitespace, which makes it easy to access individual words and remove the stop words.
Add the following line of code to your Python file:
    words = text.split()
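Because split() only breaks on whitespace, punctuation stays attached to the neighbouring word, so "wood," is a different token from "wood". If that matters for your data, one possible approach is to strip leading and trailing punctuation from each token. This is a minimal sketch, not part of the tutorial’s main code path; the sample string is the tutorial’s text, already lowercased, and raw_words is a name introduced only for illustration.

```python
import string

# Sample text from this tutorial, already lowercased as in Step 1.
text = ("two roads diverged in a wood, and i- i took the one less "
        "traveled by, and that has made all the difference.")

# split() breaks on whitespace only, so "wood," keeps its comma.
raw_words = text.split()

# Strip punctuation from both ends of each token, then drop any token
# that becomes empty (for example a lone "-").
words = [w.strip(string.punctuation) for w in raw_words]
words = [w for w in words if w]
```

Note that strip() only touches the ends of a token, so apostrophes and hyphens inside a word (as in "don't") are left alone.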
Step 3: Create a List of Stop Words
In this step, create a list of stop words you want to remove from your text. Ready-made stop word lists for many languages are available online, for example from StopWords ISO.
For this tutorial, we’ll create a small list of common English stop words.
Add the following code to your Python file:
    stop_words = ['a', 'an', 'the', 'and', ',', '.', 'to', 'is', 'in', 'for', 'of']
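You can customize this list by adding or removing words as per your requirements. Note that the ',' and '.' entries only match tokens that consist of that single character. Because split() leaves punctuation attached to words (for example "wood,"), these two entries have no effect on the sample text.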
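For longer lists, a set is usually a better container than a list, because membership checks with `in` do not have to scan every element. The sketch below also shows one way to load stop words from a plain-text file. Both the helper name load_stop_words and the one-word-per-line file format are assumptions, so check the format of any list you download.

```python
# A set gives fast membership checks, which helps with long stop word lists.
stop_words = {'a', 'an', 'the', 'and', 'to', 'is', 'in', 'for', 'of'}


def load_stop_words(path):
    """Load stop words from a text file with one word per line.

    Hypothetical helper: the one-word-per-line format is an assumption.
    """
    with open(path, 'r', encoding='utf-8') as f:
        # Lowercase each word, trim whitespace, and skip blank lines.
        return {line.strip().lower() for line in f if line.strip()}
```

The filtering step in Step 4 works unchanged with a set, since `word not in stop_words` behaves the same way for lists and sets.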
Step 4: Remove Stop Words
Now, iterate through the list of words and filter out the stop words using a list comprehension.
Add the following line of code:
    filtered_words = [word for word in words if word not in stop_words]
Step 5: Combine Words Back Into Text
Once the stop words are removed from the list of words, we need to combine the words back into a single string.
Add the following line of code:
    clean_text = ' '.join(filtered_words)
Step 6: Display the Cleaned Text
Finally, let’s display the cleaned text to make sure the stop words have been removed.
Add the following line to your Python file:
    print(clean_text)
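Now run your code to see the cleaned text. With the sample file above and the code exactly as shown, the output is:
    two roads diverged wood, i- i took one less traveled by, that has made all difference.
Punctuation stays attached to the words because split() only splits on whitespace.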
Full Code
    with open('your_file.txt', 'r') as file:
        text = file.read().lower()

    words = text.split()

    stop_words = ['a', 'an', 'the', 'and', ',', '.', 'to', 'is', 'in', 'for', 'of']

    filtered_words = [word for word in words if word not in stop_words]

    clean_text = ' '.join(filtered_words)

    print(clean_text)
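If you need to clean several texts, the same steps can be folded into a reusable function. The version below is a sketch rather than a drop-in replacement for the script above: the names remove_stop_words and DEFAULT_STOP_WORDS are hypothetical, and it additionally strips punctuation from the ends of each token, which the script above does not do.

```python
import string

# Same small stop word list as in the tutorial, minus the punctuation entries.
DEFAULT_STOP_WORDS = frozenset(
    {'a', 'an', 'the', 'and', 'to', 'is', 'in', 'for', 'of'}
)


def remove_stop_words(text, stop_words=DEFAULT_STOP_WORDS):
    """Lowercase text, split on whitespace, and drop stop words.

    Hypothetical helper; punctuation is stripped from the ends of each
    token so that "wood," is compared as "wood".
    """
    tokens = (w.strip(string.punctuation) for w in text.lower().split())
    return ' '.join(w for w in tokens if w and w not in stop_words)


sample = ("Two roads diverged in a wood, and I- I took the one less "
          "traveled by, And that has made all the difference.")
print(remove_stop_words(sample))
# prints: two roads diverged wood i i took one less traveled by that has made all difference
```

Passing a different collection as stop_words, such as a set loaded from a file, lets you reuse the function for other languages or domains.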
Conclusion
In this tutorial, we have learned how to remove stop words from a text file in Python without using the NLTK library. This can be handy for cleaning and pre-processing text data for a variety of natural language processing (NLP) applications.