Understanding how to handle and manipulate text data is an essential skill for any aspiring programmer.
This tutorial walks you through finding the words that two or more text documents have in common using Python. The approach builds on word-frequency counting, a basic building block of natural language processing (NLP).
Step 1: Import the Necessary Libraries
For this task, we need the ‘collections’ module. It is part of Python’s standard library, so there is nothing to install. To import it, use the following line of code:

    import collections
Step 2: Specify Your Text Documents
The next step is to define the text documents that will be analyzed. For the purpose of this tutorial, we’ll work with two predefined text strings:

    text1 = "Python is powerful. Python is easy to learn. Python is open."
    text2 = "Python is a great language. I love Python language. Python is easy!"
Step 3: Begin the Analysis
First, we need to split the text documents into individual words. We’ll convert all of the text to lowercase so that ‘Python’ and ‘python’ count as the same word, then call split(), which breaks each string apart on whitespace:

    words1 = text1.lower().split()
    words2 = text2.lower().split()
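Because split() only breaks on whitespace, punctuation stays attached to the neighbouring word. The sketch below shows the difference and one possible way around it. The `tokenize` helper and its regular expression are illustrative assumptions, not part of the original tutorial, and the pattern only covers ASCII letters, digits, and apostrophes.

```python
import re

text2 = "Python is a great language. I love Python language. Python is easy!"

def tokenize(text):
    """Hypothetical helper: lowercase the text and return runs of
    ASCII letters, digits, and apostrophes, dropping other punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Plain whitespace splitting keeps the trailing "!" on the last word.
print(text2.lower().split()[-1])  # easy!

# The regex-based tokenizer drops it.
print(tokenize(text2)[-1])        # easy
```

Whether to strip punctuation depends on the data; for prose like the sample strings it usually gives more intuitive matches.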
Step 4: Create Frequency Distributions
To find the most common words in these texts, we create a frequency distribution for each word list using Python’s ‘collections.Counter’ class, which maps every distinct word to the number of times it occurs:

    counter1 = collections.Counter(words1)
    counter2 = collections.Counter(words2)
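To get a feel for what a Counter holds, here is a small sketch that inspects one built from the first sample string (already lowercased). The variable names are only for illustration.

```python
import collections

words = "python is powerful. python is easy to learn. python is open.".split()
counter = collections.Counter(words)

# Look up a single word's count, like a dictionary.
print(counter["python"])       # 3

# Words that never appeared return 0 instead of raising KeyError.
print(counter["java"])         # 0

# most_common(n) returns the n highest counts as (word, count) pairs.
print(counter.most_common(2))  # [('python', 3), ('is', 3)]
```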
Step 5: Find Common Words
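Now, we find the words common to both documents by intersecting the two counters with the & operator. The result keeps only the words present in both counters, each with the smaller of its two counts:

    common_words = counter1 & counter2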
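The following sketch illustrates what the intersection does on two small hand-built counters (the word lists are made up for the example). It also shows an alternative, intersecting the key views, for cases where only the shared words matter and not their counts.

```python
import collections

a = collections.Counter(["python", "python", "python", "is", "is", "is", "open."])
b = collections.Counter(["python", "python", "python", "is", "is", "great"])

# & keeps words found in both counters, with the minimum of the two counts.
shared = a & b
print(shared)                       # Counter({'python': 3, 'is': 2})

# If the counts are not needed, intersect the keys to get a plain set of words.
print(sorted(a.keys() & b.keys()))  # ['is', 'python']
```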
Step 6: Display the Common Words
Last, we’ll display the result:

    print(common_words)
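The output should look like this:

    Counter({'python': 3, 'is': 2})

‘python’ occurs three times in each text, while ‘is’ occurs three times in the first and twice in the second, so the intersection keeps the lower count of 2. Note that ‘easy’ is missing: split() leaves punctuation attached, so the first text contains ‘easy’ and the second contains ‘easy!’, which are treated as different words.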
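If the raw Counter representation is not what you want to show, one option is to print one word per line. This is a minimal sketch; the hard-coded Counter stands in for the shared counts that the two sample strings produce.

```python
import collections

# The shared counts for the two sample strings.
common_words = collections.Counter({"python": 3, "is": 2})

# most_common() yields (word, count) pairs from highest count to lowest.
lines = [f"{word}: {count}" for word, count in common_words.most_common()]
print("\n".join(lines))
# python: 3
# is: 2
```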
Full code:
Here is the full code snippet incorporating all the steps discussed above:

    import collections

    text1 = "Python is powerful. Python is easy to learn. Python is open."
    text2 = "Python is a great language. I love Python language. Python is easy!"

    words1 = text1.lower().split()
    words2 = text2.lower().split()

    counter1 = collections.Counter(words1)
    counter2 = collections.Counter(words2)

    common_words = counter1 & counter2

    print(common_words)

Running this prints:

    Counter({'python': 3, 'is': 2})
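The same idea extends to more than two documents. The sketch below is one possible generalization, not the tutorial's own code: the `find_common_words` function, the punctuation-stripping regular expression, and the third sample sentence are all assumptions added for illustration.

```python
import collections
import functools
import operator
import re

def find_common_words(*texts):
    """Hypothetical helper: return a Counter of the words shared by every
    text, each with its minimum count. Expects at least one text."""
    counters = [
        # Lowercase, then keep runs of ASCII letters, digits, and apostrophes.
        collections.Counter(re.findall(r"[a-z0-9']+", text.lower()))
        for text in texts
    ]
    # Fold the counters together with &, two at a time.
    return functools.reduce(operator.and_, counters)

docs = [
    "Python is powerful. Python is easy to learn. Python is open.",
    "Python is a great language. I love Python language. Python is easy!",
    "Is Python easy? Python is fun.",  # made-up third document
]
result = find_common_words(*docs)
print(result)  # Counter({'python': 2, 'is': 2, 'easy': 1})
```

With punctuation stripped, ‘easy’ now matches in all three documents, and each count is the smallest one found across them.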
Conclusion
Working with text data and finding common words in Python can seem daunting at first, but with a good grasp of the basics and Python’s powerful libraries, you can easily perform complex operations such as this one.
Keep practicing and exploring different datasets to improve your skills and understanding of text processing in Python.