How to Find Common Words in Two Text Files Using Python

In this tutorial, we will learn how to find the words that two text files have in common using Python. Python is versatile and well suited to text analysis, and a few of its built-in functions and types are all we need to accomplish this task.

Python’s built-ins make it easy to handle files and strings. The open() function and the set type are the ones we will rely on most in this tutorial. A standard Python installation is all you need, so no additional downloads are required.

Step 1: Prepare Your Text Files

First, have your two text files ready. For simplicity, let’s say the files are called ‘file1.txt’ and ‘file2.txt’. Their content could look like this:

This is the first file.
It contains some words.
Some words are common in both files.

And file2.txt:

This is the second file.
It has some words too.
Some words are common in both files.

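Make sure both files are in the directory you run your Python script from (usually the same directory as the script), because the bare file names are resolved relative to the current working directory.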
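If you would rather create the two sample files from Python than by hand, a small helper along these lines could do it. This is only a sketch: the write_sample_files() name, its directory parameter, and the UTF-8 encoding are assumptions made for illustration, while the text is copied from the two listings above.

```python
from pathlib import Path

# Sample contents copied from the two listings above.
FILE1_TEXT = (
    "This is the first file.\n"
    "It contains some words.\n"
    "Some words are common in both files.\n"
)
FILE2_TEXT = (
    "This is the second file.\n"
    "It has some words too.\n"
    "Some words are common in both files.\n"
)


def write_sample_files(directory="."):
    """Create file1.txt and file2.txt inside the given directory (hypothetical helper)."""
    folder = Path(directory)
    (folder / "file1.txt").write_text(FILE1_TEXT, encoding="utf-8")
    (folder / "file2.txt").write_text(FILE2_TEXT, encoding="utf-8")


if __name__ == "__main__":
    # Writes into the current working directory by default.
    write_sample_files()
```

Running it once from the folder that holds your script places both files where the rest of the tutorial expects them.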

Step 2: Write Your Python Script

Now, we’re going to write a Python script that reads both files, extracts the individual words and identifies the common ones.

The Python code for performing the task is shown below:
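A minimal sketch of such a function, reconstructed from the description in Step 3. The function name find_common_words() comes from the text; the parameter names, the with statement, and the explicit UTF-8 encoding are assumptions.

```python
def find_common_words(file1, file2):
    """Return the set of words that appear in both text files."""
    # Open both files for reading; the with statement closes them afterwards.
    # encoding="utf-8" is an assumption about how the files were saved.
    with open(file1, encoding="utf-8") as f1, open(file2, encoding="utf-8") as f2:
        # read() returns the whole file as one string,
        # split() breaks that string into words at whitespace,
        # and set() keeps one copy of each distinct word.
        words1 = set(f1.read().split())
        words2 = set(f2.read().split())
    # The & operator returns only the words present in both sets.
    return words1 & words2
```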

Step 3: Understanding the Python Code

In this Python code, we define the function find_common_words(). It takes two parameters: the names of the two text files you’d like to compare. The built-in open() function is used to open these files for reading.

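We pass the words of each file to set() to keep only the unique ones. The words themselves come from calling the file object’s read() method, which returns the whole file as a single string, followed by the string’s split() method, which breaks that string apart at whitespace. Because split() only splits at whitespace, punctuation stays attached to a word (for example 'files.'), and the comparison is case-sensitive, so 'Some' and 'some' count as different words. Finally, the ampersand operator (&) computes the intersection of the two sets, that is, the words common to both files.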
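To see split(), set() and the & operator in isolation, here is a short illustration that works on two strings instead of files. The second sentence is made up for this example and is not part of the sample files.

```python
line1 = "Some words are common in both files."
line2 = "Some words differ, some are common."  # hypothetical sentence

# split() breaks each string at whitespace; set() removes duplicates.
words1 = set(line1.split())
words2 = set(line2.split())

# & keeps only the words that occur in both sets.
common = words1 & words2

# Sets are unordered, so sort before printing to get a stable order.
print(sorted(common))  # ['Some', 'are', 'words']
```

Note that 'common' is missing from the result: line1 contains 'common' while line2 contains 'common.' with a trailing period, and split() treats them as different words.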

Full Code
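A complete script might look like the sketch below. It follows the steps described in Step 3; the parameter names, the UTF-8 encoding, and the handling of a missing file are assumptions added for illustration.

```python
def find_common_words(file1, file2):
    """Return the set of words that appear in both text files."""
    # encoding="utf-8" is an assumption about how the files were saved.
    with open(file1, encoding="utf-8") as f1, open(file2, encoding="utf-8") as f2:
        words1 = set(f1.read().split())  # unique words in the first file
        words2 = set(f2.read().split())  # unique words in the second file
    # Intersection of the two sets: the words found in both files.
    return words1 & words2


if __name__ == "__main__":
    try:
        common_words = find_common_words("file1.txt", "file2.txt")
    except FileNotFoundError as err:
        # Raised when a file is not in the current working directory.
        print(f"Could not open {err.filename}")
    else:
        print(common_words)
```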

Output:

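{'This', 'is', 'the', 'file.', 'It', 'some', 'Some', 'words', 'are', 'common', 'in', 'both', 'files.'}

Because sets are unordered, the words may be printed in a different order on your machine.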

Conclusion

After following this tutorial, you should have a basic understanding of how to find common words in two text files using Python. This technique is useful in a variety of cases, such as text analysis, data cleaning, and Natural Language Processing (NLP).