In Python, an important first step when working with text data is starting from an appropriate corpus. To analyze, process, or derive meaningful insights from text, you need to be able to import it first. In this tutorial, we’ll learn how to import a corpus of documents using two Python libraries – NLTK (Natural Language Toolkit) and gensim.
Prerequisites
To follow this tutorial, we need the NLTK and gensim libraries installed. You can simply use pip to install both:
```shell
pip install nltk gensim
```
Download NLTK Corpus
The NLTK library comes with many corpora, like the Gutenberg Corpus, Web and Chat Text corpus, Brown Corpus, etc. To import any of these, you first need to download them. Here is how you can do it:
```python
import nltk

nltk.download('gutenberg')
```
Importing NLTK Corpus
After downloading the corpus, you can import and use it in your program as follows:
```python
from nltk.corpus import gutenberg

# Get a list of file ids in the Gutenberg corpus
print(gutenberg.fileids())
```
```
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
```
Below is an example of how to access and print the text of the first file in the Gutenberg corpus. The raw method returns the full text of a file as a single string:
```python
# Print the first 100 characters of the first file
print(gutenberg.raw(gutenberg.fileids()[0])[:100])
```

```
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I

Emma Woodhouse, handsome, clever, and rich, with a
```
Importing Your Own Corpus with Gensim
With the gensim library, you can not only import standard corpora, but also your own. You first tokenize your documents and build a Dictionary object, which maps each unique word to an integer token id. Below is an example of how to do this:
```python
from gensim import corpora

# Define your own corpus
my_corpus = ['This is a document', 'This is another document', 'Documents are here']

# Create dictionary from the corpus
dictionary = corpora.Dictionary([d.split() for d in my_corpus])

# Print dictionary keys and values
for key, value in dictionary.items():
    print(key, ':', value)
```
```
0 : This
1 : a
2 : document
3 : is
4 : another
5 : Documents
6 : are
7 : here
```
In the output, each key-value pair represents a token id in the dictionary and the word from the corpus it stands for.
There are many more things you can do with NLTK and Gensim. Exploring these two libraries will greatly aid your text and natural language processing endeavors in Python.
Full code:
```python
import nltk

nltk.download('gutenberg')

from nltk.corpus import gutenberg

print(gutenberg.fileids())
print(gutenberg.raw(gutenberg.fileids()[0])[:100])

from gensim import corpora

# Define your own corpus
my_corpus = ['This is a document', 'This is another document', 'Documents are here']

# Create dictionary from the corpus
dictionary = corpora.Dictionary([d.split() for d in my_corpus])

# Print dictionary keys and values
for key, value in dictionary.items():
    print(key, ':', value)
```
Conclusion
Having walked through the process of importing a corpus using NLTK and Gensim in Python, you should now be equipped to start exploring and working with text data. You can perform several text data operations including tokenization, stemming, lemmatization, topic extraction, and many more. Happy coding!