How to Import a Corpus in Python

When working with text data in Python, an important first step is choosing and loading an appropriate corpus.

To analyze, process, or derive meaningful insights from text data, you need to be able to import it first. In this tutorial, we’ll learn how to import a corpus of documents through two Python libraries – NLTK (Natural Language Toolkit) and gensim.

Prerequisites

To follow this tutorial, you need two Python libraries installed: NLTK and gensim.

You can install both with pip from the command line:

pip install nltk gensim

Download NLTK Corpus

The NLTK library comes with many corpora, like the Gutenberg Corpus, Web and Chat Text corpus, Brown Corpus, etc. To import any of these, you first need to download them. Here is how you can do it:
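A minimal sketch of the download step, assuming the Gutenberg corpus is the one you want and that the machine has network access on the first run:

```python
import nltk

# Download the Gutenberg corpus into NLTK's local data directory.
# The first call needs network access; once the data is present,
# later calls simply report that the package is up to date.
downloaded = nltk.download("gutenberg")

# nltk.download returns True on success and False on failure.
print("Download succeeded:", downloaded)
```

Other corpora can be fetched the same way by passing their identifier, for example "brown" for the Brown Corpus.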

Importing NLTK Corpus

After downloading the corpus, you can import and use it in your program as follows:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Below is an example of how to access and print the beginning of the first file in the Gutenberg corpus. The raw method returns the text of a file as a single string:

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a

Importing Your Own Corpus with Gensim

With the gensim library, you can not only import standard corpora, but also your own. You first need to create a TextCorpus object. Below is an example of how to do this:

0 :  This
1 :  a
2 :  document
3 :  is
4 :  another
5 :  Documents
6 :  are
7 :  here

In this output, each line pairs a token id (the key) with the word from the corpus that it represents in the dictionary (the value).

There are many more things you can do with NLTK and Gensim. Exploring these two libraries will greatly aid your text and natural language processing endeavors in Python.

Full code:

Conclusion

Having walked through the process of importing a corpus using NLTK and Gensim in Python, you should now be equipped to start exploring and working with text data. You can perform several text data operations including tokenization, stemming, lemmatization, topic extraction, and many more. Happy coding!