In this tutorial, we’ll learn how to extract data from an XML file using Python. XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
XML is widely used for data exchange among various applications and systems over the internet. Let’s explore how we can extract data from an XML file using Python’s built-in ElementTree library.
Example XML file:
<movies>
  <movie id="1">
    <title>The Godfather</title>
    <year>1972</year>
    <director>Francis Ford Coppola</director>
  </movie>
  <movie id="2">
    <title>Pulp Fiction</title>
    <year>1994</year>
    <director>Quentin Tarantino</director>
  </movie>
</movies>
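To follow along, the sample above needs to exist on disk. Here is a minimal sketch, assuming the file is named movies.xml (the name the later steps pass to parse()) and sits in the current working directory. The helper name write_sample is made up for this illustration.

```python
from pathlib import Path

# The sample document from this tutorial, kept as a plain string.
SAMPLE_XML = """<movies>
  <movie id="1">
    <title>The Godfather</title>
    <year>1972</year>
    <director>Francis Ford Coppola</director>
  </movie>
  <movie id="2">
    <title>Pulp Fiction</title>
    <year>1994</year>
    <director>Quentin Tarantino</director>
  </movie>
</movies>
"""


def write_sample(path="movies.xml"):
    """Write the sample XML to `path` and return it as a Path object."""
    target = Path(path)
    target.write_text(SAMPLE_XML, encoding="utf-8")
    return target


if __name__ == "__main__":
    write_sample()
```

Running the script once creates movies.xml next to it, so the remaining steps can be executed unchanged.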
Step 1: Import Required Libraries
First, we import ElementTree (the xml.etree.ElementTree module), which we will use to parse the XML file and extract the required data. It ships with Python's standard library, so nothing needs to be installed.
import xml.etree.ElementTree as ET
Step 2: Load and Parse the XML File
In this step, we will load the XML file and parse its contents using the ElementTree library.
# Load and parse the XML file
tree = ET.parse('movies.xml')
root = tree.getroot()
The parse() function reads the XML file and returns an ElementTree object. Calling getroot() on that object returns the root element of the XML tree, which here is the <movies> element.
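To get a feel for what the root element looks like, it can help to parse the same document from a string and inspect it. The sketch below uses ET.fromstring(), which returns the root element directly instead of a tree object. The variable name xml_text is chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Same structure as the tutorial's movies.xml, held in a string (illustrative).
xml_text = """<movies>
  <movie id="1">
    <title>The Godfather</title>
    <year>1972</year>
    <director>Francis Ford Coppola</director>
  </movie>
  <movie id="2">
    <title>Pulp Fiction</title>
    <year>1994</year>
    <director>Quentin Tarantino</director>
  </movie>
</movies>"""

# fromstring() parses XML text and returns the root Element.
root = ET.fromstring(xml_text)

print(root.tag)                        # movies
print(len(root))                       # 2 (number of direct child elements)
print([child.tag for child in root])   # ['movie', 'movie']
```

Whitespace between tags is stored as text, not as child elements, which is why len(root) counts only the two <movie> elements.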
Step 3: Extract Data from the XML File
Now that we have the root element of the XML tree, we can iterate through its child elements and extract the data.
# Iterate through the child elements and extract data
for movie in root:
    movie_id = movie.attrib['id']  # named movie_id to avoid shadowing the built-in id()
    title = movie.find('title').text
    year = movie.find('year').text
    director = movie.find('director').text
    print(f"Movie ID: {movie_id}, Title: {title}, Year: {year}, Director: {director}")
The attrib attribute contains the element's attributes as a dictionary. The find() method searches for the specified subelement and returns its first occurrence, or None if there is no match. The text attribute returns the text content of an element as a string.
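Because find() returns None when a subelement is absent, chaining .text onto it raises an AttributeError on incomplete records. One possible way to guard against that is sketched below, using Element.get() and Element.findtext() with default values. The second record is deliberately incomplete and is made up for this illustration, as is the "unknown" placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second <movie> has no id attribute and no <director>.
xml_text = """<movies>
  <movie id="1">
    <title>The Godfather</title>
    <year>1972</year>
    <director>Francis Ford Coppola</director>
  </movie>
  <movie>
    <title>Pulp Fiction</title>
    <year>1994</year>
  </movie>
</movies>"""

root = ET.fromstring(xml_text)

rows = []
for movie in root.findall("movie"):
    # get() returns the default when the attribute is missing.
    movie_id = movie.get("id", "unknown")
    # findtext() returns the default when the subelement is missing.
    title = movie.findtext("title", default="unknown")
    year = movie.findtext("year", default="unknown")
    director = movie.findtext("director", default="unknown")
    rows.append((movie_id, title, year, director))

for row in rows:
    print(row)
```

Whether a placeholder or an explicit error is the better response to missing data depends on the application. The sketch only shows the lookup calls that avoid the AttributeError.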
Full Code:
import xml.etree.ElementTree as ET

# Load and parse the XML file
tree = ET.parse('movies.xml')
root = tree.getroot()

# Iterate through the child elements and extract data
for movie in root:
    movie_id = movie.attrib['id']  # named movie_id to avoid shadowing the built-in id()
    title = movie.find('title').text
    year = movie.find('year').text
    director = movie.find('director').text
    print(f"Movie ID: {movie_id}, Title: {title}, Year: {year}, Director: {director}")
Output:
Movie ID: 1, Title: The Godfather, Year: 1972, Director: Francis Ford Coppola
Movie ID: 2, Title: Pulp Fiction, Year: 1994, Director: Quentin Tarantino
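Printing is handy for a quick check, but every value read from XML is a string. If the data is needed later in the program, one option is to collect it into a list of dictionaries and convert the numeric fields. The following is a sketch under that assumption. The function name load_movies and the dictionary keys are chosen for this illustration, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io
import xml.etree.ElementTree as ET


def load_movies(source):
    """Return one dict per <movie> element.

    `source` may be a file name or a file-like object, as accepted by ET.parse().
    """
    root = ET.parse(source).getroot()
    return [
        {
            "id": int(movie.attrib["id"]),
            "title": movie.findtext("title"),
            "year": int(movie.findtext("year")),   # convert text to a number
            "director": movie.findtext("director"),
        }
        for movie in root.iter("movie")
    ]


# In-memory stand-in for movies.xml (illustrative).
sample = io.StringIO("""<movies>
  <movie id="1">
    <title>The Godfather</title>
    <year>1972</year>
    <director>Francis Ford Coppola</director>
  </movie>
  <movie id="2">
    <title>Pulp Fiction</title>
    <year>1994</year>
    <director>Quentin Tarantino</director>
  </movie>
</movies>""")

movies = load_movies(sample)

# With real integers, numeric comparisons work as expected.
recent = [m["title"] for m in movies if m["year"] > 1990]
print(recent)  # ['Pulp Fiction']
```

With the file from this tutorial on disk, load_movies('movies.xml') would return the same two records.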
Conclusion
In this tutorial, we’ve learned how to extract data from an XML file using Python’s built-in ElementTree library. This method can be applied to various other XML files to extract information as required.