XML parsing in Python - GeeksforGeeks (2024)

Last Updated : 28 Jun, 2022

This article focuses on how one can parse a given XML file and extract some useful data out of it in a structured way.

XML: XML stands for eXtensible Markup Language. It was designed to store and transport data, and to be both human- and machine-readable. That is why the design goals of XML emphasize simplicity, generality, and usability across the Internet.
The XML file to be parsed in this tutorial is actually an RSS feed.

RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information such as blog entries, news headlines, audio, and video. RSS is XML-formatted plain text.

  • The RSS format itself is relatively easy to read both by automated processes and by humans alike.
  • The RSS processed in this tutorial is the RSS feed of top news stories from a popular news website (the feed URL appears in the loadRSS() function below). Our goal is to process this RSS feed (or XML file) and save it in some other format for future use.

Python Module used: This article uses Python's built-in xml package for parsing XML, focusing mainly on its ElementTree XML API (xml.etree.ElementTree).

Implementation:

# Python code to illustrate parsing of XML files

# importing the required modules
import csv
import requests
import xml.etree.ElementTree as ET


def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

    # creating HTTP response object from given url
    resp = requests.get(url)

    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)


def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)

    # get root element
    root = tree.getroot()

    # create empty list for news items
    newsitems = []

    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}

        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                # in Python 3, child.text is already a str (None if empty)
                news[child.tag] = child.text

        # append news dictionary to news items list
        newsitems.append(news)

    # return news items list
    return newsitems


def savetoCSV(newsitems, filename):
    # specifying the fields for csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']

    # writing to csv file (newline='' avoids blank rows on Windows)
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # creating a csv dict writer object; tags not listed in fields are skipped
        writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')

        # writing headers (field names)
        writer.writeheader()

        # writing data rows
        writer.writerows(newsitems)


def main():
    # load rss from web to update existing xml file
    loadRSS()

    # parse xml file
    newsitems = parseXML('topnewsfeed.xml')

    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')


if __name__ == "__main__":
    # calling main function
    main()

The above code will:

  • Load RSS feed from specified URL and save it as an XML file.
  • Parse the XML file to save news as a list of dictionaries where each dictionary is a single news item.
  • Save the news items into a CSV file.

Let us try to understand the code in pieces:

  • Loading and saving RSS feed
    def loadRSS():
        # url of rss feed
        url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

        # creating HTTP response object from given url
        resp = requests.get(url)

        # saving the xml file
        with open('topnewsfeed.xml', 'wb') as f:
            f.write(resp.content)

    Here, we first create an HTTP response object by sending an HTTP request to the URL of the RSS feed. The content of the response now contains the XML file data, which we save as topnewsfeed.xml in our local directory.
    For more insight on how requests module works, follow this article:
    GET and POST requests using Python

  • Parsing XML
    We have created the parseXML() function to parse the XML file. XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree.

    Here, we are using the xml.etree.ElementTree module (ET, for short). ElementTree has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.
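The split between the two classes can be seen in a few lines. This is only a minimal sketch: the SAMPLE document is a made-up stand-in for the real feed, and an in-memory buffer is used instead of a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = "<rss><channel><title>Demo feed</title></channel></rss>"

# ET.parse() accepts a filename or a file-like object and returns an ElementTree.
tree = ET.parse(io.StringIO(SAMPLE))
root = tree.getroot()  # the root node, which is an Element

tree_class = type(tree).__name__
root_class = type(root).__name__

# Document-level work (here: serialising) happens on the ElementTree ...
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8")

# ... while node-level work (here: reading a child) happens on an Element.
title = root.find("channel/title")
title_text = title.text
```

Here tree is the ElementTree wrapper and root is an Element; everything reached through find() or iteration is again an Element.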

    Ok, so let’s go through the parseXML() function now:

    tree = ET.parse(xmlfile)

    Here, we create an ElementTree object by parsing the passed xmlfile.

    root = tree.getroot()

    The getroot() function returns the root of the tree as an Element object.

    for item in root.findall('./channel/item'):

    Once you have taken a look at the structure of your XML file, you will notice that we are interested only in the item elements.
    ./channel/item is actually XPath syntax (XPath is a language for addressing parts of an XML document). Here, we want to find all item elements that are children of the channel children of the root element (denoted by '.'), that is, the item grandchildren of the root.
    ElementTree supports only a limited subset of XPath; the supported syntax is listed in the xml.etree.ElementTree documentation.
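To see what the path expression selects, here is a small sketch run against a hypothetical miniature feed (the FEED string is invented for illustration and is much simpler than the real one).

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed; the real feed has many more elements.
FEED = """
<rss version="2.0">
  <channel>
    <title>Demo feed</title>
    <item><title>First story</title></item>
    <item><title>Second story</title></item>
  </channel>
</rss>
"""

root = ET.fromstring(FEED)  # root is the <rss> element

# './channel/item': start at the root ('.'), step into each <channel> child,
# then collect every <item> child of those channels.
items = root.findall("./channel/item")
titles = [item.findtext("title") for item in items]

# './/item' matches <item> elements at any depth below the root.
anywhere = root.findall(".//item")
```

Note that the channel's own title is not returned, because the path only descends into item elements.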

    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}

        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                news[child.tag] = child.text

        # append news dictionary to news items list
        newsitems.append(news)

    We are iterating through item elements, where each item element contains one news story. So, we create an empty news dictionary in which we will store all the data available about that news item. To iterate through the child elements of an element, we simply iterate over it, like this:

    for child in item:

    Now, notice a sample item element here:

    We have to handle namespaced tags separately, because the namespace prefix is expanded to its full URI when the document is parsed. So, we do something like this:

    if child.tag == '{http://search.yahoo.com/mrss/}content':
        news['media'] = child.attrib['url']

    child.attrib is a dictionary of all the attributes of an element. Here, we are interested in the url attribute of the namespaced media:content tag.
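The expansion of the prefix is easier to follow on a concrete element. The sketch below uses a hypothetical item (the title and image URL are invented) and shows two ways of reaching the namespaced child: the expanded tag used in this article, and the optional prefix mapping that find() and findall() accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical item, modelled on the media:content pattern described above.
ITEM = """
<item xmlns:media="http://search.yahoo.com/mrss/">
  <title>Sample story</title>
  <media:content url="http://example.com/picture.jpg" />
</item>
"""

item = ET.fromstring(ITEM)

# After parsing, the prefix is replaced by the namespace URI in braces.
tags = [child.tag for child in item]

# Option 1: look up the expanded tag, as the article's code does.
expanded = item.find("{http://search.yahoo.com/mrss/}content")

# Option 2: pass a prefix-to-URI mapping so the short form can be used.
ns = {"media": "http://search.yahoo.com/mrss/"}
prefixed = item.find("media:content", ns)

media_url = prefixed.attrib["url"]
```

Both lookups reach the same element; the mapping form is mostly a readability choice when many namespaced tags are involved.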
    Now, for all other children, we simply do:

    news[child.tag] = child.text

    child.tag contains the name of the child element. child.text stores the text inside that child element; in Python 3 it is already a str (or None for an empty element), so no encoding step is needed. So, finally, a sample item element is converted to a dictionary and looks like this:

    {'description': 'Ignis has a tough competition already, from Hyun.... ,
     'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
     'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
     'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/... ,
     'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT ',
     'title': 'Maruti Ignis launches on Jan 13: Five cars that threa..... }

    Then, we simply append this dict element to the list newsitems.
    Finally, this list is returned.
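The whole loop can be tried without downloading anything. The following is a sketch under stated assumptions: parse_items() is a hypothetical variant of parseXML() that reads from a string instead of a file, and FEED is an invented two-item feed. It also guards against empty elements, whose text is None.

```python
import xml.etree.ElementTree as ET

MEDIA_CONTENT = "{http://search.yahoo.com/mrss/}content"

# Hypothetical feed used only to exercise the loop.
FEED = """
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>First story</title>
      <description/>
      <media:content url="http://example.com/1.jpg"/>
    </item>
    <item>
      <title>Second story</title>
    </item>
  </channel>
</rss>
"""


def parse_items(xml_text):
    """Return one dictionary per <item>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    newsitems = []
    for item in root.findall("./channel/item"):
        news = {}
        for child in item:
            if child.tag == MEDIA_CONTENT:
                news["media"] = child.attrib.get("url", "")
            else:
                # child.text is None for an empty element such as <description/>
                news[child.tag] = (child.text or "").strip()
        newsitems.append(news)
    return newsitems


newsitems = parse_items(FEED)
```

Each dictionary only contains the keys that were present in its item, which is why the second entry has no description or media key.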

  • Saving data to a CSV file
    Now, we simply save the list of news items to a CSV file using the savetoCSV() function, so that it can be used or modified easily in the future. To know more about writing dictionary elements to a CSV file, go through this article:
    Working with CSV files in Python
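As a rough illustration of how csv.DictWriter treats dictionaries whose keys do not match the field list exactly, the sketch below writes two hypothetical rows into an in-memory buffer. The restval and extrasaction arguments are optional DictWriter parameters; the rows themselves are invented.

```python
import csv
import io

fields = ["guid", "title", "pubDate", "description", "link", "media"]

# Hypothetical rows: the first lacks several fields, the second has an extra key.
rows = [
    {"guid": "1", "title": "First story", "link": "http://example.com/1"},
    {"guid": "2", "title": "Second story", "media": "http://example.com/2.jpg",
     "category": "autos"},
]

# StringIO stands in for open(filename, "w", newline="", encoding="utf-8").
out = io.StringIO()

# restval fills missing fields; extrasaction="ignore" drops keys not in fields
# (the default, "raise", would raise ValueError for the 'category' key).
writer = csv.DictWriter(out, fieldnames=fields, restval="", extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)

csv_lines = out.getvalue().splitlines()
```

Missing fields come out as empty cells, so every row keeps the same number of columns.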

Here is how our formatted data looks now:

As you can see, the hierarchical XML file data has been converted to a simple CSV file in which all news stories are stored in the form of a table. This makes it easier to extend the database too.
Also, one can use the JSON-like data (the list of dictionaries) directly in their applications. This is a good alternative for extracting data from websites that do not provide a public API but do provide RSS feeds.
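Since the parsed result is a plain list of dictionaries, it can be serialised with the standard json module. This is a minimal sketch with invented sample data; it assumes the values are str, because json.dumps() cannot serialise bytes.

```python
import json

# Hypothetical parsed items, shaped like the dictionaries built by parseXML().
newsitems = [
    {"title": "First story", "link": "http://example.com/1"},
    {"title": "Second story", "link": "http://example.com/2"},
]

# Serialise to a JSON string; ensure_ascii=False keeps non-ASCII text readable.
payload = json.dumps(newsitems, indent=2, ensure_ascii=False)

# Reading it back gives an equal list of dictionaries.
restored = json.loads(payload)
```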

All the code and files used in above article can be found here.

What next?

  • You can have a look at more RSS feeds of the news website used in the above example. You can try to create an extended version of the example by parsing other RSS feeds too.
  • Are you a cricket fan? Then this RSS feed may interest you! You can parse this XML file to scrape information about live cricket matches and use it to make a desktop notifier!

Nikhil Kumar

FAQs

What is the best way to parse XML in Python? ›

ElementTree. The ElementTree XML API provides a simple and intuitive API for parsing and creating XML data in Python. It's a built-in module in Python's standard library, which means you don't need to install anything explicitly.

Is it safe to parse XML in Python? ›

Python's interfaces for processing XML are grouped in the xml package. The XML modules are not secure against erroneous or maliciously constructed data. If you need to parse untrusted or unauthenticated data, see the "XML vulnerabilities" and "The defusedxml Package" sections of the Python documentation.

Is it easier to parse XML or JSON in Python? ›

You need to parse XML with an XML parser. JSON is simpler and maps directly onto Python data types, while XML is more complex. JSON supports numbers, strings, Booleans, arrays, and objects.

Which Python module is best suited for parsing XML documents? ›

The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.

Which is the fastest XML parser in Python? ›

Benchmarking XML Parsing Speed

In the benchmark quoted here, lxml was by far the fastest XML parsing library, taking only 0.35 seconds compared to over 2 seconds with the built-in xml package.

Which is the best XML parser? ›

If you need a simple and easy-to-use XML parser, then xml.etree.ElementTree is a good option. And if you need to parse large XML files, then SAX is a good option.

Does Python have a built-in XML parser? ›

Yes. Python's standard library includes several XML parsers, among them xml.etree.ElementTree, xml.dom.minidom, xml.dom.pulldom, and xml.sax, which are available in nearly every Python distribution.

Is Python good for file parsing? ›

Python is a versatile programming language known for its simplicity and readability. It has a rich ecosystem of libraries, such as CSV, JSON, XML, and binary file parsers, that make it easy to parse and manipulate files in different formats.

Can I convert XML to JSON? ›

To convert an XML document to JSON, follow these steps:
  1. Select the XML to JSON action from the Tools > JSON Tools menu. ...
  2. Choose or enter the Input URL of the XML document.
  3. Choose the path of the Output file that will contain the resulting JSON document.

Is XML outdated? ›

XML has been around for quite a while and has been used for just about everything. So, to answer the question, yes, people do still use XML! In fact, it remains a popular choice for many applications that require more complex data structures or that need to store data in a way that can be easily searched and analyzed.

Why is parsing slow in XML? ›

The parse method takes all the time because it's waiting on the input from the other application. You need to separate the two so you can see what's going on.

Is JSON replacing XML? ›

Many developers believe that “XML failed and was replaced with JSON,” but this could not be further from the truth. Originally, XML was envisioned to be used for all data interoperability problems, and to this day remains as “the world's most widely-used format for representing and exchanging information.”

How to do XML parsing using Python? ›

There are two ways to parse XML using the ElementTree module. The first is the parse() function and the second is the fromstring() function. parse() reads an XML document supplied as a file (a filename or file object), whereas fromstring() parses XML supplied as a string, for example a triple-quoted literal in the source code.
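The difference between the two entry points is mostly in what they accept and what they return. A minimal sketch, using an invented one-element document and an in-memory file object in place of a file on disk:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document used for both calls.
XML_TEXT = "<note><to>Reader</to></note>"

# fromstring() takes the XML itself (str or bytes) and returns the root Element.
root_from_string = ET.fromstring(XML_TEXT)

# parse() takes a filename or file object and returns an ElementTree;
# call getroot() to reach the root Element.
tree = ET.parse(io.BytesIO(XML_TEXT.encode("utf-8")))
root_from_file = tree.getroot()

recipient = root_from_file.findtext("to")
```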

What is the best library to work with XML in Python? ›

ElementTree is an important Python library that allows you to parse and navigate an XML document. Using ElementTree breaks down the XML document in a tree structure that is easy to work with.

What are the two methods of parsing in XML document? ›

DOM is a tree-based interface that models an XML document as a tree of nodes, upon which the application can search for nodes, read their information, and update the contents of the nodes. SAX is an event-driven interface. The application registers with the parser various event handlers.
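Both styles are available in Python's standard library, through xml.dom.minidom and xml.sax. The sketch below extracts the same values both ways from an invented document; ItemHandler is a hypothetical handler class written for this example.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical document used for both approaches.
XML_TEXT = "<feed><item>First</item><item>Second</item></feed>"

# DOM style: build the whole tree in memory, then query it.
dom = xml.dom.minidom.parseString(XML_TEXT)
dom_items = [node.firstChild.data for node in dom.getElementsByTagName("item")]


# SAX style: register a handler; the parser calls its methods as it reads.
class ItemHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it piece by piece.
        if self._in_item:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._in_item = False


handler = ItemHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)
sax_items = handler.items
```

The DOM version is shorter, but it holds the whole document in memory; the SAX version only keeps what the handler chooses to store.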

How to extract XML data using Python? ›

Load our XML document into memory, and construct an XML ElementTree object. We then use the find method, passing in an XPath selector, which allows us to specify what element we're trying to extract. If the element can't be found, None is returned. If the element can be found, then we'll use the .
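A short sketch of the found and not-found cases, using an invented item. The explicit "is not None" check is used because an Element without children can evaluate as false, so a plain truth test would be misleading.

```python
import xml.etree.ElementTree as ET

# Hypothetical element with a title but no author.
root = ET.fromstring("<item><title>Sample story</title></item>")

title = root.find("title")    # an Element
author = root.find("author")  # None, because there is no <author> child

# Compare with None explicitly rather than relying on truthiness.
title_text = title.text if title is not None else None

# findtext() combines the lookup and the fallback in one call.
author_text = root.findtext("author", default="unknown")
```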

What is the JAXB equivalent for Python? ›

PyXB is a pure Python package that generates Python code for classes that correspond to data structures defined by XMLSchema. In concept it is similar to JAXB for Java and CodeSynthesis XSD for C++.

How to parse XML to JSON in Python? ›

To convert XML to JSON using Python, you can use libraries like 'xmltodict' and 'json'. First, parse the XML using 'xmltodict.parse()', then convert the resulting dictionary to JSON using 'json.dumps()'.
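xmltodict is a third-party package, so the sketch below deliberately does not use it. Instead it shows the same two-step idea (XML to dictionary, dictionary to JSON) with only the standard library. element_to_dict() is a hypothetical, simplified helper: it drops attributes and does not reproduce xmltodict's output format.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Simplified conversion: attributes are dropped, repeated tags become lists."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag: turn the existing value into a list and extend it.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Hypothetical document used for illustration.
root = ET.fromstring(
    "<feed><title>Demo</title><item>First</item><item>Second</item></feed>"
)
as_json = json.dumps({root.tag: element_to_dict(root)})
```

For real documents with attributes, mixed content, or namespaces, a dedicated library is likely the better choice.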

Article information

Author: Margart Wisoky
