Home > Code, Tips and tricks > Get XML element value in Python using minidom

Get XML element value in Python using minidom

Finally, a “development” post for my “developer” blog.

Recently, I’ve been working on some XML processing programs in Python.

The minidom module is great if you want your XML in a tree, and want tag names and attributes easily accessible, but, what happens if you want the text content inside a tag?

DOM, does not have a “tag value” concept. Instead, every bit of text in the XML, including the indentation is a “text node”, which is parsed as a separate tree element.

That means, that if you have something like this:


<name>John Smith</name>

You will get a tree with two levels: top level for “name” element, for which nodeValue will be None. This element will have a child node (second level of the tree) which will be of type TEXT_NODE an it’s values will be the text “John Smith”.

So far, so good, but, what if the value we want has some XML markup of its own?


<text>This text has <b>bold</b> and <i>italic</i> words.</text>

Now we have a complex tree on our hands with 3 levels and multiple branches.

It will look something like this:

<text>
   |______
          |-"This text has
          |-<b>
          |  |_________
          |            -"bold"
          |-"and"
          |-<i>
          |  |_________
          |            -"italic"
          --"words."

As you can see, this is a big mess, with the text split in to multiple parts on two separate tree levels.

There is no facility in minidom, to get the value of our <text> tag directly.

There is however, a way around it, that is simple but not obvious: you need to “flatten” the desired tag in to an XML string, then strip the tag it self from the string and you will have a clean value.

Here is the code:

def get_tag_value(node):
    """retrieves value of given XML node
    parameter:
    node - node object containing the tag element produced by minidom

    return:
    content of the tag element as string
    """

    xml_str = node.toxml() # flattens the element to string

    # cut off the base tag to get clean content:
    start = xml_str.find('>')
    if start == -1:
        return ''
    end = xml_str.rfind('<')
    if end < start:
        return ''

    return xml_str[start + 1:end]

Just pass the node you want the value of to the function and it will give you back the value as a string, including any internal markup.

I place this code in the public domain, which means you can use it anywhere any way you want with no strings attached.

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s