For some reason, I can never remember the proper XPath for getting all the descendant nodes (both element and text nodes). I figure if I post it on my blog, I can just look it up whenever I forget (or maybe writing it down will force it permanently into my brain). Here's the XPath expression:

    //*|//text()

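Pretty simple, huh? At first, I thought it was //*|text(), but that doesn't actually work. Neither does //text()|*. Those two XPath expressions aren't even equivalent: each side of the | is its own path, so a bare text() or * only matches direct children of the context node. The first one gives you every element but only the context node's own text children, while the second gives you every text node but only the context node's child elements.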
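To see the difference for yourself, here's a minimal sketch that runs all three expressions against the root element of a tiny document. The document and the describe() helper are made up for illustration, and it assumes Python 3 with lxml installed:

```python
from lxml import etree

# Tiny hypothetical document, just for comparing the three expressions.
root = etree.fromstring("<a>one<b>two</b>three<c>four</c></a>")


def describe(nodes):
    """Render elements as <tag> and text nodes as plain strings."""
    return [str(n) if isinstance(n, str) else "<%s>" % n.tag for n in nodes]


# Every element and every text node in the document.
everything = describe(root.xpath("//*|//text()"))

# Every element, but only the text children of the context node <a>.
elements_plus_child_text = describe(root.xpath("//*|text()"))

# Every text node, but only the element children of the context node <a>.
text_plus_child_elements = describe(root.xpath("//text()|*"))

print(everything)                # ['<a>', 'one', '<b>', 'two', 'three', '<c>', 'four']
print(elements_plus_child_text)  # ['<a>', 'one', '<b>', 'three', '<c>']
print(text_plus_child_elements)  # ['one', '<b>', 'two', 'three', '<c>', 'four']
```

The second list is missing the text inside <b> and <c>, and the third is missing <a> itself, which is why neither shortcut is a substitute for //*|//text().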
Now for an example! Let's say that you have the following XML:

    <html>
      <head>
        <title>Converting from Local Time to UTC</title>
        <link rel="stylesheet" href="../preview.css" type="text/css" />
      </head>
      <body>
        <div id="meta">
          <table>
            <tr>
              <td><b>Title:</b></td>
              <td>Converting from Local Time to UTC</td>
            </tr>
            <tr>
              <td><b>Entry Id:</b></td>
              <td>None</td>
            </tr>
            <tr>
              <td><b>Labels:</b></td>
              <td>python, utc, datetime</td>
            </tr>
          </table>
        </div>
      </body>
    </html>

Using Python's lxml module, we can write a short script that prints out all the element tags and non-whitespace strings:

    from lxml import etree

    # Parse the XML shown above (saved as temp.xml).
    tree = etree.parse('temp.xml')

    for node in tree.xpath('//*|//text()'):
        if isinstance(node, str):
            # Text nodes come back as strings; skip whitespace-only ones.
            if node.strip():
                print(repr(node.strip()))
        else:
            print('<%s>' % node.tag)

Running the above code, we get the following output:

    <html>
    <head>
    <title>
    'Converting from Local Time to UTC'
    <link>
    <body>
    <div>
    <table>
    <tr>
    <td>
    <b>
    'Title:'
    <td>
    'Converting from Local Time to UTC'
    <tr>
    <td>
    <b>
    'Entry Id:'
    <td>
    'None'
    <tr>
    <td>
    <b>
    'Labels:'
    <td>
    'python, utc, datetime'

As you can see, the XPath expression gives you the element and text nodes in the exact order that they appear in the document.
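If you only want the descendants of one particular element rather than the whole document, the same union should work with a leading dot on each side. This is a rough sketch, assuming Python 3 and lxml; the fragment below is a made-up example loosely modeled on the table above:

```python
from lxml import etree

# Hypothetical fragment, loosely modeled on the table above.
root = etree.fromstring(
    "<body><div id='meta'><b>Title:</b> UTC</div><p>ignored</p></body>"
)
div = root.find("div")

# The leading "." anchors both halves of the union at `div`,
# so only its descendants are returned (div itself is excluded).
nodes = div.xpath(".//*|.//text()")
labels = [str(n) if isinstance(n, str) else "<%s>" % n.tag for n in nodes]
print(labels)  # ['<b>', 'Title:', ' UTC']
```

Note that both sides of the | need the dot, for the same reason that both sides need the // in the document-wide version.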