How to remove an element from an XML with Python?

7
I have a file produced by a Garmin device (a GPS unit for physical exercise) and I want to remove all the heart-rate fields before passing the file to an athlete who did the workout with me. The file is in GPX format and looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" ...>
  <metadata>...</metadata>
  <trk>
    <trkseg>
      <trkpt lon="00" lat="00">
        <ele>000</ele>
        <time>2014-01-01T00:00:00.000Z</time>
        <extensions>
          <gpxtpx:TrackPointExtension>
            <gpxtpx:hr>99</gpxtpx:hr>
          </gpxtpx:TrackPointExtension>
        </extensions>
      </trkpt>
      ....
      <trkpt ...>
        ...
        <extensions>
          ...
        </extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>

The device basically generates a <trkpt> element at each reading (geographical + physiological + other devices). I need to remove every <extensions> element inside <trkpt>, along with all of its content. I tried using the ElementTree library with the following code:

import xml.etree.ElementTree as ET
tree = ET.parse('input.gpx')
root = tree.getroot()
for trkpt in root[1][2].iter('{http://www.topografix.com/GPX/1/1}trkpt'):
  ext = trkpt.find('{http://www.topografix.com/GPX/1/1}extensions')
  root.remove(ext)
tree.write('output.gpx')

The code even removes the elements, but I did not like 3 things here:

The first is that the library prepends the XML namespace URI to the element names. I lost a lot of time without understanding why my code did not find the elements...

The second is this root[1][2] to get a reference to the parent of the elements that I want to remove. I would like to be able to access the elements directly by calling root.iter('{...}extensions').

And finally, the most serious problem: when writing the result to the file, I noticed that the library renames the tags, which breaks the original format. The result looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<ns0:gpx ...>
  <ns0:metadata>...</ns0:metadata>
  <ns0:trk>...</ns0:trk>
</ns0:gpx>

As I have no experience with this library, perhaps I missed some configuration option in my superficial reading of the documentation. So I am looking for a solution to my problem with this or another library.

    
asked by anonymous 09.01.2014 / 19:44

4 answers

4

I followed the hint left in the question comments and solved the problem using the BeautifulSoup 4 library (thanks, @thiago-silva):

from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed
with open('input.gpx') as source:
    soup = BeautifulSoup(source, 'xml')

# Detach every <extensions> element from the tree
for ext in soup.find_all('extensions'):
    ext.extract()

with open('output.gpx', 'w') as output:
    output.write(soup.prettify())
    
10.01.2014 / 19:13
2

I recommend the lxml library for its performance and simplicity:

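from lxml import etree

gpx = etree.parse('input.gpx')

# GPX 1.1 elements live in a namespace, so XPath needs a prefix mapping
ns = {'gpx': 'http://www.topografix.com/GPX/1/1'}

for node in gpx.xpath('//gpx:trkpt/gpx:extensions', namespaces=ns):
    node.getparent().remove(node)

gpx.write('output.gpx', xml_declaration=True, encoding='UTF-8')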

I used XPath to simplify things.
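
If lxml is not available, the standard library's ElementTree supports a limited XPath subset that is enough for this case. The sketch below is a rough equivalent and not part of the original answer: the inline sample document and the helper name strip_extensions are made up for illustration, and the gpx prefix in the mapping is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the question; the 'gpx' prefix is arbitrary.
NS = {'gpx': 'http://www.topografix.com/GPX/1/1'}

# Tiny made-up sample standing in for a real GPX file.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lon="0" lat="0"><ele>0</ele><extensions><hr>99</hr></extensions></trkpt>
    <trkpt lon="1" lat="1"><ele>1</ele><extensions><hr>98</hr></extensions></trkpt>
  </trkseg></trk>
</gpx>"""


def strip_extensions(root):
    """Remove <extensions> children from every <trkpt>; return how many."""
    removed = 0
    # findall returns a list, so removing inside the loop is safe.
    for trkpt in root.findall('.//gpx:trkpt', NS):
        for ext in trkpt.findall('gpx:extensions', NS):
            trkpt.remove(ext)
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = strip_extensions(root)
print(removed_count)
```

Because each <trkpt> is the parent in hand when its child is removed, no parent lookup (such as lxml's getparent()) is needed.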

    
12.01.2014 / 05:19
1

The easiest way to manipulate XML that I have found so far is xmltodict, although that does not mean it is fast.

Here's how to use it:

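import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
<and>
<many>elements</many>
<many>more elements</many>
</and>
<plus a="complex">
element as well
</plus>
</mydocument>
""")

# Attributes are exposed under keys prefixed with '@'
print(doc['mydocument']['@has'])

# Deleting a key removes the corresponding element
del doc['mydocument']['and']

# Serialize the modified dict back to XML
print(xmltodict.unparse(doc))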

After deleting the node with del, call unparse() and it generates the XML again!

    
29.01.2014 / 15:24
0

I ran a test with your code and the 'extensions' elements were not removed (maybe because they are not direct children of root?). Anyway, the only difference I noticed is that your source file is encoded in UTF-8 while your output is written in ASCII (according to the ElementTree documentation, the default encoding of the write method is US-ASCII). Try writing with UTF-8 encoding and see whether the result is more appropriate.
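
The encoding difference mentioned above can be checked in isolation. This is a small sketch using a made-up one-element document and in-memory buffers instead of files: with no encoding argument, write() falls back to US-ASCII and escapes non-ASCII characters as numeric references, while an explicit UTF-8 encoding keeps the raw bytes. The XML declaration is requested explicitly here, since whether it is emitted by default varies between Python versions.

```python
import io
import xml.etree.ElementTree as ET

# Made-up element containing one non-ASCII character.
tree = ET.ElementTree(ET.fromstring('<name>caf\u00e9</name>'))

# No encoding argument: ElementTree falls back to US-ASCII.
default_buf = io.BytesIO()
tree.write(default_buf)
default_out = default_buf.getvalue()

# Explicit UTF-8, with the declaration requested explicitly.
utf8_buf = io.BytesIO()
tree.write(utf8_buf, encoding='UTF-8', xml_declaration=True)
utf8_out = utf8_buf.getvalue()

print(default_out)
print(utf8_out)
```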

The code I used here (which did remove the desired items) looks like this:

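import xml.etree.ElementTree as ET

# Tags are stored with their namespace URI in braces
EXTENSIONS = '{http://www.topografix.com/GPX/1/1}extensions'

tree = ET.parse('input.gpx')

for node in tree.iter():
    # Iterate over a copy, since removing while iterating skips siblings
    for child in list(node):
        if child.tag == EXTENSIONS:
            node.remove(child)

tree.write('output.gpx', encoding='UTF-8', xml_declaration=True)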
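
The ns0: prefixes reported in the question appear because ElementTree invents a prefix for the GPX namespace when it serializes. Registering that namespace with an empty prefix before writing should keep the original unprefixed tag names. What follows is a sketch under assumptions: the sample document is invented, and TPX_NS is a placeholder URI standing in for the real Garmin extension namespace, which the question does not show.

```python
import io
import xml.etree.ElementTree as ET

GPX_NS = 'http://www.topografix.com/GPX/1/1'  # URI from the question
TPX_NS = 'http://example.com/tpx'  # placeholder, not the real Garmin URI

# Invented sample with one track point carrying a heart-rate extension.
SAMPLE = (
    '<gpx xmlns="%s" xmlns:gpxtpx="%s" version="1.1"><trk><trkseg>'
    '<trkpt lon="0" lat="0"><ele>0</ele><extensions>'
    '<gpxtpx:TrackPointExtension><gpxtpx:hr>99</gpxtpx:hr>'
    '</gpxtpx:TrackPointExtension></extensions></trkpt>'
    '</trkseg></trk></gpx>' % (GPX_NS, TPX_NS)
)

# Serialize the GPX namespace as the default one instead of ns0.
# Note that this registration is global to the ElementTree module.
ET.register_namespace('', GPX_NS)

tree = ET.ElementTree(ET.fromstring(SAMPLE))
for parent in tree.iter():
    # Iterate over a copy so removal does not skip siblings.
    for child in list(parent):
        if child.tag == '{%s}extensions' % GPX_NS:
            parent.remove(child)

buf = io.BytesIO()
tree.write(buf, encoding='UTF-8', xml_declaration=True)
output = buf.getvalue().decode('utf-8')
print(output)
```

With a real file, the same register_namespace call would go before tree.write('output.gpx', ...).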
    
09.01.2014 / 20:57