Regex in Word XML


I have XML that was extracted from a docx file, in this format:

<w:p w:rsidR="00AE2D8E" w:rsidRPr="00AE2D8E" w:rsidRDefault="00AE2D8E">
    <w:pPr>
        <w:rPr>
            <w:lang w:val="en-US"/>
        </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t xml:space="preserve">Lorem ipsum dolor sit </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>amet</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t xml:space="preserve"> </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>consecteur</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>.</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
</w:p>

The text in the docx is "Lorem ipsum dolor sit amet consecteur.", but in the XML it ends up split across several runs because of differences in font, bold, etc.

The problem is that I need to replace the text "Lorem ipsum dolor sit amet consecteur." with some other text.

Does anyone know how to do this with a regex? Is it possible? If not, what other option would be viable?

EDIT: My goal is to replace the text "Lorem ipsum dolor sit amet consecteur." with other text. The problem is that the docx XML inserts formatting tags in the middle of it. The regex I have is:

\bLorem ipsum dolor sit amet consecteur.\b

This regex does not find the phrase because of the markup in the middle. Ideally, it should match and replace the phrase while ignoring that markup.

    
asked by anonymous 27.08.2018 / 22:04

1 answer


The best way to capture the text in your XML is to use the tag delimiters as the capture boundaries: capture whatever lies outside the tags, starting right after the > that closes one tag and stopping at the < that opens the next.

The following regex does just that:

>([A-zÀ-ÿ.,:?! ]{1,})<|>([ A-zÀ-ÿ.,:?!]{1,})\n

You can see how this regex works here.
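
As a quick way to try the pattern locally, here is a minimal Python sketch. The pattern is copied unchanged from above; the names TEXT_BETWEEN_TAGS, SAMPLE and extract_fragments are made up for this illustration, and SAMPLE is a shortened copy of the runs from the question with most run properties left out.

```python
import re

# The answer's pattern, copied unchanged.
TEXT_BETWEEN_TAGS = re.compile(r">([A-zÀ-ÿ.,:?! ]{1,})<|>([ A-zÀ-ÿ.,:?!]{1,})\n")

# Shortened copy of the question's paragraph (hypothetical test input).
SAMPLE = (
    '<w:p>\n'
    '    <w:r>\n'
    '        <w:t xml:space="preserve">Lorem ipsum dolor sit </w:t>\n'
    '    </w:r>\n'
    '    <w:proofErr w:type="spellStart"/>\n'
    '    <w:r>\n'
    '        <w:rPr><w:b/></w:rPr>\n'
    '        <w:t>amet</w:t>\n'
    '    </w:r>\n'
    '    <w:r>\n'
    '        <w:t xml:space="preserve"> </w:t>\n'
    '    </w:r>\n'
    '    <w:r>\n'
    '        <w:t>consecteur</w:t>\n'
    '    </w:r>\n'
    '    <w:r>\n'
    '        <w:t>.</w:t>\n'
    '    </w:r>\n'
    '</w:p>\n'
)


def extract_fragments(xml):
    """Return every piece of text the pattern finds between tags, in order."""
    fragments = []
    for match in TEXT_BETWEEN_TAGS.finditer(xml):
        # Exactly one of the two groups is set, depending on the alternative.
        text = match.group(1) if match.group(1) is not None else match.group(2)
        fragments.append(text)
    return fragments


fragments = extract_fragments(SAMPLE)
print(fragments)
print("".join(fragments))
```

Joining the fragments gives back the sentence as it appears in the document, which is what makes a comparison against the phrase from the question possible.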

Regex explanation:

  • >([A-zÀ-ÿ.,:?! ]{1,})< - capture starts right after the character > that closes a tag. The capture group ([A-zÀ-ÿ.,:?! ]{1,}) then takes 1 or more letters (accented ones included), spaces or the listed punctuation marks, provided those characters are followed by the opening of another tag, that is, by < . Note that the range A-z also covers the ASCII characters [ \ ] ^ _ ` that sit between Z and a, and that digits are not matched.
  • | - is the OR operator: the regex accepts either the pattern before it or the pattern after it.
  • >([ A-zÀ-ÿ.,:?!]{1,})\n - does the same thing as group 1, but its closing delimiter is the line break, for cases where the text is the last thing on the line and the next tag only opens on the following line.
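
To go from capturing to the replacement asked about in the question, one possible approach (an assumption on my part, not something tested against Word itself) is to put the new text into the first captured fragment and empty the remaining ones. The sketch below is in Python; replace_paragraph_text and SAMPLE are hypothetical names, and the function expects the XML of a single paragraph, because every matched fragment in its input gets rewritten.

```python
import re
from xml.sax.saxutils import escape

# Same pattern as in the answer, copied unchanged.
PATTERN = re.compile(r">([A-zÀ-ÿ.,:?! ]{1,})<|>([ A-zÀ-ÿ.,:?!]{1,})\n")

# Shortened copy of the question's paragraph (hypothetical test input).
SAMPLE = (
    '<w:p>\n'
    '    <w:r><w:t xml:space="preserve">Lorem ipsum dolor sit </w:t></w:r>\n'
    '    <w:r><w:rPr><w:b/></w:rPr><w:t>amet</w:t></w:r>\n'
    '    <w:r><w:t xml:space="preserve"> </w:t></w:r>\n'
    '    <w:r><w:t>consecteur</w:t></w:r>\n'
    '    <w:r><w:t>.</w:t></w:r>\n'
    '</w:p>\n'
)


def replace_paragraph_text(paragraph_xml, old_text, new_text):
    """Replace old_text with new_text if it is the paragraph's whole text."""
    fragments = [m.group(1) or m.group(2) for m in PATTERN.finditer(paragraph_xml)]
    if "".join(fragments) != old_text:
        return paragraph_xml  # phrase not found: leave the XML untouched

    first_done = False

    def swap(match):
        nonlocal first_done
        # Keep whichever delimiter ended the original match.
        closing = "<" if match.group(1) is not None else "\n"
        # The first fragment receives the escaped new text, the rest go empty.
        text = "" if first_done else escape(new_text)
        first_done = True
        return ">" + text + closing

    return PATTERN.sub(swap, paragraph_xml)


print(replace_paragraph_text(
    SAMPLE, "Lorem ipsum dolor sit amet consecteur.", "Hello & goodbye"))
```

With this approach the whole replacement inherits the formatting of the first run, and the later runs stay in place with empty text elements. If the phrase can start or end in the middle of a run, or if the text contains characters outside the character class, an XML parser is likely the more reliable option.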
31.08.2018 / 02:53