Regex - Remove an opening tag identified by the start of its class name, plus the last closing tag

I have an HTML string and I need to remove two things from it: everything from the first occurrence of <div class="c up to the first > that closes that tag, and the last closing </div>. The opening tag has to be matched this way because the class of this div is generated dynamically; only its first character stays the same.

For example: <div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div> should be transformed into <p class="auto">Testing 123...</p>

I tried the following, but it removes the entire string:

var testString = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>'
var result = testString.replace(/\<div\_c.*\>/, '');

Edited

If the string has a line break, the solution no longer works:

// A template literal is used so the string can contain a real line break.
var testString = `<div class="c892"><h3>Test title</h3>
Description after a line break.</div>`;
var result = testString.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(result);

As Pedro pointed out in his answer, it was enough to use [\s\S] in place of the dot inside the capturing group, which gives the following:

var result = testString.replace(/<div class="c.*?>([\s\S]*?)<\/div>/, '$1');
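
To make the difference concrete, here is a small sketch comparing the two patterns on a string with a line break. The sample text and the variable names are only illustrative. By default the dot does not match line-break characters, so the first pattern finds nothing and the string comes back unchanged, whereas [\s\S] matches any character.

```javascript
// Illustrative input containing a line break ("\n").
const multiLine = '<div class="c892"><h3>Title</h3>\nText after a line break.</div>';

// "." does not match "\n" by default, so this pattern never reaches </div>.
const withDot = multiLine.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

// [\s\S] matches any character, line breaks included.
const withCharClass = multiLine.replace(/<div class="c.*?>([\s\S]*?)<\/div>/, '$1');

console.log(withDot === multiLine); // true: nothing was replaced
console.log(withCharClass);
```

In engines that support the ES2018 s (dotAll) flag, writing the first pattern with that flag should behave like the [\s\S] version, but I have not relied on it here.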
    
asked by anonymous 01.02.2018 / 14:50

1 answer

Although we all know the classic answer given to people who try to parse HTML with regular expressions, the next answer to that same question adds an interesting point.

For one-off cases where I just need to extract or tweak some data in a piece of HTML text, it is often much faster and more practical to write a regular expression that does the job than to use an HTML parser. I have no problem using regex in this kind of situation.

With that clarified, here is the answer:

var testString = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>'
var result = testString.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(result);

The regular expression itself:

<div class="c.*?>(.*?)<\/div>

Explanation:

  • <div class="c.*?> - Here a lazy quantifier ( .*? ) is used to match the start of the pattern and stop at the first > , which closes the opening tag.
  • (.*?)<\/div> - A lazy quantifier is used again, this time inside a capturing group, followed by the closing tag of the div .
  • Lastly, replace() keeps only group 1 of the match, referenced by the $1 marker.
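
A small sketch of why the lazy quantifier matters here. The greedy pattern below is my own illustration, not the pattern from the question: with .* the match runs to the last > in the string, so everything is removed, while .*? stops at the first > .

```javascript
// Input taken from the question.
const html = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>';

// Greedy: ".*" extends to the LAST ">" in the string, so the whole input is matched.
const greedy = html.replace(/<div class="c.*>/, '');

// Lazy: ".*?" stops at the FIRST ">", so only the opening tag is matched,
// and "$1" keeps what the capturing group found before </div>.
const lazy = html.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(JSON.stringify(greedy)); // ""
console.log(lazy);                   // <p class="auto">Testing 123...</p>
```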

Update

According to the OP, the desired answer was actually a different one, since there are situations where the closing </div> does not appear (which was not specified in the question).

Solution 2:

<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?

This regular expression was adjusted to handle the new situation and also the possibility of line breaks.

Demo: regex101.com

Explanation:

  • <div class="c.*?> - This is the start of the specified pattern. It matches any text up to the first > , which closes the opening tag.
  • (((?!<\/div>)[\s\S])*) - This one is a bit trickier. The (?!<\/div>) pattern is a negative lookahead: it asserts that the text at the current position does not start with <\/div> . Only then do I consume the next character, whatever it is: [\s\S] matches any character, whitespace or not, line breaks included. The check has to come before the character is consumed, because with the opposite order ( [\s\S](?!<\/div>) ) the last character before the closing tag would be left out of the capture as well. (You can see this happen by changing the regex101 demo.) Finally, I put this inside a capturing group and repeat it zero or more times, resulting in: (((?!<\/div>)[\s\S])*) .
  • (<\/div>)? - Finally, I match the closing tag of the div , marking it as optional with the ? quantifier. That way, even if the closing tag does not exist, there is no problem at all.
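
Here is a minimal sketch that exercises Solution 2 against the three situations discussed: a single line, a line break, and a missing closing tag. The sample strings and variable names are illustrative only. The last two lines compare the two possible orderings of the lookahead on a tiny input, to show the lost character mentioned in the explanation.

```javascript
const pattern = /<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?/;

const samples = [
  '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>',
  '<div class="c892"><h3>Title</h3>\nText after a line break.</div>', // line break
  '<div class="c892"><h3>Title</h3>\nNo closing tag here',            // no </div>
];

// Group 1 is the outer capturing group, i.e. the content of the div.
const results = samples.map((s) => s.replace(pattern, '$1'));
results.forEach((r) => console.log(JSON.stringify(r)));

// Ordering check: lookahead first captures "abc"; character first captures only "ab".
const checkFirst = /<div class="c.*?>(((?!<\/div>)[\s\S])*)/.exec('<div class="c1">abc</div>')[1];
const consumeFirst = /<div class="c.*?>(([\s\S](?!<\/div>))*)/.exec('<div class="c1">abc</div>')[1];
console.log(checkFirst, consumeFirst); // abc ab
```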
01.02.2018 / 16:49