Convert string with HTML tags to an array

4

Consider the following string:

var texto = 'Esse é um texto de <span class="red"> teste </span>';

I need to split the string into an array on spaces, that is:

var palavras = texto.split(" ");

The problem is that the text contains HTML and in this case the resulting array will be:

palavras[0] = 'Esse';
palavras[1] = 'é';
palavras[2] = 'um';
palavras[3] = 'texto';
palavras[4] = 'de';
palavras[5] = '<span';
palavras[6] = 'class="red">';
palavras[7] = 'teste';
palavras[8] = '</span>';

But I need the resulting array to be as follows:

palavras[0] = 'Esse';
palavras[1] = 'é';
palavras[2] = 'um';
palavras[3] = 'texto';
palavras[4] = 'de';
palavras[5] = '<span class="red"> teste </span>';

How can I do this using JavaScript?

    
asked by anonymous 28.08.2018 / 01:15

3 answers

4

You can use DOMParser to parse the HTML text. From there, just manipulate the resulting document to get the elements you need:

// parse the HTML fragment
var texto = 'Esse é um texto de <span class="red"> teste </span>';
var parser = new DOMParser();
// creates a document with html, head, body, etc.
var htmlDoc = parser.parseFromString(texto, "text/html");

// get the body of the HTML
var body = htmlDoc.querySelector('body');

// get the span element
var span = body.querySelector('span');
// remove the span so that only the text is left
body.removeChild(span);
// split the text into an array
var palavras = body.innerHTML.trim().split(' ');
// add the span to the array
palavras.push(span.outerHTML);

console.log(palavras);

This code is very specific to the text you posted. If you have other tags in other positions, you will need to adjust it accordingly.
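As a rough idea of what such an adjustment could look like, here is a minimal sketch that walks the child nodes in order instead of pulling out a single span. The helper name `splitNodes` is hypothetical (it is not part of the answer's code), and it only handles top-level nodes: text nodes are split into words, and each element is kept whole through its `outerHTML`.

```javascript
// Hypothetical helper (not from the original answer): takes a list of DOM-like
// nodes and returns an array where text is split into words and each element
// is kept as a single item.
function splitNodes(nodes) {
  var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
  var TEXT_NODE = 3;    // same value as Node.TEXT_NODE
  var palavras = [];

  // forEach.call works for both real NodeLists and plain arrays
  Array.prototype.forEach.call(nodes, function (node) {
    if (node.nodeType === TEXT_NODE) {
      // split on any run of whitespace and skip the empty pieces
      node.nodeValue.split(/\s+/).forEach(function (palavra) {
        if (palavra) {
          palavras.push(palavra);
        }
      });
    } else if (node.nodeType === ELEMENT_NODE) {
      // keep the whole element, tags included
      palavras.push(node.outerHTML);
    }
  });

  return palavras;
}

// Browser-only usage; skipped where DOMParser is not defined (e.g. plain Node.js)
if (typeof DOMParser !== 'undefined') {
  var textoExemplo = 'Esse é um texto de <span class="red"> teste </span>';
  var docExemplo = new DOMParser().parseFromString(textoExemplo, 'text/html');
  console.log(splitNodes(docExemplo.body.childNodes));
}
```

Because the nodes are visited in document order, tags at the start or in the middle of the text stay in their original position in the array. Comments and other node types are simply ignored in this sketch.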

You can also use jQuery's parseHTML function. The idea is the same: parse the text and extract the elements you need.

var texto = 'Esse é um texto de <span class="red"> teste </span>';
var html = $.parseHTML(texto);

var palavras;
$.each(html, function (i, el) {
    if (el.nodeName === '#text') {
        palavras = el.nodeValue.trim().split(' ');
    } else if (el.nodeName === 'SPAN') {
        palavras.push(el.outerHTML);
    }
});

console.log(palavras);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

Again, the code is very specific to your case because it expects text followed by a span. Adapt it to other cases if necessary.

Regex, while very cool, is not always the best solution, especially for HTML parsing.

If specific parsers already exist for the type of data you are manipulating, it is preferable to use them.

    
28.08.2018 / 02:21
3

With the same idea presented by hotspot , you can enter the contents of your string arrow function in for replaced with an anonymous function.

    
28.08.2018 / 02:34
0

I was able to solve the problem. I do not know if it's the best approach but it's a solution that works. I look forward to suggestions.

Before splitting on spaces, I stored all the tags and their contents in an array and replaced each one in the string with a placeholder holding its index:

var texto = 'Esse é um texto de <span class="red"> teste </span>';

// look for HTML tags together with their contents
// output: ['<span class="red"> teste </span>']
var tags = texto.match(/<(.*?)>.*?<\/(.*?)>/g);

// if there are any tags, replace each one with its index placeholder
if (tags) {
    tags.forEach(function(tag, index) {
      texto = texto.replace(tag, "{"+index+"}");
    });
}

The result of the string before separating by space will be:

texto = 'Esse é um texto de {0}';
var palavras = texto.split(" ");

The result after separating by space will be the following array:

palavras[0] = 'Esse';
palavras[1] = 'é';
palavras[2] = 'um';
palavras[3] = 'texto';
palavras[4] = 'de';
palavras[5] = '{0}';

Now just replace each element that holds an index (element 5 here) with its tag:

palavras.forEach(function(palavra, index) {
    // check whether the element is exactly an index placeholder such as {0}
    // and, if so, replace it with the corresponding tag
    if (/^\{[0-9]+\}$/.test(palavra))
        palavras[index] = tags[parseInt(palavra.slice(1, -1), 10)];
});

The result will then be:

palavras[0] = 'Esse';
palavras[1] = 'é';
palavras[2] = 'um';
palavras[3] = 'texto';
palavras[4] = 'de';
palavras[5] = '<span class="red"> teste </span>';
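
For comparison, the same result can probably be reached in a single pass, without placeholders. This is only a sketch under some assumptions: the function name `splitKeepingTags` is made up for illustration, tags are paired and not nested inside a tag of the same name, and each tag is separated from the surrounding words by whitespace.

```javascript
// Hypothetical one-pass variant (not the code from the answer above).
function splitKeepingTags(texto) {
  // Alternative 1: an opening tag, its contents (lazily), and the closing tag
  //                with the same name, enforced by the \1 backreference.
  // Alternative 2: any run of non-whitespace characters (a plain word).
  var tokenRe = /<([a-z][\w-]*)\b[^>]*>[\s\S]*?<\/\1>|\S+/gi;

  // match() returns null when nothing is found, so fall back to an empty array
  return texto.match(tokenRe) || [];
}

var textoTeste = 'Esse é um texto de <span class="red"> teste </span>';
console.log(splitKeepingTags(textoTeste));
```

The alternation order matters: the tag pattern is tried first at each position, so a whole element is consumed before the word pattern gets a chance to split it. Like the placeholder version, this remains a regex over HTML and will break on more complex markup.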
    
28.08.2018 / 02:13