Regular expression, removing values from HTML

Question

5

I have an HTML document and need to retrieve the values from a set of <li> elements.

This is part of the HTML:

<ul id="minhas-tags">
 <li><em>Tagged: </em></li>
  <li><a href="/tags/tag1">tag1</a>, </li>
  <li><a href="/tags/tag2">tag2</a>, </li>
  <li><a href="/tags/tag3">tag3</a>, </li>
  <li><a href="/tags/tag4">tag4</a>, </li>

I want to get the content of each <li>, such as tag1, tag2, etc.

After much reading here I came up with this regular expression:

tags/[a-zA-Z]+">[a-zA-Z]+<+

This isolates the HTML I want from all the rest, but I do not know how to change the expression so that it returns only the contents of each <li>.

For example, it currently returns /tags/tag1">tag1<, and I only want tag1.

How would I do this? And could you explain how the suggested expression would work, please?

Update

Sorry, I did not mention the language. I'm using C#, and my routine is something like this:

public string retorna_Tags_HTML(string html)
{
    Regex ER = new Regex(@"tags?([\w]+)<\/a>", RegexOptions.None);
    Match m = ER.Match(html);
}

html c# regex

asked by anonymous 29.05.2015 / 02:52

2 answers

0

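// Read the markup of the list, then collect the text of every tag link.
var html = document.querySelector("#minhas-tags").innerHTML;
var conteudo = [];
// replace() is used here only to iterate over the matches; its result is discarded.
html.replace(/tag[0-9]*">([a-zA-Z0-9]*)<\/a>/gi, function(match, nome) {
  conteudo.push(nome); // nome is capture group 1: the link text
});
alert(conteudo); // tag1,tag2,tag3,tag4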

<ul id="minhas-tags">
  <li><em>Tagged: </em>
  </li>
  <li><a href="/tags/tag1">tag1</a>,</li>
  <li><a href="/tags/tag2">tag2</a>,</li>
  <li><a href="/tags/tag3">tag3</a>,</li>
  <li><a href="/tags/tag4">tag4</a>,</li>
</ul>
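
For readers wondering why the parentheses matter, here is a minimal sketch of the same idea without the DOM or the replace() trick. It assumes the tag links always have the form <a href="/tags/...">text</a>; the function name extractTags and the sample string are made up for illustration. The part of the pattern inside ( ) is capture group 1, so reading m[1] gives only the link text instead of the whole match.

```javascript
// Hypothetical helper (not part of the original answer): returns the link
// text of every <a href="/tags/..."> element found in an HTML string.
function extractTags(html) {
  // href="\/tags\/[^"]*"  matches the attribute, whatever the tag name is
  // >([^<]+)<\/a>         captures the link text (group 1) up to </a>
  const pattern = /href="\/tags\/[^"]*">([^<]+)<\/a>/g;
  // matchAll yields one match per link; index 1 holds capture group 1.
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

// Sample input modelled on the markup from the question.
const sample =
  '<ul id="minhas-tags">' +
  '<li><em>Tagged: </em></li>' +
  '<li><a href="/tags/tag1">tag1</a>, </li>' +
  '<li><a href="/tags/tag2">tag2</a>, </li>' +
  '</ul>';

console.log(extractTags(sample)); // [ 'tag1', 'tag2' ]
```

Using [^<]+ for the link text instead of [a-zA-Z0-9]* also accepts names containing hyphens or underscores; tighten it if the tag names are known to be stricter.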

29.05.2015 / 03:07


score 4 · Accepted Answer

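You can use the expression tags?\w+(?=<\/a>), which captures any word (letters a-z and A-Z, digits 0-9 and the underscore _) that comes right before </a>, using a positive lookahead (?=...). Because the pattern starts with the literal tag, it only matches names that begin with tag, as in the example.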

using System.Text.RegularExpressions;
using System.Linq;
....

string html = 
    @"<ul id=""minhas-tags"">
       <li><em>Tagged: </em></li>
        <li><a href=""/tags/tag1"">tag1</a>, </li>
        <li><a href=""/tags/tag2"">tag2</a>, </li>
        <li><a href=""/tags/tag3"">tag3</a>, </li>
        <li><a href=""/tags/tag4"">tag4</a>, </li>";

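Match[] tags = Regex.Matches(html, @"tags?\w+(?=</a>)")
                    .Cast<Match>()
                    .ToArray();

// Each match value is the link text only; the lookahead does not consume </a>.
foreach (var tag in tags) {
    Console.WriteLine(tag.Value);
}
Console.ReadLine();
// tag1
// tag2
// tag3
// tag4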


Another way would be to use an HTML parser, such as Html Agility Pack, to extract this information. See an example:

string html = 
    @"<ul id=""minhas-tags"">
       <li><em>Tagged: </em></li>
         <li><a href=""/tags/tag1"">tag1</a>, </li>
         <li><a href=""/tags/tag2"">tag2</a>, </li>
         <li><a href=""/tags/tag3"">tag3</a>, </li>
         <li><a href=""/tags/tag4"">tag4</a>, </li>";

var documento = new HtmlAgilityPack.HtmlDocument();
documento.LoadHtml(html);

foreach (var tag in documento.DocumentNode.SelectNodes("//a")) {
      Console.WriteLine(tag.InnerText);
}
Console.ReadLine();
// tag1
// tag2
// tag3
// tag4

Note: You must add a reference to Html Agility Pack in your project.