Remove HTML tags

In terms of efficiency and performance, which of the following is the best way to remove HTML tags from a string?

Option 1:

string ss = "<b><i>The tag is about to be removed</i></b>";
// Verbatim string literal: "\<" is not a valid escape in a regular C# string.
// The pattern matches "<", any run of characters other than ">", then ">".
Regex regex = new Regex(@"<[^>]*>");
Response.Write(String.Format("<b>Before:</b>{0}", ss)); // HTML text
Response.Write("<br/>");
ss = regex.Replace(ss, String.Empty);
Response.Write(String.Format("<b>After:</b>{0}", ss)); // Plain text output

Source

Option 2:

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    private static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        // The output can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false; // true while between '<' and '>'

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

Source

    
asked by anonymous 23.11.2015 / 20:13

2 answers

I made a Fiddle for the first case. The timings were:

Compile:    0.062s
Execute:    0s
Memory :    8kb
CPU    :    0.047s

I made a Fiddle for the second case. For the HtmlRemoval.StripTagsRegex() method, the timings were:

Compile:    0.109s
Execute:    0s
Memory :    16kb
CPU    :    0.094s

For the HtmlRemoval.StripTagsRegexCompiled() method, the times were:

Compile:    0.063s
Execute:    0.031s
Memory :    16kb
CPU    :    0.109s

For the HtmlRemoval.StripTagsCharArray() method, the times were:

Compile:    1.969s
Execute:    0.016s
Memory :    16kb
CPU    :    0.703s

Conclusion

All of the approaches are equally effective at removing the tags.

The first was the fastest in these runs, but it is not as well organized as the second.

My tests did not consider very large strings. For small strings, the test works well. For larger strings, it would be worth setting up other criteria and other tests.
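
As a rough idea of what such a test for larger strings could look like, here is a minimal sketch that times a compiled regex against the char-array loop from the question on a big input. The class name, helper names, input size and iteration count are all assumptions chosen for illustration, and Stopwatch timings like these are only indicative.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical benchmark harness; none of these names come from the question.
public static class StripBenchmark
{
    // Same pattern as Option 1, compiled once and reused.
    private static readonly Regex TagRegex = new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string StripRegex(string source)
    {
        return TagRegex.Replace(source, string.Empty);
    }

    // Same logic as StripTagsCharArray from the question.
    public static string StripCharArray(string source)
    {
        char[] buffer = new char[source.Length];
        int count = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { buffer[count++] = c; }
        }
        return new string(buffer, 0, count);
    }

    // Builds a large input by repeating the sample fragment (assumed test data).
    public static string BuildInput(int repetitions)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < repetitions; i++)
        {
            sb.Append("<b><i>The tag is about to be removed</i></b>");
        }
        return sb.ToString();
    }

    // Calls the function once as a warm-up, then times the requested iterations.
    public static TimeSpan Measure(Func<string, string> strip, string input, int iterations)
    {
        strip(input);
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        string input = BuildInput(10000); // assumed size, adjust as needed
        Console.WriteLine("Regex:      " + Measure(StripRegex, input, 100));
        Console.WriteLine("Char array: " + Measure(StripCharArray, input, 100));
    }
}
```

Running each method many times over the same large input, after a warm-up call, keeps one-off costs such as JIT compilation and regex construction out of the measured time.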

    
23.11.2015 / 20:31

If performance is the priority, you can also remove the tags without using regular expressions, which greatly improves performance. Here is an initial (simple) version of the code:

link

Test results:

 Compile:   0.189s 
 Execute:   0s 
 Memory:    0b 
 CPU:       0.016s

It does not apply exactly the same rule as the regular expression \<[^\>]*\>, because it removes tags only if both tags are present.
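
To show how a non-regex approach can end up applying a different rule than the regular expression, here is a small sketch. It does not reproduce the linked code; it compares the regex from the question with the StripTagsCharArray logic (also from the question) on an input containing a '<' that is never closed. The class name and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comparison class; not the code behind the link above.
public static class RuleComparison
{
    // Removes only complete "<...>" sequences.
    public static string StripRegex(string source)
    {
        return Regex.Replace(source, "<[^>]*>", string.Empty);
    }

    // Same logic as StripTagsCharArray from the question:
    // everything after a '<' is skipped until a '>' is seen.
    public static string StripCharArray(string source)
    {
        char[] buffer = new char[source.Length];
        int count = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { buffer[count++] = c; }
        }
        return new string(buffer, 0, count);
    }

    public static void Main()
    {
        string input = "a < b"; // a lone '<' with no closing '>'
        Console.WriteLine("[" + StripRegex(input) + "]");     // prints "[a < b]"
        Console.WriteLine("[" + StripCharArray(input) + "]"); // prints "[a ]"
    }
}
```

The regex leaves the text untouched because there is no complete tag to match, while the character loop drops everything after the unclosed '<'. Differences like this are worth checking before swapping one approach for the other.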

26.11.2015 / 11:21