Remove whitespace and line breaks from a string

-1

I'm building a Web API that generates XML. The XML is read several times a day, so on the first request the API serializes the whole object and saves the result to disk; for the next 24 hours it reads from disk instead of serializing the entire object again.

I do this because the endpoint gets many requests, the XML is large (some files reach 300 MB), and the information can be cached for 24 hours.

The problem is the description field. I believe it could be 'compressed', or better, the XML could be minified before it is written to disk. For now I'm only trying to remove whitespace and line breaks from this field, which should already save a good few megabytes.

I use Web API in C#, Redis, and MSSQL.

Today I'm sending it like this:

    <description><![CDATA[SOBRADO

Área Terreno: 8 x 28
Área Construída: 170m&sup2;

Pavimento Superior:
2 dormitórios sendo 1 dormitorio com armario embutido planejado e um maste
banheiro
jardim de inverno
sacada


Pavimento Térreo:
2 salas
Copa
Cozinha
Corredor lateral
jardim na frente
quintal

Edícula:
1 dormitórios
banheiro
lavanderia
deposito

4 vagas

IPTU R$ 1.200,00 anual]]></description>

I would like to send this:

<description><![CDATA[SOBRADO Área Terreno: 8 x 28    Área Construída: 170m&sup2;...

I use two functions to try to clean up the text, but the result is not quite what it should be.

description = Biblioteca.RemoveTroublesomeCharacters(Biblioteca.CorrigeDescricao(imovel.Descricao)),

internal static string CorrigeDescricao(string descricao)
{
    var tab = '\u0009';
    descricao = descricao.Replace("  ", " ");
    descricao = descricao.Replace("=\r\n", "");
    descricao = descricao.Replace(";\r\n", "");
    descricao = descricao.Replace("\t", " ");
    descricao = descricao.Replace(tab.ToString(), "");
    return RemoveHtml(descricao);
}

and

 internal static string RemoveTroublesomeCharacters(string inString)
        {
            if (inString == null) return null;

            var newString = new StringBuilder();
            char ch;

            for (int i = 0; i < inString.Length; i++)
            {
                ch = inString[i];
                // remove any characters outside the valid UTF-8 range as well as all control characters
                // except tabs and new lines
                //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
                //if using .NET version prior to 4, use above logic
                if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
                {
                    newString.Append(ch);
                }
            }
            return newString.ToString();
        }
    
asked by anonymous 05.05.2016 / 21:09

3 answers

2

Dorathoto, I believe that rather than removing the spaces from the string, you should compress the entire response.

The easiest way to do that without configuring it directly in IIS is to install the following NuGet package: Microsoft ASP.NET Web API Compression

PM> Install-Package Microsoft.AspNet.WebApi.Extensions.Compression.Server

Then add the following configuration to the startup code of your Web API:

GlobalConfiguration.Configuration.MessageHandlers.Insert(0, 
    new ServerCompressionHandler(
        new GZipCompressor(), 
        new DeflateCompressor())); 
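
The handler above compresses the HTTP response. Since the question is also about the size of the cached file, the same idea could be applied on disk. The following is only a sketch using `GZipStream` from `System.IO.Compression`; the class name `CacheCompression` and its two methods are hypothetical and not part of the NuGet package.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class CacheCompression
{
    // Gzips an XML string so the cached copy takes less space on disk.
    public static byte[] Compress(string xml)
    {
        byte[] raw = Encoding.UTF8.GetBytes(xml);
        using (var output = new MemoryStream())
        {
            // The GZipStream must be disposed before reading the buffer,
            // otherwise the gzip footer is not written.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Restores the original XML string from the gzipped bytes.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Repeated whitespace and line breaks compress very well, so this may make trimming the description field by hand unnecessary.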


    
05.05.2016 / 21:53
2

NEVER treat XML as text. There are terms for those who do that which, although they are technical terms and are even used in books, would get me banned from here if I used them ;)

Instead, encapsulate everything you want to have in the XML as a serializable object. Then use the framework's XML classes to generate the XML when writing it, or to read it from a file. This will not only keep the XML compact, but also ensure it is well formed and save you hours of development.

Start with this class: XmlWriter.
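
As a minimal sketch of that suggestion, the description element from the question could be written with `XmlWriter` and indentation turned off. The class name `DescriptionXml` and the method `Write` are hypothetical; only the `<description>` element and its CDATA section come from the question.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper, shown only to illustrate XmlWriter usage.
public static class DescriptionXml
{
    // Writes <description><![CDATA[...]]></description> without indentation.
    public static string Write(string description)
    {
        var settings = new XmlWriterSettings
        {
            Indent = false,            // no line breaks or indentation between elements
            OmitXmlDeclaration = true  // this is a fragment, so skip the <?xml ...?> header
        };

        var sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("description");
            writer.WriteCData(description);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

Note that the writer only controls whitespace between elements. It does not change the text inside the CDATA section, so line breaks in the description itself still have to be normalized before they are written.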

    
05.05.2016 / 21:51
2

You can solve the problem of excess line breaks and spaces with a regex:


Removing excesses

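pattern : (\s)\s+
replace : $1
  

It captures the first whitespace character of any run of two or more and replaces the whole run with that character. A quantified group such as (\s){2,} would not work here, because $1 would then hold the last character of the run rather than the first.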

Example

'teste de quebra    '
'de linha     '

Applying would look like this:

'teste de quebra de linha '

because it matched ' \n' and replaced it with ' ', since ' ' was found first.
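
A sketch of this replacement in C#, assuming the goal is to keep the first whitespace character of each run. It uses the pattern `(\s)\s+` because in .NET a quantified capturing group keeps only its last iteration, so `$1` would otherwise hold the last whitespace character of the run. The class name `WhitespaceCollapser` is hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class WhitespaceCollapser
{
    // Group 1 is the first whitespace character of a run; \s+ consumes the rest.
    private static readonly Regex Runs = new Regex(@"(\s)\s+");

    // Replaces every run of two or more whitespace characters with its first character.
    public static string Collapse(string text)
    {
        if (text == null) return null;
        return Runs.Replace(text, "$1");
    }
}
```

Calling `WhitespaceCollapser.Collapse("teste de quebra    \nde linha     ")` gives `"teste de quebra de linha "`, which matches the example above.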

Removing line breaks

pattern : (\n){2,}
replace : $1
  

This is similar, but not the same, because it considers only line breaks. It may be necessary to change it to (\r?\n){2,}, since some Windows editors still add a carriage return before the line feed.

Example

'quebra de linha     '

'em duas     '

Applying looks like this:

'quebra de linha'
'em duas'
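
A sketch of the line-break variant in C#, assuming both Unix (`\n`) and Windows (`\r\n`) line endings should be handled. The class name `LineBreakCollapser` is hypothetical. This pattern only merges consecutive line breaks; it does not trim spaces at the end of a line.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class LineBreakCollapser
{
    // Matches two or more consecutive line breaks, each with an optional carriage return.
    private static readonly Regex BlankLines = new Regex(@"(\r?\n){2,}");

    // Replaces each run of line breaks with a single one
    // ($1 holds the last line break captured in the run).
    public static string Collapse(string text)
    {
        if (text == null) return null;
        return BlankLines.Replace(text, "$1");
    }
}
```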
    
05.05.2016 / 23:11