Python prefix in string

-1

I wrote a script that downloads an attachment from an email. The attachment is an XML file, and I want to save it to a database. However, when I read the body of the XML, it comes with the prefix 'b', which causes an error when saving the XML to the database.

The string going to SQL ends up like this:

INSERT  INTO NFes (xml) VALUES (b'<?xml version...')

The errors are these:

  

"Operand type clash: image is incompatible with xml (206)"
"Statement(s) could not be prepared. (8180)"

I tried changing the encoding with str(xml, "utf-8"), for example, which does solve the prefix problem. But then the ODBC SQL Server Driver raises a different error: "XML parsing: line 1, character 38, unable to switch the encoding (9402) (SQLParamData)"

asked by anonymous 09.04.2018 / 20:49

2 answers

-1

My XML has an "encoding" attribute in its declaration, which is why the error message said that it was not possible to switch the encoding.

So I did a replace to remove the encoding="utf-8" part. To remove the 'b' prefix I did what I had tried before: I used str(xml, 'utf-8'). After these changes it was possible to write to the database normally!
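The two steps described above can be sketched as follows. This is only an illustration under assumptions: the helper name `xml_bytes_to_text` and the sample document are hypothetical, and a regular expression is used here in place of the plain replace so that other spellings of the attribute (single quotes, "UTF-8" in upper case) are also handled.

```python
import re

# Matches the encoding pseudo-attribute inside the XML declaration only.
# Group 1 keeps everything in the declaration that precedes it.
_ENCODING_ATTR = re.compile(r'''(<\?xml[^>]*?)\s+encoding\s*=\s*(["'])[^"']*\2''')


def xml_bytes_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the attachment bytes and drop the encoding declaration.

    Hypothetical helper: assumes the attachment really is encoded
    with `encoding` (utf-8 by default).
    """
    text = raw.decode(encoding)
    return _ENCODING_ATTR.sub(r"\1", text, count=1)


# Made-up sample standing in for the downloaded attachment.
raw = b'<?xml version="1.0" encoding="utf-8"?><nfe>ol\xc3\xa1</nfe>'
print(xml_bytes_to_text(raw))
```

The result is a str without the 'b' prefix and without the encoding declaration, which is the form that worked in this answer.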

    
11.04.2018 / 21:36
0

The "b prefix" indicates that the object you have at hand is not a text string but a sequence of bytes. In Python 3 the two things are fundamentally different, which is why you always need to know how the text is encoded in the bytes before you can turn them into characters. Nowadays it is increasingly common for text to be in the "utf-8" encoding, but some legacy systems and Windows use the "latin-1" encoding, which fits every character of the Portuguese language into a single byte.
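A small sketch of that difference, using a made-up word: the same text becomes different bytes depending on the encoding, and decoding with the wrong one either fails or produces garbage.

```python
text = "inválido"

utf8_bytes = text.encode("utf-8")      # 'á' takes two bytes
latin1_bytes = text.encode("latin-1")  # 'á' takes one byte

print(utf8_bytes)    # b'inv\xc3\xa1lido'
print(latin1_bytes)  # b'inv\xe1lido'
print(len(text), len(utf8_bytes), len(latin1_bytes))  # 8 9 8

# Decoding utf-8 bytes as latin-1 does not fail, it silently gives wrong text.
print(utf8_bytes.decode("latin-1"))    # invÃ¡lido

# Decoding latin-1 bytes as utf-8 raises an error.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)
```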

Python's "bytes" objects have a "decode" method: call it and the result is a text string (which Python displays without the 'b' prefix). The str(xml, 'utf-8') call you tried performs the same conversion, which is why the error message changed. Since Python itself did not raise an error about an invalid utf-8 string, the odds are that your XML is in utf-8, and it is now ODBC that complains about an invalid character. utf-8 can represent any Unicode character; other encodings such as latin-1 cannot. If the text contains characters from scripts such as Greek, Russian, or Hebrew, or even punctuation marks that are not defined in latin-1, an error will occur, which may well be this one.
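As a quick illustration with a made-up XML fragment, the two calls give the same string, while str() without an encoding argument only returns the printed form of the bytes object, 'b' prefix included. That last form is presumably how the prefix ended up inside the SQL text in the question.

```python
xml = b'<?xml version="1.0"?><nome>Jo\xc3\xa3o</nome>'

via_decode = xml.decode("utf-8")
via_str = str(xml, "utf-8")
print(via_decode == via_str)  # True
print(via_decode)             # <?xml version="1.0"?><nome>João</nome>

# Without an encoding, str() does not decode anything: it returns the repr.
as_repr = str(xml)
print(as_repr[:8])            # b'<?xml
```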

The remedy would be to force an encoding with escaping when passing the data to the driver, but there is another problem: the function does not accept bytes (the already encoded text). As a result, you will have to mutilate the text in Python: replace every character outside "latin-1" with "?", turn the result back into text, and then make your call. After that, if there is no other error in the XML, it should work.
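If losing those characters is not acceptable, a variation on the same idea is Python's "xmlcharrefreplace" error handler, which writes each character outside latin-1 as an XML numeric character reference instead of "?". This is only a sketch, not something tested against the driver: it assumes the affected characters appear in text or attribute values (references are not valid inside element names), and the sample bytes are made up.

```python
# Made-up sample: utf-8 bytes containing two Cyrillic characters.
dados = b'texto inv\xc3\xa1lido: \xd0\xa3\xd1\x82'

texto = dados.decode("utf-8")

# Characters that latin-1 cannot represent become &#NNNN; references,
# so an XML parser can still recover them.
escaped = texto.encode("latin-1", errors="xmlcharrefreplace").decode("latin-1")

print(escaped)  # texto inválido: &#1059;&#1090;
```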

I would recommend asking whoever designed the database you are feeding to make it accept a universal encoding.

To understand more about these processes, stop what you are doing right now and read link

To fix your problem and remove the problematic characters from the text:

An error equivalent to the one below is what is happening inside the ODBC code when you send text containing, for example, Cyrillic characters:

In [119]: a = "texto inválido: Ут пауло интерессет темпорибус пер"

In [120]: a.encode("latin-1")
UnicodeEncodeError                        Traceback (most recent call last)

So you should: decode your data using utf-8, encode it back to latin-1 while changing the unknown characters to "?", and decode it back to text. At that point you will have data that can be sent to your database:

In [122]: dados
Out[122]: b'texto inv\xc3\xa1lido: \xd0\xa3\xd1\x82'

In [123]: dados_str = dados.decode("utf-8").encode("latin1", errors="replace").decode("latin1")

In [124]: dados_str
Out[124]: 'texto inválido: ??'

(The "dados" variable in this example is equivalent to what you have at the start: a bytes object representing text encoded in utf-8, with characters that are invalid in latin-1.) If you still get the same "unable to switch the encoding" error, try filtering out all non-ASCII characters: use "ascii" instead of "latin-1" in the code above.
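Both variants can be wrapped in one small function. The name `squash_to_charset` is hypothetical and the input is assumed to be utf-8, as in the example above.

```python
def squash_to_charset(data: bytes, target: str = "latin-1") -> str:
    """Decode utf-8 bytes and replace every character that the
    target encoding cannot represent with '?'."""
    return data.decode("utf-8").encode(target, errors="replace").decode(target)


dados = b'texto inv\xc3\xa1lido: \xd0\xa3\xd1\x82'

print(squash_to_charset(dados))           # texto inválido: ??
print(squash_to_charset(dados, "ascii"))  # texto inv?lido: ??
```

Note that the "ascii" variant also replaces accented Portuguese characters, so it is best kept as a last resort.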

    
10.04.2018 / 19:33