Encoding problem in Python

1

At some point in my code I receive a variable var of type str containing SENTEN\u00c7A.

When I run

var2 = var.encode()
print(var2)

the content of var is printed as bytes.

The original word would be 'SENTENÇA'

For var, which contains SENTEN\u00c7A, print(var.encode()) returns b'SENTEN\\u00c7A'

By contrast, print('SENTENÇA'.encode()) returns b'SENTEN\xc3\x87A'

How can I convert my variable so that print(var == 'SENTENÇA') is True? The variable comes from another program and may contain other words too, so how do I make this conversion generic?

    
asked by anonymous 18.12.2018 / 19:04

1 answer

4

In short:

Your string has been "escaped twice". It has to be converted to bytes as it is and, from there, decoded with the "unicode_escape" codec. Just do:

var2 = var.encode("latin1").decode("unicode_escape")

Explanation

Your original string var went through a "double-encoding" process at some point. In that process, the Unicode character "Ç", which has code point 199 (0xC7 in hexadecimal), was replaced by its escape sequence "\u00c7", and those six characters were "transplanted" literally into the string. Usually this "\uXXXX" representation is used only to show more complicated characters in the "repr" form of a string, or to place special characters directly in a string literal in your code. The clue to understanding this is that when you print the bytes of the string, the backslash "\" appears doubled. Python does this to indicate the presence of a "physical" \ character, as opposed to a backslash that is only a marker modifying the characters that follow it.

For example, rei = "\u265a" is the character for a black chess king. Written this normal way, the string contains just that one special character, not the 6-character string "\u265a" - see the IPython prompt:

In [107]: rei = "\u265a"                                                                                 

In [108]: print(rei)                                                                                     
♚

In [109]: len(rei)                                                                                       
Out[109]: 1
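
For contrast, here is a small sketch of the escaped form. The name rei_escapado is made up for this illustration; doubling the backslash in the literal yields six ordinary characters instead of one chess piece, which is the state the variable in the question is in:

```python
# One real character: the escape is resolved when the literal is parsed.
rei = "\u265a"

# Six ordinary characters: backslash, 'u', '2', '6', '5', 'a'.
# (rei_escapado is a hypothetical name used only for this illustration.)
rei_escapado = "\\u265a"

print(len(rei), len(rei_escapado))  # 1 6
print(repr(rei_escapado))           # the repr shows the backslash doubled
```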

So, as I explained above, something in your process applied the "unicode_escape" procedure to your text twice before it reached the variable "var". The remedy is to translate your text transparently to a sequence of bytes - that is, each character of the string "SENTEN\u00c7A" is passed without any transformation into a Python 3 bytes object, one byte per character. This is done with the "latin1" codec: all code points from 0 to 255 have a 1 to 1 correspondence between their text representation and their byte value in the Latin-1 charset (this includes the entire ASCII table plus the most common accented characters, including the ones used in Portuguese).

The second step is to decode this byte sequence using the special codec "unicode_escape". This codec finds the escape markers of the form \xFF, \uAAAA (and others) used by Python and translates them to the corresponding characters.

That is:

In [128]: b = "SENTEN\\u00c7A"

In [129]: c = b.encode("latin1")                                                                         

In [130]: c.decode("unicode_escape")                                                                     
Out[130]: 'SENTENÇA'
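
To make the conversion generic, the two steps can be wrapped in a function. This is only a sketch: unescape_text is a hypothetical name, and it assumes every character of the incoming text fits in Latin-1 (anything above code point 255 makes encode() raise UnicodeEncodeError).

```python
def unescape_text(text: str) -> str:
    """Turn literal escape sequences such as \\u00c7 into real characters.

    Hypothetical helper, not part of the original answer. It assumes every
    character of `text` fits in Latin-1; otherwise encode() raises
    UnicodeEncodeError.
    """
    return text.encode("latin1").decode("unicode_escape")


var = "SENTEN\\u00c7A"   # 13 characters, including a literal backslash
var2 = unescape_text(var)

print(var2)                 # SENTENÇA
print(var2 == "SENTENÇA")   # True
```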

Update: while I was answering, you updated the question and described how you are reading this data, with the line:

arquivo = json.loads(sys.argv[2].replace("\\", "\\\\"))

As you can see, this is what causes the error: replacing each "\" in the input string with two makes the parser treat it as a "physical backslash" rather than as an "escape indicator". If you simply remove that replace, the code snippet will probably work.
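
The effect of that replace can be reproduced without the shell. In this sketch the raw string stands in for sys.argv[2], and the key "palavra" is sample data I made up, since the real payload is not shown in the question:

```python
import json

# Stand-in for sys.argv[2]: JSON text containing a \u escape
# (the key "palavra" and the value are assumed sample data).
raw = '{"palavra": "SENTEN\\u00c7A"}'

# Without the replace, the JSON parser resolves the escape itself.
ok = json.loads(raw)["palavra"]

# With the replace, every backslash is doubled first, so JSON sees an
# escaped backslash followed by the plain text "u00c7".
broken = json.loads(raw.replace("\\", "\\\\"))["palavra"]

print(ok)      # SENTENÇA
print(broken)  # SENTEN\u00c7A (13 characters, still escaped)
```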

However, the way you are passing data to the Python script is by no means reliable, and you should use another mechanism for this. You are passing a JSON-serialized object through the shell, and the shell treats ALL the JSON delimiters [, {, " (in addition to white space itself) in a special way. The chance of something going wrong is about 300% (as it already has). A person with solid knowledge of the shell and its escaping rules could write code that does this - I consider myself solid in Unicode, but the transformations the shell makes with those characters are beyond my reach.

It is best to write your data to a temporary file from within PHP and pass only the file name to the Python script - then "json.load" can read the entire file at once.
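
A minimal sketch of the Python side of that approach follows. The function name load_payload and the sample payload are assumptions; here the temporary file is written by Python only to simulate what the PHP side would produce, and in the real script the path would come from sys.argv.

```python
import json
import os
import tempfile


def load_payload(path: str):
    """Read the whole JSON document from the file at `path`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Simulate the PHP side by writing a temporary file (assumed sample data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", encoding="utf-8", delete=False
) as tmp:
    tmp.write('{"palavra": "SENTEN\\u00c7A"}')
    tmp_path = tmp.name

# In the real script the path would come from sys.argv instead.
dados = load_payload(tmp_path)
os.remove(tmp_path)  # clean up once the data has been read

print(dados["palavra"])  # SENTENÇA
```

Because the JSON text never passes through the shell, no quoting or backslash handling is needed on either side.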

A better architecture might be to use a local Redis server: you put your data there from PHP, and the Python process reads it from Redis. This would allow whatever you are doing in Python to run as a continuous service rather than as a new, shell-initiated process for every page view (which is typically when PHP needs the services of Python).

    
18.12.2018 / 19:45