Decoding a file in Python

4

I have a file whose contents are all written like this, saved by my crawler (which is also written in Python):

    b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!'

Is there any way to work out the encoding of this file and convert it to Unicode as quickly as possible? If possible, without having to install anything, so as not to affect the performance of my crawler or the running of this service.

I've tried using encode followed by decode, but as expected that just returns the data to its initial state, and I also noticed that strings have no decode method.

    
asked by anonymous 21.04.2016 / 00:00

2 answers

1

The b' prefix in the representation of your object shows that what you have at that point in your program is a bytes object, not a text string.

In Python 3 the two are different types: now that multi-byte text encodings exist, a byte can no longer be assumed to be a character.

The normal workflow in any Python application is:

  • Get your input data.
  • If the library you use did not already deliver the data as text, that is, if you received bytes, decode them ( decode ) to turn them into text.
  • Process your data.
  • Encode the result again ( encode ) and write it to the output (if this is not done automatically, as it is with text files, for example).

So in your case, assuming the object you have there is in the variable a, just decode those bytes to text (an object of type str in Python 3) and continue your program:

    a = b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!'
    b = a.decode("utf-8")
    print(b)
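The whole workflow above can be sketched with the standard library alone. This is a minimal illustration, not the asker's actual setup: the file name crawler_output.txt, the temporary directory, and the upper() "processing" step are all assumptions made for the example.

```python
import os
import tempfile

raw = b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!'

# Hypothetical file standing in for the crawler's output.
path = os.path.join(tempfile.mkdtemp(), "crawler_output.txt")
with open(path, "wb") as f:
    f.write(raw)

# 1. Get the input data (binary mode returns bytes).
with open(path, "rb") as f:
    data = f.read()

# 2. Decode the bytes to text (str).
text = data.decode("utf-8")

# 3. Process the data as text (placeholder processing step).
processed = text.upper()

# 4. Encode on output: a file opened in text mode with an explicit
#    encoding performs the encode step automatically.
with open(path, "w", encoding="utf-8") as f:
    f.write(processed)

# Read the result back as bytes to see what was actually stored.
with open(path, "rb") as f:
    written = f.read()

print(processed)
```

Opening the input with open(path, encoding="utf-8") in text mode would fold steps 1 and 2 into one; they are kept separate here only to mirror the list.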
    

    In this case, I can tell that the encoding is utf-8 just by looking at the bytes: each accented character takes two bytes, and the first of them being "\xc3" is a good hint that the bytes represent text encoded in utf-8 .
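That visual check can be approximated in code without installing anything, by trying strict decodes in order. The sketch below is only a heuristic, and the helper name decode_guessing and its list of candidate encodings are assumptions for illustration. Note that latin-1 accepts any byte sequence, so it only makes sense as a last resort.

```python
def decode_guessing(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the bytes without error."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# The accented character from the question really is two bytes in UTF-8.
two_bytes = 'ã'.encode('utf-8')  # b'\xc3\xa3'

sample = b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!'
text, encoding = decode_guessing(sample)

# The same word stored as latin-1 is not valid UTF-8, so the strict
# UTF-8 attempt fails and the next candidate is used instead.
legacy_sample = 'Não'.encode('latin-1')  # b'N\xe3o'
legacy_text, legacy_encoding = decode_guessing(legacy_sample)

print(encoding, legacy_encoding)
```

Trying utf-8 first is a deliberate choice: random non-UTF-8 data rarely passes a strict UTF-8 decode, whereas single-byte encodings accept almost anything.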

    An essential thing to understand is the difference between text ( str in Python 3), which is made up of unicode characters, and bytes, which are sequences of numbers between 0 and 255 and are what is actually stored in files or transmitted over the network. To understand this properly, be sure to read:

    link
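The text versus bytes distinction can be seen directly in the interpreter. A small sketch, using a word from the question as the sample (the variable names are just for illustration):

```python
text = "não"                  # str: a sequence of unicode characters
data = text.encode("utf-8")   # bytes: a sequence of integers 0..255

# Three characters, but four bytes, because "ã" needs two bytes in UTF-8.
n_chars = len(text)
n_bytes = len(data)

# Indexing a str gives a character; indexing bytes gives an integer.
second_char = text[1]
second_byte = data[1]

# The raw numbers that would be stored in a file or sent over the network.
byte_values = list(data)

print(n_chars, n_bytes, byte_values)
```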

        
    22.04.2016 / 22:17
    0

    One annoying thing in Python is working with strings; I've already had plenty of headaches with unicode and utf-8 .

    Getting to the point: in interpreted languages, header comments are often used for "configuration".

    In Python , the first line is usually reserved for the path to the interpreter (the shebang).

    #!/usr/bin/python
    

    or

    #!/usr/bin/env python
    

    And the second line usually holds the source encoding declaration:

    #*-* coding: utf-8 *-*
    #*-* coding: latin-1 *-*
    

    Use only one of them!

    The *-* decorations should not be needed, but I've never written the line without them, so I can't say whether leaving them out makes a difference.

    With this line, you can use special characters in your source code (the example uses the Python 2 print statement):

    print "maçã é maçã!"
    

    If you want to work with unicode in Python 2, you can use the built-in unicode function to convert your bytes (it does not exist in Python 3, where bytes.decode does this job).

    uni_str = unicode("maçã", "utf-8")
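The unicode call above exists only in Python 2. For the Python 3 setting of the question, a rough equivalent might look like the sketch below; the variable names are illustrative.

```python
# In Python 3 a plain string literal is already unicode text (str),
# so start from bytes to mirror what unicode(..., "utf-8") was given.
raw = "maçã".encode("utf-8")   # b'ma\xc3\xa7\xc3\xa3'

# Usual Python 3 spelling: call decode on the bytes object.
uni_str = raw.decode("utf-8")

# The str constructor also accepts bytes plus an encoding, which is
# the closest match to the shape of the Python 2 call.
also_uni_str = str(raw, "utf-8")

print(uni_str)
```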
    

    But you should have the coding line in all files anyway!

        
    21.04.2016 / 00:30