Accentuation error when saving a file in Python


I'm not able to save a file with accented characters in Python, so I've come to ask for your help:

import csv


f = open('output.txt', 'w')


data = []

def parse(filename):

    with open(filename, 'r') as csvfile:

        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)

        for line in reader:             

            f.write("%s\n" % line)

parse('Soap.csv')
f.close()

I always get strings like:

  

Observa\xe7\xf5es

and I would like the output to be like:

  

Observações

    
asked by anonymous 11.04.2017 / 01:23

1 answer

The biggest problem is that each iteration of the reader returns a list object, and you are writing that list straight into the output text file, converting it to a string only through the % operator in f.write("%s\n" % line).

This conversion of the list to a string (even if you were using the string method .format instead of %) uses the internal representation (repr) of each element of the list, not its representation given by str. In Python 3 your code would have appeared to work, because the internal representation of simple accented characters displays them as they are, instead of as escape sequences ("\xHH" for byte strings in Python 2, "\uHHHH" for text strings in Python 3).
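
A minimal sketch of the difference, using a made-up word whose latin-1 bytes contain the same kind of escapes you are seeing (this is a Python 2 session, not your code):

>>> s = 'cora\xe7\xf5es'       # the latin-1 bytes of "corações" (a Python 2 str)
>>> f = open('out.txt', 'w')
>>> f.write("%s\n" % s)        # str is used: the raw bytes go into the file
>>> f.write("%s\n" % [s])      # repr of the list: the escapes go into the file
>>> f.close()

The first line of out.txt contains the raw accented bytes; the second contains ['cora\xe7\xf5es'] - the escaped repr, which is exactly what you are seeing.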

Either way, the right thing to do is to write each string from the list individually, making sure Python uses the representation given by str. Adapting your code, it could look like this - assuming you want the output to read exactly as your code attempts: on each line, a list of strings in Python syntax:

import csv

def parse(filename):

    with open(filename, 'r') as csvfile, open('output.txt', 'w') as f:

        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)

        for line in reader:

            f.write("[%s]" % ", ".join("'%s'" % field for field in line)  )

parse('Soap.csv')

Note that I've also fixed another glaring problem in your code: opening the output file in the module body, closing it in the module body without any error handling, and having the function rely on the open file as a global variable.

If the file is to be used in more than one function, or across more than one call to the same function: (1) create another function that encapsulates all the calls that will write to the file; (2) preferably use the "with" statement to open (and automatically close) the output file; (3) pass the open file explicitly, as a parameter, to every function that will use it - as in the sketch below.
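
A minimal sketch of item (3), with illustrative names (write_lines and main are not from your code) and the delimiter passed directly instead of using the Sniffer, for brevity:

import csv

def write_lines(reader, outfile):
    # receives the already-open output file explicitly - no global variable
    for line in reader:
        outfile.write("[%s]\n" % ", ".join("'%s'" % field for field in line))

def main():
    # "with" closes both files even if an error happens halfway through
    with open('Soap.csv', 'r') as csvfile, open('output.txt', 'w') as outfile:
        reader = csv.reader(csvfile, delimiter=';')
        write_lines(reader, outfile)

main()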

Now, as I mentioned earlier, this code is Python 2 and works almost by accident, because you're dealing with text data - both from your input file and in your output - without decoding what you read or encoding what you write with a specific encoding. That is the kind of thing that makes Python 2 so difficult: people assume it's "right", but a "\xe9" byte can be an "é" if the encoding is latin-1, or an entirely different character if the encoding is for Greek, Cyrillic, Hebrew or another language.
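
A minimal sketch of that ambiguity (Python 2 session; latin-1, iso8859-7 and iso8859-5 are standard Python codec names):

>>> b = '\xe9'                    # a single byte, with no encoding attached to it
>>> print b.decode('latin-1')     # latin-1: "é"
é
>>> print b.decode('iso8859-7')   # Greek codec: a completely different letter
ι
>>> print b.decode('iso8859-5')   # Cyrillic codec: yet another one
щ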

In Python 2, the csv module is quite limited for working with real text - you have to do the decoding and encoding of each element you read or write by hand. In Python 3 it decodes the text automatically.

So, assuming you're reading a CSV file in latin-1 and want to write the output in UTF-8, for example, you can do this:

import csv
INPUT_CODEC = "latin1"
OUTPUT_CODEC = "utf-8"
def parse(filename):

    with open(filename, 'r') as csvfile, open('output.txt', 'w') as f:

        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)

        for line in reader:
            line = [field.decode(INPUT_CODEC) for field in line]

            f.write("[%s]" % ", ".join("'%s'" % field.encode(OUTPUT_CODEC) for field in line))

parse('Soap.csv')

In Python 3, you pass the encodings when you open the files, and Python does the decoding and encoding for you. If you don't pass them, it tries to use appropriate defaults taken from the operating-system context:

with open(filename, 'r', encoding="latin1") as csvfile, open('output.txt', 'w', encoding="utf-8") as f:
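
Putting it together, a Python 3 version of the function could be sketched like this (same assumption as above: latin-1 in, UTF-8 out):

import csv

def parse(filename):
    with open(filename, 'r', encoding='latin-1') as csvfile, \
         open('output.txt', 'w', encoding='utf-8') as f:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)
        for line in reader:
            # in Python 3 the fields are already text (str) - no decode/encode calls
            f.write("[%s]\n" % ", ".join("'%s'" % field for field in line))

parse('Soap.csv')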

There is yet another issue: if your strings contain line breaks (and possibly some other characters), those line breaks ("\n") will go straight into your output file, making it hard to read - and syntactically invalid as "one Python list per line". That is, if your CSV has something like palavra 1;"batatinha quando nasce\n esparrama pelo chão";palavra 3, the "enter" inside the second column will be read correctly by the CSV reader (because of the quotation marks) - and will be written as-is to your output file.

To avoid this, you can escape the line break and a few other special characters in the output file: that is, convert characters that would compromise the structure of the file into escape sequences that cause no problems and can be interpreted back when reading - one of the ways to read your output file back, for example, is to do an "eval" on each line. A safe approach is to use urllib.quote to write each string and urllib.unquote to read it back - but that requires a treatment step when reading, and produces a file that is hard to read and edit by hand. Another way is simply to replace each real "\" with two "\\", and then each real "\n" (a single character, decimal code 10) with "\n" (two characters, "\" and "n") - that way, when Python does an "eval", it reads "\\" as a single "\" and interprets the "\n" in the text file as a single newline character.
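
A minimal sketch of that last idea, as a helper applied to each field before it is written (escape_field is an illustrative name, not something from the csv module):

def escape_field(field):
    # first double each real backslash, then escape the real newlines
    field = field.replace("\\", "\\\\")
    field = field.replace("\n", "\\n")
    return field

With that, each record stays on a single physical line of the output, and eval (or the safer ast.literal_eval) can turn it back into a list - quotes inside the fields would still need a similar treatment.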

    
11.04.2017 / 04:35