How to manipulate docx files using python?

0

I created two docx files but I can not write to them (using python) however I can do this with txt files. The question is could I create a docx file and write to it using a python script?

    
asked by anonymous 05.02.2018 / 00:41

1 answer

1

The point is that documents like .docx, the old .doc, or the libreoffice .odt are orders of magnitude more complex than a normal .txt file. Although it is easy to understand how they work (.docx and .odt, not .doc) - Quesado's comment on the question is correct - they are complex files to deal with. Both files are a container of type ".zip", with one or more ".xml" files embedded. (Versions of both files can be produced just as an .xml, not a .zip, but not a widely used format - the same libreoffice took a long time to support unzipped .odt)

In particular it is not the obligation of any programming language to have tools already included for manipulation of these types of files: this type of support is provided by third party modules. Luckily Python has an eco-system of modules that has some tools that can directly handle this, and these modules are easy to install with the "pip" tool.

Note that the two technologies used by docx and odt are indirectly supported by the default Python library - there is the zipfile module that allows you to extract, update, and create zip files, and there is xml.etree that allows you to read and write XML files by treating the tags as appropriate "nodes". The problem is that the structure and what goes into the XML file in each case is complex, regardless of whether you can type in it or not. Just to create a program that can properly open the container .zip from an odt or docx, and parse the inner xml, and being able to update that and create a new version of docx or odt would take most of a programmer's day experienced and motivated.

These modules for updating and creating files of type already do the dirty service - but updating the content of this type of file continues being complicated: you will have to know how to create the type of data in Python that represents the type of data that I want to insert into the file (paragraph, table, etc ...), and how to insert this data into the edited place, but this only after learning to use the library to read the contents of the file.

Probably the "python-docx" module will allow you to do what you want, but I suggest a good look at its documentation before attempting to use it.

Alternatively, if you need to create text with different styles, colors, etc ... it can be much simpler create an HTML document with a suitable style sheet, and use or a command line command, or selenium to run a browser to read, render that html and generate a PDF output. (or HTML with the style sheet can be read directly by MicrosoftWord or LibreOffice itself). HTML + CSS documentation is much easier to read, has thousands of times more examples, and you can still use one of Python's many template tools to check out the "skeleton" and general form of the documents your program needs to generate, and create only the "stuffing" with relevant data in Python code. I would suggest using jinja2 , which is a much-used Python web template package.

    
05.02.2018 / 05:36