These problems seem to arise only in documents created by LibreOffice.
Three self-help methods to fix LibreOffice .docx files with SAX parse errors. You only need to use one of them!
1 AOO seems to be able to open these files, so download Apache OpenOffice from
http://www.openoffice.org/download/index.html. Create a new user on your PC and install AOO
for that user only. AOO and LO seem to interact, in that LO picks up some of the AOO properties, and installing under a separate user completely isolates AOO from LO. Open the .docx file with AOO and save it as an .odt file. Then uninstall AOO and delete the added user.
2 Follow the directions given at ... ...
viewtopic.php?f=101&t=86936&#p403228. This requires you to unzip the .docx file, extract the \word\document.xml file, and remove every repeated occurrence of the attribute named in the error message you get when you open the .docx file. Note that more than one attribute may be repeated in the file, so you may have to do the same for the other repeated attribute(s). Repeated attributes reported here include w:themeShade, w:themeColor and w:cstheme. Files uploaded to this thread have had many (30+?) repeats.
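If you would rather not edit 30+ repeats by hand, method 2 can be roughly sketched in Python. This is only a sketch under some assumptions: document.xml is UTF-8, the first occurrence of a repeated attribute is the one worth keeping, and a regular expression is good enough to find the start tags. The function names are made up for this example. Work on a copy of your file.

```python
import re
import zipfile

# One attribute inside a start tag: name="value" or name='value'.
ATTR_RE = re.compile(r'\s+([\w:.-]+)\s*=\s*("[^"]*"|\'[^\']*\')')
# A start tag or empty-element tag, e.g. <w:color w:val="FF0000"/>.
TAG_RE = re.compile(
    r'<([\w:.-]+)((?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)\s*(/?)>'
)


def drop_repeated_attributes(xml_text):
    """Keep only the first occurrence of each attribute name in every tag."""
    def fix_tag(match):
        name, attrs, slash = match.group(1), match.group(2), match.group(3)
        seen = set()
        kept = []
        for attr in ATTR_RE.finditer(attrs):
            if attr.group(1) not in seen:  # assumption: first occurrence wins
                seen.add(attr.group(1))
                kept.append(f" {attr.group(1)}={attr.group(2)}")
        return f"<{name}{''.join(kept)}{slash}>"
    return TAG_RE.sub(fix_tag, xml_text)


def repair_docx(src_path, dst_path):
    """Copy a .docx, rewriting only word/document.xml on the way."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")  # assumption: UTF-8 encoded
                data = drop_repeated_attributes(text).encode("utf-8")
            dst.writestr(item, data)
```

For example, repair_docx("fred.docx", "fred_fixed.docx") writes a new file and leaves fred.docx untouched. Every other entry in the archive is copied unchanged.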
3 Extract \word\document.xml from the .docx file and strip off all the XML tags to leave just the text.
Windows:
Rename the file from fred.docx to fred.ZIP.
Double click fred.ZIP.
Navigate to the \word folder.
Drag document.XML onto the desktop.
- Install Notepad++ and the XML Tools plug-in. Open document.xml with Notepad++. Go to Plugins > XML Tools > Pretty print XML with line breaks. Delete the XML tags leaving just the text.
- Alternatively, Google "pretty print" and upload document.xml to a pretty print web site, which will format it. Delete the XML tags.
Edit: I had a file with about 30 errors and I had to find them manually using Notepad++.
I downloaded XML Copy Editor and found it much easier to use as it stepped through the file finding each line with an error.
However, XML Copy Editor would not pretty print because of the errors, so I needed to use Notepad++ to pretty print the file which I then saved. I edited the saved file with XML Copy Editor, saved it, and used Notepad++ to re-linearise it.
XML Copy Editor missed some errors when using F2 to step through the file. However, issuing the pretty print command in XML Copy Editor located these errors.
Linux:
Rename the file from fred.docx to fred.ZIP.
Unzip fred.ZIP - you may need to install a ZIP utility on Linux.
Navigate to the \word folder.
Extract document.xml.
- Install an XML editor. Open document.xml with the XML editor and format it "pretty print". Delete the XML tags leaving just the text.
- Alternatively, Google "pretty print" and upload document.xml to a pretty print web site, which will format it. Delete the XML tags.
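On either platform, the rename, unzip and extract steps can also be done with Python's standard zipfile module, since a .docx file is a ZIP archive. A minimal sketch, assuming document.xml is UTF-8; the function name is made up for this example:

```python
import zipfile


def read_document_xml(docx_file):
    """Return the text of word/document.xml from a .docx file.

    docx_file may be a path such as "fred.docx" or an open binary file.
    No renaming to .ZIP is needed.
    """
    with zipfile.ZipFile(docx_file) as archive:
        raw = archive.read("word/document.xml")
    return raw.decode("utf-8")  # assumption: UTF-8 encoded
```

Calling read_document_xml("fred.docx") gives you the same XML you would get by dragging document.xml out of the archive.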
Edit: The easiest way to delete all the XML tags is by using Find and Replace with Regular Expressions. It should work in LO as long as you do not break the character limit for a paragraph (64k in AOO).
It works fine in Notepad++. Open document.xml. Pretty print it (this needs the XML Tools plugin; without it you will end up with a single paragraph). Go to Search > Replace ..., with search argument <[^>]+> and a blank replace argument. Tick Regular Expressions. Click Replace All.
All XML tags are deleted and you are left with just the text.
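The same Find and Replace can be sketched in Python using the search argument <[^>]+> from above. Two things here go beyond the Notepad++ steps and are my own assumptions: a line break is added at the end of each Word paragraph (</w:p>) so that paragraphs do not run together, and XML entities such as &amp; are turned back into plain characters. The function name is made up for this example.

```python
import html
import re

# Same pattern as the Notepad++ search argument.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(xml_text):
    """Remove all XML tags and return just the text."""
    # Assumption: end each Word paragraph with a newline before stripping.
    xml_text = xml_text.replace("</w:p>", "</w:p>\n")
    text = TAG_RE.sub("", xml_text)
    # Assumption: convert entities like &amp; and &lt; back to characters.
    return html.unescape(text)
```

Pass it the contents of document.xml as a string and write the result to a .txt file. Pretty printing first is not needed here, because the pattern does not depend on line breaks.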