Post by HighOnTrombone » Mon Apr 20, 2020 10:56 pm

Hi all,

I'm working on a document using LibreOffice on Windows 10. I saved it as a .docx (my attempts to find a solution so far have taught me the error of doing that; I'll use .odt in the future), but when I reopened it the next day, most of the text was missing, starting at about the list of tables.

Googling to find a solution has led me to a few things that sound similar, but aren't quite the same. This doesn't seem to be a problem with SAXParse. This bug thread isn't about quite the same problem either: my file isn't simply truncated, because the images I put in are still there (as are the cross-references throughout the paper and the hyperlinks in what used to be the bibliography). I did try checking the document.xml file, but XML Copy Editor says it's well-formed and the XML Tools plugin for Notepad++ detects no errors. Trying to open it in MS Word 2010 didn't work; it just told me it couldn't open the file because of an unspecified error in word\document.xml at line 2, column 0.

The information all seems to still be in document.xml, so it should theoretically be fixable; I'm just not sure how. I've uploaded the file here so people can take a look if they want. Thanks in advance for any help you can give me.
Post by Zizi64 » Tue Apr 21, 2020 6:12 am

I saved it as a .docx

Always work in the native, international standard ODF file formats. Save a copy in the foreign file format at the end of editing, if it is necessary.
Post by John_Ha » Tue May 05, 2020 10:46 am

If you unzip the .docx file (rename fred.docx to fred.zip and double-click it) you will see word\document.xml. It contains the text and XML tags. The word\media folder contains the images.

Open document.xml with Notepad++.

If you pretty print document.xml with the XML Tools plugin it becomes readable.
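If you would rather not rename and unzip by hand, the same two steps can be scripted. This is only a rough sketch using Python's standard library, not something from the posts above: the function names are made up for illustration, and it assumes the archive stores the main text at word/document.xml, as a normal .docx does.

```python
import zipfile
from xml.dom import minidom


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx archive.

    A .docx is an ordinary zip file, so no renaming is needed here.
    docx_path may be a file name or an open binary file object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def pretty_print(xml_bytes):
    """Re-indent the XML so it is readable, one element per line.

    minidom raises an ExpatError if the XML is not well-formed,
    so this doubles as a basic well-formedness check.
    """
    return minidom.parseString(xml_bytes).toprettyxml(indent="  ")
```

For example, pretty_print(read_document_xml("fred.docx")) returns the indented XML as a string, which you can write to a file and open in any text editor.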

(Optional: Linearise the XML first, or you will get loads of tabs in the result.) Go to Search > Replace..., enter <[^>]+> as the search argument and leave the replace argument blank. Be sure to select Regular expression as the search mode. Click Replace All. This strips the tags. See text.odt, which has the text but with all formatting, tables, etc. stripped.
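The same tag-stripping regex works outside Notepad++ too. Here is a minimal sketch in Python, with hypothetical function names. The second function is my own addition rather than part of the Notepad++ steps: it assumes paragraphs end with a closing w:p tag (as they do in Word's XML) and puts a line break there so the recovered text does not run together.

```python
import re

# Same pattern as the Notepad++ search argument: "<", anything except ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(xml_text):
    """Remove every XML tag and keep only the text between tags."""
    return TAG_PATTERN.sub("", xml_text)


def strip_tags_keep_paragraphs(xml_text):
    """Like strip_tags, but start a new line after each paragraph.

    Assumption: each paragraph is closed by </w:p>.
    """
    with_breaks = xml_text.replace("</w:p>", "</w:p>\n")
    return strip_tags(with_breaks)
```

As with the manual method, this recovers the plain text only. Formatting, tables and images are lost, and XML entities such as &amp;amp; are left as they are.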

See [Tutorial] Differences between Writer and MS Word files for a description of the differences and for why you should always work in, and save as, the native formats: .odt for Writer files, .ods for Calc files, .odp for Impress files, etc.

Showing that a problem has been solved helps others searching so, if your problem is now solved, please view your first post in this thread and click the Edit button (top right in the post) and add [Solved] in front of the subject.
