Need ODT to TXT conversion tips for paragraph numbering


Postby NKR8 » Mon Nov 18, 2019 2:01 pm

Hello
I am working on a converter that turns an .odt document into a plain text file, which will then be read by another Windows application.

I have the following text in an ODT file:
Code:
1. Title  [Outline level 1]
                       SomeHeading1 [Outline level 9 but no numbering style assigned]
                       Text body of <SomeHeading1> [Outline level Body Text]
    1.1 SubTitle  [Outline level 2]
                       SomeHeading2 [Outline level 9 but no numbering style assigned]
                       Text body of <SomeHeading2> [Outline level Body Text]

I want to convert all of this text to plain text.
I can extract the text itself, but not the numbering information such as 1, 1.1, etc.
I cannot find this information in content.xml or styles.xml.

Could you please suggest the right way to do this conversion?
[I cannot upload the document's content.xml or styles.xml because it is a customer document and sharing it is against our company policy.]

Thank you in advance.
Last edited by NKR8 on Mon Nov 18, 2019 2:15 pm, edited 1 time in total.
OpenOffice 4.1.7
Operating System: Windows 10
NKR8
 
Posts: 12
Joined: Mon Nov 18, 2019 1:42 pm

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby robleyd » Mon Nov 18, 2019 2:07 pm

Would File | Save As and choose file type Text (.txt) do what you want?
Cheers
David
Apache OpenOffice 420m2(Build:9821) - Slackware 14.2 - 64 bit
LibreOffice 6.0.7.3 - Slackware 14.2 - 64 bit
Apache OpenOffice 4.1.4 - Windows 7 Virtual machine
robleyd
Moderator
 
Posts: 3297
Joined: Mon Aug 19, 2013 3:47 am
Location: Murbko, Australia

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby NKR8 » Mon Nov 18, 2019 2:13 pm

No, I want to do it programmatically.
So I need to understand how numbering behaves in OpenOffice. I am already referring to the OpenOffice 1.2 documentation, but it does not explain this well.
I want to know exactly where I should look to extract the numbering information shown in the example in my first post.
For your information, my document contains no numbering.xml.
OpenOffice 4.1.7
Operating System: Windows 10
NKR8
 
Posts: 12
Joined: Mon Nov 18, 2019 1:42 pm

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby RoryOF » Mon Nov 18, 2019 2:20 pm

Why not print the file, complete with numbering, to a text capture (a pseudo printer)?
Apache OpenOffice 4.1.7 on Xubuntu 18.04.4 (mostly 64 bit version) and very infrequently on Win2K/XP
RoryOF
Moderator
 
Posts: 30897
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby NKR8 » Mon Nov 18, 2019 2:27 pm

What do you mean by a pseudo printer?

Actually, the plain text format is defined by the external Windows application that reads it.
So I have to extract the text from the ODT file and convert it to the plain text format that the external Windows application reads.
OpenOffice 4.1.7
Operating System: Windows 10
NKR8
 
Posts: 12
Joined: Mon Nov 18, 2019 1:42 pm

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby RoryOF » Mon Nov 18, 2019 2:36 pm

A quick search found this
Code:
Follow these steps:
1) Use Control Panel and open the Printers page.
2) Double-click the Add Printer icon
3) Select Local printer and click [Next]
4) Scroll down in the Manufacturers list and select Generic
5) Click [Next] to create the Generic / Text Only printer
6) When you get to the Available Ports screen, select File: Creates a file on disk

7) On the next page, set the name to TheMagicPrinter
8) Skip the test page and click [Finish]

9) Back in the Printers window, right-click the new printer and select Properties

10) Under the Details tab, click [Add Port...]
11) Select Other and click [OK]
12) In the Port Name box that pops up, enter:
  c:\windows\temp\TheMagicFile.txt
And click [OK] until you are out of the dialogs.

Now a printer named TheMagicPrinter will appear in the list of available printers.  When selected, the printed output will go to a file named c:\windows\temp\TheMagicFile.txt

Your program can then read that file and do whatever it wants with it.


It was at https://www.experts-exchange.com/questions/20076914/Pseudo-Printer-name.html
Apache OpenOffice 4.1.7 on Xubuntu 18.04.4 (mostly 64 bit version) and very infrequently on Win2K/XP
RoryOF
Moderator
 
Posts: 30897
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby NKR8 » Mon Nov 18, 2019 2:56 pm

Thank you for this pointer.
But we cannot offer such a solution; we do not want customers to have to create a pseudo printer.
Our converter is already in customers' hands, and we do not want to make such a major change.

What I am asking about is just a bug in how our converter handles numbering.
OpenOffice 4.1.7
Operating System: Windows 10
NKR8
 
Posts: 12
Joined: Mon Nov 18, 2019 1:42 pm

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby RoryOF » Mon Nov 18, 2019 3:12 pm

As far as I can see, the numbering is generated dynamically by OpenOffice's output modules, whether for screen or for print. If that is so, you may need to track the numbering-related XML (outline levels and list styles) yourself and recreate the numbers. I examined the content.xml of a numbered .odt file and found no stored sequence numbers for the numbered chapter headings.

 Edit: I think the text stream is processed by a module called saxparse, and the processed stream is then passed to the appropriate output module.
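
The idea of recreating the numbering can be sketched roughly as follows. This is only a minimal sketch under assumptions, not how OpenOffice itself does it: the function names read_content_xml and number_headings are made up for illustration, and the numbered_levels parameter is an assumption standing in for the outline style in styles.xml (the text:outline-level-style elements and their style:num-format attribute), which a real converter would read to learn which levels are numbered and what prefix or suffix they use. It also ignores start values and restarted numbering.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for text elements in OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_content_xml(odt_path):
    """Return the raw bytes of content.xml from an .odt (zip) file."""
    with zipfile.ZipFile(odt_path) as odt:
        return odt.read("content.xml")


def number_headings(content_xml, numbered_levels=frozenset({1, 2, 3})):
    """Return a list of (label, text) for every heading in document order.

    numbered_levels is an ASSUMPTION: the outline levels that show a number.
    Headings on other levels get an empty label.
    """
    root = ET.fromstring(content_xml)
    counters = {}  # outline level -> current count
    result = []
    for h in root.iter(f"{{{TEXT_NS}}}h"):
        level = int(h.get(f"{{{TEXT_NS}}}outline-level", "1"))
        text = "".join(h.itertext()).strip()
        if level not in numbered_levels:
            result.append(("", text))
            continue
        counters[level] = counters.get(level, 0) + 1
        # A new heading resets the counters of all deeper levels.
        for deeper in [k for k in counters if k > level]:
            del counters[deeper]
        label = ".".join(
            str(counters.get(lv, 0))
            for lv in range(1, level + 1)
            if lv in numbered_levels
        )
        result.append((label, text))
    return result
```

With the structure from the first post, a level 1 heading followed by a level 2 heading would get the labels 1 and 1.1, and the level 9 headings would get no label. Any trailing dot (as in "1.") would have to come from the outline style.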


I am now at the limit of my knowledge of this.
Apache OpenOffice 4.1.7 on Xubuntu 18.04.4 (mostly 64 bit version) and very infrequently on Win2K/XP
RoryOF
Moderator
 
Posts: 30897
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Need ODT to TXT conversion tips for paragraph numbering

Postby John_Ha » Wed Nov 20, 2019 3:54 pm

1. Rename the .odt file to a .zip file.
2. Double click the zip file
3. Drag content.xml onto the desktop.
4. Open content.xml in an XML editor and pretty print it.

Can you find what you want in the XML?
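
If you prefer to do the steps above from a script rather than by hand, something like this should work, since an .odt file is an ordinary zip archive. The function name pretty_content_xml is just an illustrative choice, and the output is meant for reading only: pretty printing adds whitespace inside text, so do not feed it back into a converter.

```python
import zipfile
from xml.dom import minidom


def pretty_content_xml(odt_path):
    """Read content.xml straight out of the .odt and return it indented."""
    with zipfile.ZipFile(odt_path) as odt:
        raw = odt.read("content.xml")
    # minidom re-indents the XML so that nested elements are easy to scan.
    return minidom.parseString(raw).toprettyxml(indent="  ")
```

For example, print(pretty_content_xml("document.odt")) and then search the output for text:h and text:outline-level.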

Perhaps a better approach is to save the .odt file as a .pdf, which gives an exact copy of the formatted file. You may also be able to save it as a .txt file. Then use regular expressions to parse the text and process it as required. See the tutorial on regular expressions. Regular expressions are designed for exactly this kind of task.
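
As a rough illustration of the regular expression approach, here is a small sketch that assumes the exported text keeps the numbers at the start of the heading lines, as in "1. Title" and "1.1 SubTitle". The pattern and the function name split_numbered_lines are my own assumptions, and a body line that happens to start with a number would also be treated as a heading, so adjust it to your real output.

```python
import re

# A number such as 1 or 1.1, an optional trailing dot, then the heading text.
HEADING_RE = re.compile(r"^\s*(?P<number>\d+(?:\.\d+)*)\.?\s+(?P<title>\S.*)$")


def split_numbered_lines(text):
    """Return (number, text) per non-empty line; number is None if absent."""
    rows = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            rows.append((match.group("number"), match.group("title").rstrip()))
        elif line.strip():
            rows.append((None, line.strip()))
    return rows
```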
LO 6.3.5.2, Windows 10 Home 64 bit

See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.

Remember: Always save your Writer files as .odt files. - see here for the many reasons why.
John_Ha
Volunteer
 
Posts: 7607
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

