[Solved] Greedy regex problem

nienberg
Posts: 28
Joined: Mon Sep 21, 2009 9:23 pm
Location: Berkeley, CA

[Solved] Greedy regex problem

Post by nienberg »

I've been happily coding away, converting a fairly complex MS Word solution to Writer, until I ran into a problem with the way OOo finds text in a Writer document using regular expressions. My research shows it is well documented that the find is "greedy", and that is my problem. I am searching for text that is enclosed within tags (like HTML or XML tags). Here is an example:

Here is an example with <myTag>some tagged text</myTag> and some non-tagged text.  The problem is finding <myTag>individual instances of tagged text</myTag> that occur in the same paragraph without also finding all of the non-tagged text in between.
The basic regex that finds too much because it is greedy is:

<myTag>.*</myTag>
I've also tried this:

<myTag>[^<]*</myTag>
which partly solves the problem, but fails when tags are nested.

Here is an example of <myTag>tagged text that also includes <anotherTag>some nested tags</anotherTag> inside it</myTag>.  This won't work with the negated regex above.
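The behaviour described above can be reproduced outside Writer. The following is a minimal sketch using Python's `re` module rather than OOo's own search engine, so it only illustrates the general regex semantics; the variable names are made up for the illustration, and the negated class is written with a `*` quantifier.

```python
import re

# Sample text modelled on the two examples in the post.
flat = ("Here is an example with <myTag>some tagged text</myTag> and some "
        "non-tagged text. The problem is finding <myTag>individual instances "
        "of tagged text</myTag> that occur in the same paragraph.")
nested = ("Here is an example of <myTag>tagged text that also includes "
          "<anotherTag>some nested tags</anotherTag> inside it</myTag>.")

greedy = re.compile(r"<myTag>.*</myTag>")      # .* grabs as much as it can
negated = re.compile(r"<myTag>[^<]*</myTag>")  # stops at the first '<'

# Greedy: one match running from the first opening tag to the last closing tag.
greedy_hits = greedy.findall(flat)

# Negated class: finds both tagged spans separately in the flat example...
negated_flat_hits = negated.findall(flat)

# ...but finds nothing once another tag appears inside <myTag>.
negated_nested_hits = negated.findall(nested)

print(len(greedy_hits), negated_flat_hits, negated_nested_hits)
```

The negated class fails on the nested example because `[^<]*` stops at the `<` of `<anotherTag>`, which is not followed by `/myTag>`.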
Does anyone have any suggestions on how I might proceed? I realize this may not strictly be a Macro/UNO API problem, since it can be demonstrated within Writer itself, but in the end I am using the regex in a Basic routine.

Thanks,
Last edited by nienberg on Wed Sep 30, 2009 6:17 pm, edited 2 times in total.
LibreOffice 3.4.3
Windows XP, Windows 7, and MacOS X 10.7
Robert Tucker
Volunteer
Posts: 1250
Joined: Mon Oct 08, 2007 1:34 am
Location: Manchester UK

Re: Greedy regex problem

Post by Robert Tucker »

Having found:
"It would be nice to be able to switch the find/replace, as well as other places where regexps are used, to be either greedy or not greedy."

"Come on, that usually is only a matter of adding [^x] (substitute x for whatever is appropriate) to the regexp. I'd say that's pretty low priority."
at:

http://wiki.services.openoffice.org/wik ... xpressions

and finding that:

<(myTag).*?\1>
will not work, I guess a solution in OpenOffice may be some way off.
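For comparison, here is what that pattern does in an engine that does support lazy quantifiers. This is only a sketch in Python's `re` module, not in OOo's search, and the sample strings and variable names are assumptions made for the illustration.

```python
import re

# The pattern from the post: (myTag) is captured, .*? matches as little as
# possible, and \1 repeats the captured tag name inside the closing tag.
lazy = re.compile(r"<(myTag).*?\1>")

flat = ("with <myTag>some tagged text</myTag> and untagged text, then "
        "<myTag>more tagged text</myTag> in the same paragraph")
nested = ("<myTag>tagged text with <anotherTag>nested tags</anotherTag> "
          "inside it</myTag>.")

# finditer + group(0) returns the whole match rather than just the group.
flat_hits = [m.group(0) for m in lazy.finditer(flat)]
nested_hits = [m.group(0) for m in lazy.finditer(nested)]

print(flat_hits)
print(nested_hits)
```

Each tagged span is found on its own, and a differently named nested tag does not end the match early. A `<myTag>` nested inside another `<myTag>` would still be cut off at the first closing tag.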
LibreOffice 7.x.x on Arch and Fedora.
nienberg
Posts: 28
Joined: Mon Sep 21, 2009 9:23 pm
Location: Berkeley, CA

Re: Greedy regex problem

Post by nienberg »

Right, and to make matters worse, I now realize that the search does not work across paragraphs, so if there is a paragraph break inside a set of tags, nothing will be found. I guess I have to rethink this. Maybe I can work out a solution that manipulates the .odt file directly using Perl or something. But I'm open to other suggestions.
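One way the "work on the .odt file directly" idea might look is sketched below, in Python rather than Perl. It relies on an .odt file being a zip archive whose text lives in `content.xml`. The helper names and the in-memory sample document are hypothetical, and a real document has more structure (headings, spans, tables) than this handles. The sketch only reads the text; it does not write anything back.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

def paragraphs_from_odt(odt_file):
    """Return the plain text of every text:p element in content.xml."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter("{%s}p" % TEXT_NS)]

def tagged_spans(paragraphs, tag):
    """Find <tag>...</tag> spans, letting a span cross paragraph breaks."""
    joined = "\n".join(paragraphs)
    name = re.escape(tag)
    # DOTALL lets '.' match the newline that stands in for a paragraph break.
    pattern = re.compile(r"<%s>(.*?)</%s>" % (name, name), re.DOTALL)
    return pattern.findall(joined)

# Hypothetical stand-in for a real file: a minimal content.xml zipped in memory.
sample_xml = (
    '<office:document-content xmlns:office="%s" xmlns:text="%s">'
    '<office:body><office:text>'
    '<text:p>Intro &lt;myTag&gt;first line</text:p>'
    '<text:p>second line&lt;/myTag&gt; outro</text:p>'
    '</office:text></office:body></office:document-content>'
) % (OFFICE_NS, TEXT_NS)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", sample_xml)
buffer.seek(0)

paragraphs = paragraphs_from_odt(buffer)
spans = tagged_spans(paragraphs, "myTag")
print(spans)
```

With a real file, `paragraphs_from_odt` could be given a path instead of the in-memory buffer.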

Thanks,
Robert Tucker
Volunteer
Posts: 1250
Joined: Mon Oct 08, 2007 1:34 am
Location: Manchester UK

Re: Greedy regex problem

Post by Robert Tucker »

In Writer, it seems AltSearch will work, since its Help screen says:
"...subexpression of the type (.*)any or (.+)any are searched for, the shortest matching occurrence is found, contrary to the OOo standard search, which will find the longest matching occurrence. If it is necessary to preserve compatibility, you can delimit the whole search expression with an extra pair of parentheses: ((Mi)?ster). But this will, of course, lose you the chance to cite the subexpression..."
It can also search across paragraphs.
nienberg
Posts: 28
Joined: Mon Sep 21, 2009 9:23 pm
Location: Berkeley, CA

Re: Greedy regex problem

Post by nienberg »

Wow! Just when I had given up hope. That looks very promising. I installed the extension and tested it manually. It looks like the [::BigBlock::] option does exactly what I need. Now my final question is how to call it from a Basic macro. I noticed that recording a macro to use the extension doesn't work (it records nothing). Should I be calling the sub directly? The comments in the code are not in English, so it will be a bit of a challenge, but I assume I could figure it out if that is the best approach.

Thanks very much for your help,
Robert Tucker
Volunteer
Posts: 1250
Joined: Mon Oct 08, 2007 1:34 am
Location: Manchester UK

Re: Greedy regex problem

Post by Robert Tucker »

Afraid I'm not much into macro writing. Perhaps I would be thinking more of pulling AltSearch apart to find out how it interacted with OpenOffice to do what it does – not something I want to do on a whim!
nienberg
Posts: 28
Joined: Mon Sep 21, 2009 9:23 pm
Location: Berkeley, CA

Re: Greedy regex problem

Post by nienberg »

It's a big, complicated library with all the comments and variable names in Czech, but I agree that my best bet is to pick it apart until I understand how it works.

Thanks again for your help.
bugmenot111
Posts: 17
Joined: Fri Mar 26, 2010 12:14 pm

Re: [Solved] Greedy regex problem

Post by bugmenot111 »

I have the same problem: non-greedy search is not allowed. Have you found any workaround? (I couldn't find anything useful in the AltSearch's source code)
OpenOffice 3.1 on Windows Vista
Villeroy
Volunteer
Posts: 31279
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: [Solved] Greedy regex problem

Post by Villeroy »

bugmenot111 wrote: "I have the same problem: non-greedy search is not allowed. Have you found any workaround? (I couldn't find anything useful in the AltSearch's source code)"
Did he really write a regex extension in a language without regex support?
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
nienberg
Posts: 28
Joined: Mon Sep 21, 2009 9:23 pm
Location: Berkeley, CA

Re: [Solved] Greedy regex problem

Post by nienberg »

My need was specifically to find tags with text between them, so after studying the AltSearch code I wrote a Basic subroutine that searches for the opening tag, then (starting at the location of the opening tag) finds the next closing tag, and then selects the text in between. So it doesn't use any of the regex capabilities at all, just the plain search capabilities.
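The steps described above can be sketched on a plain string. This is not the poster's Basic routine, which was not shared, and it does not touch the Writer API; the function name `find_tagged_spans` and the sample text are assumptions made for the illustration, written in Python.

```python
def find_tagged_spans(text, open_tag, close_tag):
    """Collect the text between each opening tag and the next closing tag."""
    spans = []
    position = 0
    while True:
        # Step 1: look for the next opening tag.
        start = text.find(open_tag, position)
        if start == -1:
            break
        # Step 2: starting just past the opening tag, find the next closing tag.
        end = text.find(close_tag, start + len(open_tag))
        if end == -1:
            break
        # Step 3: take everything in between.
        spans.append(text[start + len(open_tag):end])
        position = end + len(close_tag)
    return spans

sample = ("a <myTag>one</myTag> b "
          "<myTag>two <anotherTag>x</anotherTag>\nthree</myTag> c")
result = find_tagged_spans(sample, "<myTag>", "</myTag>")
print(result)
```

Because it stops at the first closing tag after each opening tag, this behaves like a non-greedy match, and neither nested tags with other names nor line breaks inside the span get in the way.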