[Solved] "Find regexp substring" function

Discuss the spreadsheet application
Post Reply
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

[Solved] "Find regexp substring" function

Post by mclaudt »

In Excel there is no direct regexp-related cell function, so if you want to get some regexp substring from, for example, cell A1, you should write your own function in VB:

Code: Select all

Function ReFind(FindIn, FindWhat As String, Optional IgnoreCase As Boolean = False)
    Dim i As Long
    Dim matchCount As Integer
    Dim RE As Object, allMatches As Object, aMatch As Object
    Set RE = CreateObject("vbscript.regexp")
    RE.Pattern = FindWhat
    RE.IgnoreCase = IgnoreCase
    RE.Global = True
    Set allMatches = RE.Execute(FindIn)
    matchCount = allMatches.Count
    If matchCount >= 1 Then
        ReDim rslt(0 To allMatches.Count - 1)
        For i = 0 To allMatches.Count - 1
            rslt(i) = allMatches(i).Value
        Next i
        ReFind = rslt
    Else
        ReFind = ""
    End If
End Function
and then assign your cell =ReFind(A1;"^\d\d")

In OO Basic I didn't find any information concerning regexp find/replace function, there is only one mention about global find/replace method, accessible via menu.

What is the simplest way to get regexp substring from some cell in OOCalc?
Last edited by mclaudt on Thu May 13, 2010 12:29 am, edited 1 time in total.
OpenOffice 3.1 on openSUSE 11.0
mriosv
Volunteer
Posts: 651
Joined: Mon Mar 09, 2009 1:12 am
Location: Galiza (España)

Re: "find regexp substring" function

Post by mriosv »

Menu/Help/OOo help- Index: regular expressions
LibreOffice 3.5.4 AOo-3.4 on Win 7 Ultimate
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

Menu/Help/OOo help- Index: regular expressions
ORLY?? Thanks a lot :bravo:

I'm going to use regexp manipulation within a cell, so regexp-oriented function such a MATCH LOOKUP SEARCH don't help.

In A1 I have "Giant Black Monolith h=50, w=20" and I want put in A2 something like =MySuperFunction(A1;"h=\d\d") and get "h=50".
OpenOffice 3.1 on openSUSE 11.0
mriosv
Volunteer
Posts: 651
Joined: Mon Mar 09, 2009 1:12 am
Location: Galiza (España)

Re: "find regexp substring" function

Post by mriosv »

SEARCH() support regular expressions, then you can use to extract the substring with MID()
=MID(A1;SEARCH("h=[:digit:]{2}";A1;1) ;4)
LibreOffice 3.5.4 AOo-3.4 on Win 7 Ultimate
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

mriosv wrote:SEARCH() support regular expressions, then you can use to extract the substring with MID()
=MID(A1;SEARCH("h=[:digit:]{2}";A1;1) ;4)
Yes, but in this example I have to know exactly the length of regexp substring! In case of "h[:space:]*=[:space:]*\d\d" I don't know this length cause it is not fixed, it can be "h=50" and "h = 50".

This task seems to be quite obvious, hope there is no need to learn Python or JavaScript scripting in OO to do that.
OpenOffice 3.1 on openSUSE 11.0
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

OpenOffice 3.1 on openSUSE 11.0
mriosv
Volunteer
Posts: 651
Joined: Mon Mar 09, 2009 1:12 am
Location: Galiza (España)

Re: "find regexp substring" function

Post by mriosv »

Find enclosed a solution using array formulas.

A1: "Giant Black Monolith h = 50, w=20"

String searched: "h[:space:]*=[:space:]*[:digit:]{2}"

A2: minimun length of searched string
A2: =MIN(IF(ISNUMBER(SEARCH("h[:space:]*=[:space:]*[:digit:]{2}";LEFT(MID(A1;SEARCH("h[:space:]*=[:space:]*[:digit:]{2}";A1;1);999);ROW(B1:B999));1));ROW(B1:B999);999))

A3: shorter searched string
A3: =MID(A1;SEARCH("h[:space:]*=[:space:]*[:digit:]{2}";A1;1);MIN(IF(ISNUMBER(SEARCH("h[:space:]*=[:space:]*[:digit:]{2}";LEFT(MID(A1;SEARCH("h[:space:]*=[:space:]*[:digit:]{2}";A1;1);999);ROW(B1:B999));1));ROW(B1:B999);999)))

A2 and A3 are arrays, so to introduce use Ctrl+Shift+Enter.
Attachments
SearchedStringWithRegularExpressions.ods
(9.57 KiB) Downloaded 523 times
LibreOffice 3.5.4 AOo-3.4 on Win 7 Ultimate
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

mriosv wrote:Find enclosed a solution using array formulas.
Thanks, you have really monstrous skills :) It works fine for this fixed example.

But I'm still finding global solution for this task. Cause there is giant overhead, instead of direct addressing to the length of substring found, we have to produce 1000 similar calculation to define that length :crazy:

And also If I want use lookahead and lookbehind in regexp, this trick of defining the length simply will not work.

There must be a general solution, without constructing such a crutch.
OpenOffice 3.1 on openSUSE 11.0
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

Here is the same question, without any success

http://www.oooforum.org/forum/viewtopic.phtml?t=91213
OpenOffice 3.1 on openSUSE 11.0
User avatar
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: "find regexp substring" function

Post by acknak »

It might help if you gave us a sample data set to work with, showing the input data and what you need to get out of it.

It's often the case that a specific problem is much easier than solving the general one. For example, if the substring always falls at the beginning or end of the input string.

Also, are you sure that OOo BASIC has no regular expression search function(s)? I seem to remember that it does, but it's been a long time since I looked at it.

If BASIC doesn't support regexps, Python certainly does.

You probably should ask in the programming area of this forum for help with either of those.

This sort of problem would be very easy if only SUBSTITUTE supported regular expressions. Why the OOo developers went out of their way to add regexps to the search functions and never did it for SUBSTITUTE, I can only guess. Here's a request to add regexp support: Issue 106099: support regular expressions (regexps) in SUBSTITUTE function, maybe some more votes will get the developers to at least open the issue. You can register there and add your vote (up to two) or comment.
AOO4/LO5 • Linux • Fedora 23
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

Thanks for your help, I voted for that issue and I also аррeаl folks for voting here.

acknak wrote:It might help if you gave us a sample data set to work with, showing the input data and what you need to get out of it.
This is a column of semi-unstructured definitions such as

Code: Select all

Giant Black Monolith h=50, w=20
Small Green Monolith h=10-11,5, w=2.0
Monolith Pro (H=50); w=20~25
Medium Black Monolith w=20
and I want to get accurate table with 'Name', 'Height', 'Width' headers, so there is no way except using all power of regexp without monstrous detours.

Now I'm looking for user-defined functions with Python. But, anyway, any comments are welcome.
Last edited by mclaudt on Wed May 12, 2010 6:22 pm, edited 1 time in total.
OpenOffice 3.1 on openSUSE 11.0
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

acknak wrote:This sort of problem would be very easy if only SUBSTITUTE supported regular expressions.
By the way, even SUBSTITUTE support for regexp will not solve the problem.
It would be nice if simply SUBSTRING (it supports regexp today!) returns substring found, not silly character position that is totally unusable!
Last edited by mclaudt on Wed May 12, 2010 7:30 pm, edited 1 time in total.
OpenOffice 3.1 on openSUSE 11.0
User avatar
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: "find regexp substring" function

Post by acknak »

By the way, even SUBSTITUTE support for regexp will not solve the problem.
I expect problems do exist that it wouldn't fix, but I haven't seen one yet.

=SUBSTITUTE(A1; ".*([hH]=[0-9][0-9]*).*"; "$1") would give you the height values (if SUBSTITUTE supported regexps!)
=SUBSTITUTE(A1; ".*([wW]=[0-9][0-9]*).*"; "$1") would give you the widths

Obviously, you could do that right now in a few clicks using the find/replace dialog; there's just no way to do it as a formula.
AOO4/LO5 • Linux • Fedora 23
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: "find regexp substring" function

Post by mclaudt »

It seems that I find solution.

Code: Select all

Function regex(a,b,c)

' Attention - made by novice, so can contain bugs

' a - string or cell to search in
' b - regexp string or cell containing regexp string
' c - back-reference number - analogy to \n in regexp syntax

  
' prepare regexp search options
oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
oOptions.searchString = b
oTextSearch.setOptions(oOptions)

' search first substring
oFound = oTextSearch.searchForward(a, 0, Len(a))

If  oFound.subRegExpressions=0 then 
  regex = "No result in that back-reference level"
  Exit Function
Else
  nStart = oFound.startOffset()
  nEnd = oFound.endOffset()
  regex = Mid(a, nStart(c) + 1, nEnd(c) - nStart(c))
End If

End Function
Based on code from this this thread
See also http://api.openoffice.org/docs/common/r ... esult.html
OpenOffice 3.1 on openSUSE 11.0
mclaudt
Posts: 13
Joined: Tue May 11, 2010 11:15 pm

Re: [Solved]"find regexp substring" function

Post by mclaudt »

Noticed that this script works about 3-4 times slower than VBScript analog. It it possible that there is also some overhead.
OpenOffice 3.1 on openSUSE 11.0
Post Reply