Not able to read special characters by XWordCursor.getString

Posted: Thu Apr 09, 2020 5:29 am
by sumit7325
Hi All,
I can read a Docx file word by word using the code snippet below. The problem is that special characters such as $, { and } are dropped from the strings returned by XWordCursor.getString.

Code:
// xCompLoader, sUrl and propertyValues are initialised earlier (not shown).
XComponent xComp = xCompLoader.loadComponentFromURL(
        sUrl, "_blank", 0, propertyValues);

com.sun.star.text.XTextDocument xTextDocument =
        (com.sun.star.text.XTextDocument) UnoRuntime.queryInterface(
                com.sun.star.text.XTextDocument.class, xComp);

XText xText = xTextDocument.getText();

XSimpleText xSimpleText = UnoRuntime.queryInterface(
        XSimpleText.class, xText);
XTextCursor xTextCursor = xSimpleText.createTextCursor();

// Extend the selection to the end, so the cursor spans the whole text.
xTextCursor.gotoEnd(true);

XTextRange xTextRange = UnoRuntime.queryInterface(
        XTextRange.class, xTextCursor);
String sString = xTextRange.getString();

// Create a second cursor at the start of the text and use it as a word cursor.
XTextCursor textCursor = xTextRange.getText().createTextCursorByRange(xTextRange.getStart());
XWordCursor wordCursor = (XWordCursor)
        UnoRuntime.queryInterface(XWordCursor.class, textCursor);

wordCursor.gotoStart(false);     // go to start of text

int wordCount = 0;
String currWord;
do {
    // Select from the current position to the end of the word.
    wordCursor.gotoEndOfWord(true);
    currWord = wordCursor.getString();
    if (currWord.length() > 0) {
        // System.out.println("<" + currWord + ">");
        wordCount++;
        System.out.println(currWord);
    }
} while (wordCursor.gotoNextWord(false));


The output of the code for the attached document is:

Code:
TestingFirstWord
Hello
test
name
employeeno


The expected output is:

Code:
TestingFirstWord
Hello
test
${name}
${employeeno}


I also found a similar thread at http://openoffice.2283327.n4.nabble.com/XWordCursor-gotoEndOfWord-misbehavior-td2811431.html but could not get much information out of it.

Any help or clue is appreciated. Thanks.

Re: Not able to read special characters by XWordCursor.getSt

Posted: Thu Apr 09, 2020 7:12 am
by Zizi64
Maybe it depends on the definition of "word" as a language unit: parentheses (even the special ones) are not part of a human-language "word".
So the output you expect contains more than "simple words", while the output you get contains only "real words".

Re: Not able to read special characters by XWordCursor.getSt

Posted: Thu Apr 09, 2020 8:07 am
by JeJe
Write your own function to do it the way you want.

In Basic, a simple split almost gives you the result you want (the exception is the double space):

sts = split("TestingFirstWord Hello test ${name} ${employeeno}", " ")
for i = 0 to ubound(sts)
    msgbox sts(i)
next

Re: Not able to read special characters by XWordCursor.getSt

Posted: Thu Apr 09, 2020 8:28 am
by sumit7325
Thank you for the quick reply,
Code:
sts = split("TestingFirstWord Hello test ${name} ${employeeno}"," ")
for i =0 to ubound(sts)
msgbox sts(i)
next

In my case the separator can be a single space or a tab, which is why I am trying to extract word by word irrespective of the kind of whitespace.

Re: Not able to read special characters by XWordCursor.getSt

Posted: Thu Apr 09, 2020 9:49 am
by JeJe
There are some options for XBreakIterator which you could look at, but it's a trivial task to write code that does what you want if the native function doesn't: go through the string and, depending on each character, either start a new word or continue the current one.

https://www.openoffice.org/api/docs/com ... dType.html
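
In Java, the character-by-character approach described above might look like the following minimal sketch. It assumes that words are separated by whitespace only, so $, { and } stay attached to their word, and that the input is the plain document text (for example the string returned by xTextRange.getString() in the first post). The class name WhitespaceTokenizer and the method tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenizer {

    /** Splits text into tokens separated by any run of whitespace. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace closes the token in progress, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Every other character, including $, { and }, joins the token.
                current.append(c);
            }
        }
        // Flush the last token when the text does not end in whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample text (an assumption) with a double space and a tab.
        String sample = "TestingFirstWord  Hello\ttest ${name} ${employeeno}";
        for (String token : tokenize(sample)) {
            System.out.println(token);
        }
    }
}
```

Because runs of whitespace never produce empty tokens, double spaces and tabs need no special handling. Character.isWhitespace does not count non-breaking spaces as whitespace, so documents containing them would need an extra check.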

Re: Not able to read special characters by XWordCursor.getSt

Posted: Thu Apr 09, 2020 2:19 pm
by Lupp
I don't know what you intend to do with your words, or whether the formatting matters. Either way, I feel sure that a WordCursor is the wrong tool in this case.
If you can bear stepping down to "stupid Basic" you can try what I would suggest (and probably ennoble it later by moving to a different language / IDE). From my point of view the built-in (primitive) Basic with its (powerful) bridge to the API is the means of choice for simple tasks of this kind.
Open your file (Actually .docx? Why?) with AOO or LibO. (I don't know about NeoOffice.)
Use Tools>Macros>Organize Macros>...Basic ... to create a Basic module (located in the document or elsewhere).
Insert the following code there.
Code:
Sub getExtendedWords()
  doc0    = ThisComponent
  REM Find every run of non-whitespace characters, whatever they are.
  sd      = doc0.createSearchDescriptor
  sd.SearchRegularExpression = True
  sd.SearchString = "\S+"
  myWords = doc0.FindAll(sd)
  u       = myWords.Count - 1
  REM Write the matches to column A of a new spreadsheet document.
  doc1    = StarDesktop.loadComponentFromUrl("private:factory/scalc", "_blank", 0, Array())
  s1c1    = doc1.Sheets(0).Columns(0)
  outRg   = s1c1.getCellRangeByPosition(0, 0, 0, u)
  outRgDA = outRg.getDataArray()
  For j = 0 To u
    outRgDA(j)(0) = myWords(j).String
  Next j
  outRg.setDataArray(outRgDA)
End Sub

...and run it.

The attached file contains the code and a demo. If you (after checking) give permission to execute the code, it's a matter of milliseconds.
Of course you can output the results in a different way. I did it this way because it is simple and gives a good overview.

Re: Not able to read special characters by XWordCursor.getSt

Posted: Fri Apr 10, 2020 8:01 am
by sumit7325
Thank you, Lupp, for the response and the detailed explanation. My intention is to validate every word in a Docx template against a certain regular expression. For example, a user can upload a Docx template with placeholders such as ${name}, ${employeeNo}, etc. So I have to do a two-step validation. First, I have to check whether the placeholder syntax is correct: something like $${name} or ${employee}}, or a missing bracket, counts as a wrong expression. Second, if the syntax is correct, I have to check the placeholder against an application-specific placeholder list. For example, if the user has added ${test} to the template, I have to ignore it even though it is a valid expression, because it is not application-specific. I have to perform these operations in a Java-based web application, which is why I need to extract every word and stick to Java only :knock:
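
The two-step check described in this post could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: each word has already been extracted as a whitespace-delimited string, a placeholder name may contain anything except '$', '{', '}' and whitespace, and any word containing one of those three characters is treated as an attempted placeholder. The class PlaceholderCheck, the Result values and the list of known names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderCheck {

    public enum Result { PLAIN_WORD, MALFORMED, UNKNOWN, KNOWN }

    // Assumed syntax: "${" + name + "}", where the name contains no
    // '$', '{', '}' or whitespace.
    private static final Pattern WELL_FORMED =
            Pattern.compile("\\$\\{([^$\\{\\}\\s]+)\\}");

    public static Result classify(String word, Set<String> knownNames) {
        // Assumption: a word without '$', '{' or '}' is ordinary text.
        if (word.indexOf('$') < 0 && word.indexOf('{') < 0 && word.indexOf('}') < 0) {
            return Result.PLAIN_WORD;
        }
        // Step 1: the whole word must be exactly one well-formed placeholder.
        Matcher m = WELL_FORMED.matcher(word);
        if (!m.matches()) {
            return Result.MALFORMED;
        }
        // Step 2: the name must be on the application-specific list.
        return knownNames.contains(m.group(1)) ? Result.KNOWN : Result.UNKNOWN;
    }

    public static void main(String[] args) {
        // Hypothetical application-specific placeholder names.
        Set<String> known = new HashSet<>(Arrays.asList("name", "employeeNo"));
        String[] words = { "Hello", "${name}", "$${name}", "${employee}}", "${test}" };
        for (String word : words) {
            System.out.println(word + " -> " + classify(word, known));
        }
    }
}
```

In this sketch a placeholder with punctuation attached, such as "${name},", is reported as MALFORMED because the whole word has to match. Whether that is the desired behaviour depends on the templates.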

Re: Not able to read special characters by XWordCursor.getSt

Posted: Fri Apr 10, 2020 9:46 am
by robleyd
Are you aware of https://poi.apache.org/ ? Just in case it might be helpful for you.

Re: Not able to read special characters by XWordCursor.getSt

Posted: Fri Apr 10, 2020 11:09 am
by sumit7325
Thanks, robleyd, for the suggestion, but I have tried almost all the libraries out there, including Apache POI, DOCX4j, Documents4j and some Python solutions, and under the hood they all just use Apache OpenOffice.

Re: Not able to read special characters by XWordCursor.getSt

Posted: Fri Apr 10, 2020 1:54 pm
by Lupp
sumit7325 wrote:... My intention is to validate every word in a Docx template against a certain regular expression. For example, a user can upload a Docx template with placeholders such as ${name}, ${employeeNo}, etc. So I have to do a two-step validation. First, I have to check whether the placeholder syntax is correct: something like $${name} or ${employee}}, or a missing bracket, counts as a wrong expression. Second, if the syntax is correct, I have to check the placeholder against an application-specific placeholder list. For example, if the user has added ${test} to the template, I have to ignore it even though it is a valid expression, because it is not application-specific. I have to perform these operations in a Java-based web application, which is why I need to extract every word and stick to Java only :knock:

Nothing of what you say here surprises me, except the :knock: .
However, I cannot understand your lack of understanding of my proposal...
sumit7325 wrote:Thanks, robleyd, for the suggestion, but I have tried almost all the libraries out there, including Apache POI, DOCX4j, Documents4j and some Python solutions, and under the hood they all just use Apache OpenOffice.
...and your response to "robleyd".
You are obviously using some successor of OpenOffice.org. Otherwise the "code snippet" you started with would make no sense.
1. You opened your something.docx with it. (That was probably via the Java bridge, but that doesn't matter here.)
2. You search for "words", using the term in a sense the WordCursor doesn't know.
== Therefore the WordCursor is the wrong tool.
3. In fact you search for "AttemptedInsertionOfAPlaceholder".
4. If an unattended run of software (based on java or avaj or whatever) shall do this,
== you need to tell your program what you are looking for, and
== you should try a regular expression describing this concept syntactically.
5. Having found the suspects this way you want to check them for syntactical correctness as placeholders.
(6. An attempt not being correct may require a response ... output ...)
7. The syntactically correct placeholders need to be checked against a semantic(kind of)/restrictive criterion.

Everything can be done by efficient means available via the services / interfaces and their methods provided by AOO and (probably even better) by LibreOffice (and probably also by NeoOffice). Your Java has a bridge to the API of whatever you are using. It must be able to make the in-memory representation of your TextDocument create a SearchDescriptor and the like...

As I would see it
Code:
regExAttempt = "\$[^}\s]+(\}+)?"  REM Supposed attempt
and
Code:
regExCorrect = "(?<=(^|\s))\$\{[^\{}\s]+\}(?=(\s|$))(?!\})"
are reasonable candidates when looking for what you need with the help of a SearchDescriptor ...
If I had your lists of acceptable and of mandatory placeholders I might create a "complete" solution in Basic using another hour or two (at most). Equally it must be feasible with any language / IDE rightly claiming to come with a sufficient bridge to "our" API. :knock:
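
Both expressions should also be usable with java.util.regex once the backslashes are doubled for a Java string literal. The sketch below is an illustration under assumptions, not Lupp's code: it runs the patterns over a plain string (for example the text returned by xTextRange.getString()) instead of going through a SearchDescriptor, and the class name PlaceholderSearch, the method findAll and the sample text are invented. A SearchDescriptor uses the office's own regex engine, which may differ from Java's in details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSearch {

    // regExAttempt from the post: '$' followed by non-'}' non-space
    // characters and optional closing braces.
    public static final Pattern ATTEMPT =
            Pattern.compile("\\$[^\\}\\s]+(\\}+)?");

    // regExCorrect from the post: a "${name}" delimited by whitespace
    // or by the start / end of the text.
    public static final Pattern CORRECT =
            Pattern.compile("(?<=(^|\\s))\\$\\{[^\\{\\}\\s]+\\}(?=(\\s|$))(?!\\})");

    /** Returns every match of the pattern in the text, in order. */
    public static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample text (an assumption) with one valid and three invalid placeholders.
        String text = "Hello ${name} $${name} ${employee}} ${test";
        System.out.println("attempts: " + findAll(ATTEMPT, text));
        System.out.println("correct:  " + findAll(CORRECT, text));
    }
}
```

For this sample, ATTEMPT finds all four candidates and CORRECT accepts only ${name}. The accepted ones would still have to be checked against the application's placeholder list.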