[Solved-Info] Removing newline to correct OCR formatting

Post by davidcoxmex » Tue Sep 08, 2020 6:21 pm

Hi, I am new here, and I am answering my own question. I am trying to correct OCR problems. A typical one is a line feed at the end of every line, even where the line is not the end of a paragraph.
Sample:
If we would bring others to Christ we must turn away from all sin, and worldliness
and selfishness with our whole heart, yielding to Jesus the absolute
lordship over our thoughts, purposes, and actions. If there is any
direction in which we are seeking to have our own way and not letting
Him have His own way in our lives, our power will be crippled and men
lost that we might have saved.


-The paragraph is broken up by line feeds that do not mark the end of a paragraph.

PROBLEM:
OpenOffice and LibreOffice (I use LO) will not let you do a simple
Search: .\n Replace: .\p (Regular Expressions on)

So I worked up a solution, convoluted but working. Since I found this forum and studied a good number of discussions here, I figured I would share it. Note: no single discussion gave me the answer per se. A lot of them gave me pieces that I put together myself.

So the key point here is to replace a period followed by a line feed with a period followed by a paragraph mark. You cannot do that easily.

SOLUTION: Replace the line feeds and the sentence-ending punctuation (. ? !) with placeholders, combine the placeholders where a sentence ends at a line feed, convert those combinations back to punctuation plus a paragraph mark, and finally turn the remaining line feeds into plain spaces.
We are going to make a number of search-and-replace passes through the text using placeholders: <nl> for a line feed, <period> for a period, <ques> for a question mark, <ppara> for a period followed by a paragraph mark, <qpara> for a question mark followed by a paragraph mark, and <ipara> for an exclamation mark followed by a paragraph mark. The exclamation mark itself needs no placeholder; !<nl> can be replaced directly.
*Note: This solution depends on the "Regular Expressions" tick box being on or off at the right moments, so I have marked each search and replace with the setting it needs. If you get these wrong, it won't work at all, so be careful to observe that indication. Always work on a copy of your original text.

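-Use the "Replace All" button for every one of these searches.
Search: \n Replace: <nl> (RE on)
Search: _<nl> Replace: <nl> (RE off) Rem (_ stands for a regular space before <nl>) 1st time
Search: _<nl> Replace: <nl> (RE off) Rem (same search again) 2nd time
-Note: There may be one or more spaces between the period and the line feed. The two passes above remove up to two of them.
Search: . Replace: <period> (RE off)
Search: <period><nl> Replace: <ppara> (RE off)
Search: ? Replace: <ques> (RE off)
Search: <ques><nl> Replace: <qpara> (RE off)
Search: !<nl> Replace: <ipara> (RE off)
Search: <ppara> Replace: .\n (RE on)
Search: <qpara> Replace: ?\n (RE on)
Search: <ipara> Replace: !\n (RE on)
-Note: With regular expressions on, \n in the Search box matches a line feed, but \n in the Replace box inserts a paragraph mark. That is what makes the last three replacements work.
Search: <nl> Replace: _ (RE on) Rem (_ is a single regular space)
Search: <period> Replace: . (RE on)
Search: <ques> Replace: ? (RE on)
-Note: These last three passes come from the macro below. They turn the remaining line feeds into spaces and restore the periods and question marks that were not at the end of a line.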
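
To see the order of the passes without touching a document, here is a minimal sketch in Python rather than in LibreOffice. It assumes the OCR text is a plain string in which "\n" is the line feed, and it uses a blank line ("\n\n") to stand in for a paragraph mark; both are assumptions for illustration, as is the function name join_ocr_lines.

```python
def join_ocr_lines(text: str) -> str:
    """Apply the placeholder passes to a plain string.

    Assumptions (not from the post): "\n" is the OCR line feed and a
    blank line ("\n\n") stands in for a paragraph mark.
    """
    para = "\n\n"

    # Pass 1: every line feed becomes a placeholder.
    text = text.replace("\n", "<nl>")

    # Passes 2 and 3: remove up to two spaces sitting before a line feed.
    for _ in range(2):
        text = text.replace(" <nl>", "<nl>")

    # Mark sentence ends that coincide with a line feed.
    text = text.replace(".", "<period>")
    text = text.replace("<period><nl>", "<ppara>")
    text = text.replace("?", "<ques>")
    text = text.replace("<ques><nl>", "<qpara>")
    text = text.replace("!<nl>", "<ipara>")

    # Turn the combined placeholders into punctuation plus a paragraph break.
    text = text.replace("<ppara>", "." + para)
    text = text.replace("<qpara>", "?" + para)
    text = text.replace("<ipara>", "!" + para)

    # Remaining line feeds were mid-paragraph: join with a space,
    # then restore the punctuation that was not at a line end.
    text = text.replace("<nl>", " ")
    text = text.replace("<period>", ".")
    text = text.replace("<ques>", "?")
    return text


sample = "one two \nthree.\nFour five\nsix?\nSeven!\nend"
print(join_ocr_lines(sample))
```

The order matters: the line feeds have to become <nl> before the punctuation is examined, and <nl> may only be turned into a space after the sentence-ending combinations have been taken out.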

PROBLEMS WITH THIS SOLUTION
I admit that this is not perfect. The problem I am seeing is that chapter or section titles and list items usually have no punctuation at the end, so their line feed gets replaced with a space and they run into the following text. You will need to fix that by hand, either afterward or before running the macro. A workaround is to go through the text first and put a period or a regular paragraph mark at the end of each of these lines. Existing paragraph marks are never replaced, so any paragraphs you add beforehand will survive.
As always, work with copies of your text and inspect the finished product several times before using it.
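
One way to find those title and list lines before running anything is to list the short lines that do not end in sentence punctuation. The sketch below is only a heuristic in Python, again assuming a plain string with "\n" as the line feed; the function name lines_needing_review and the 40-character limit are made up for illustration and would need tuning for a real text.

```python
from typing import List


def lines_needing_review(text: str, max_len: int = 40) -> List[int]:
    """Return 1-based numbers of short lines that do not end in . ? or !

    Heuristic only: a short line without closing punctuation is often a
    title or a list item, but it can also be ordinary wrapped text.
    """
    flagged = []
    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.rstrip()
        if not stripped:
            continue  # blank lines are not candidates
        if len(stripped) <= max_len and not stripped.endswith((".", "?", "!")):
            flagged.append(number)
    return flagged


sample = (
    "Chapter One\n"
    "If we would bring others to Christ we must turn away from all sin,\n"
    "and be saved.\n"
    "\n"
    "Apples"
)
print(lines_needing_review(sample))
```

The flagged lines are the ones to give a period or a real paragraph mark by hand before the passes are run.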

FOR THE BRAVE ONLY!
If you are fixing OCR texts regularly, you should make this into a macro. (Note that I am using LibreOffice in Spanish, but the macro code has to be in English.) Open Tools->Macros->Edit Macros, go to My Macros and Dialogs, and click on Module1. That should get you to your macros for OpenOffice or LibreOffice. My recommendation at this point is to back up what you have: open Windows Notepad, click in the right-hand pane of the OpenOffice/LibreOffice macro editor, select all (Control+E in my Spanish version, Control+A in the English one), copy all of your existing macros, and paste them into Notepad. If you copy my macro below and something messes up or doesn't work, copy the text from Notepad and paste it back into the macro editor over what you have done.

Continue with putting the OCR fixing solution into your macros...

In the right-hand window there should be Sub Main at the beginning, then any macros you already have, and End Sub at the end. Right after the last End Sub, insert the code below. This will put the OCR line feed solution into your macro set. (To run the macro without installing it in the main menu, click on Tools->Macros->Execute Macro. In the left-hand box, click My Macros->Standard->Module1. When you click on "Module1", the right-hand box should populate with all of your macros. Click on the removeOCRlinefeeds macro there and execute it.)

***** NOTE ***** Go to the bottom of this post and read how to install the macro to your menu.

sub removeOCRlinefeeds
rem ----------------------------------------------------------------------
rem define variables
dim document   as object
dim dispatcher as object
rem ----------------------------------------------------------------------
rem get access to the document
document   = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")

rem ----------------------------------------------------------------------
dim args1(21) as new com.sun.star.beans.PropertyValue
args1(0).Name = "SearchItem.StyleFamily"
args1(0).Value = 2
args1(1).Name = "SearchItem.CellType"
args1(1).Value = 0
args1(2).Name = "SearchItem.RowDirection"
args1(2).Value = true
args1(3).Name = "SearchItem.AllTables"
args1(3).Value = false
args1(4).Name = "SearchItem.SearchFiltered"
args1(4).Value = false
args1(5).Name = "SearchItem.Backward"
args1(5).Value = false
args1(6).Name = "SearchItem.Pattern"
args1(6).Value = false
args1(7).Name = "SearchItem.Content"
args1(7).Value = false
args1(8).Name = "SearchItem.AsianOptions"
args1(8).Value = false
args1(9).Name = "SearchItem.AlgorithmType"
args1(9).Value = 1
args1(10).Name = "SearchItem.SearchFlags"
args1(10).Value = 65536
args1(11).Name = "SearchItem.SearchString"
args1(11).Value = "\n"
args1(12).Name = "SearchItem.ReplaceString"
args1(12).Value = "<nl>"
args1(13).Name = "SearchItem.Locale"
args1(13).Value = 255
args1(14).Name = "SearchItem.ChangedChars"
args1(14).Value = 2
args1(15).Name = "SearchItem.DeletedChars"
args1(15).Value = 2
args1(16).Name = "SearchItem.InsertedChars"
args1(16).Value = 2
args1(17).Name = "SearchItem.TransliterateFlags"
args1(17).Value = 1073743104
args1(18).Name = "SearchItem.Command"
args1(18).Value = 3
args1(19).Name = "SearchItem.SearchFormatted"
args1(19).Value = false
args1(20).Name = "SearchItem.AlgorithmType2"
args1(20).Value = 2
args1(21).Name = "Quiet"
args1(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args1())

rem ----------------------------------------------------------------------
dim args2(21) as new com.sun.star.beans.PropertyValue
args2(0).Name = "SearchItem.StyleFamily"
args2(0).Value = 2
args2(1).Name = "SearchItem.CellType"
args2(1).Value = 0
args2(2).Name = "SearchItem.RowDirection"
args2(2).Value = true
args2(3).Name = "SearchItem.AllTables"
args2(3).Value = false
args2(4).Name = "SearchItem.SearchFiltered"
args2(4).Value = false
args2(5).Name = "SearchItem.Backward"
args2(5).Value = false
args2(6).Name = "SearchItem.Pattern"
args2(6).Value = false
args2(7).Name = "SearchItem.Content"
args2(7).Value = false
args2(8).Name = "SearchItem.AsianOptions"
args2(8).Value = false
args2(9).Name = "SearchItem.AlgorithmType"
args2(9).Value = 1
args2(10).Name = "SearchItem.SearchFlags"
args2(10).Value = 65536
args2(11).Name = "SearchItem.SearchString"
args2(11).Value = " <nl>"
args2(12).Name = "SearchItem.ReplaceString"
args2(12).Value = "<nl>"
args2(13).Name = "SearchItem.Locale"
args2(13).Value = 255
args2(14).Name = "SearchItem.ChangedChars"
args2(14).Value = 2
args2(15).Name = "SearchItem.DeletedChars"
args2(15).Value = 2
args2(16).Name = "SearchItem.InsertedChars"
args2(16).Value = 2
args2(17).Name = "SearchItem.TransliterateFlags"
args2(17).Value = 1073743104
args2(18).Name = "SearchItem.Command"
args2(18).Value = 3
args2(19).Name = "SearchItem.SearchFormatted"
args2(19).Value = false
args2(20).Name = "SearchItem.AlgorithmType2"
args2(20).Value = 2
args2(21).Name = "Quiet"
args2(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args2())

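rem ----------------------------------------------------------------------
rem second " <nl>" -> "<nl>" pass: the recorder left an empty dispatch here,
rem so the previous search is simply run again
dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args2())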

rem ----------------------------------------------------------------------
dim args4(21) as new com.sun.star.beans.PropertyValue
args4(0).Name = "SearchItem.StyleFamily"
args4(0).Value = 2
args4(1).Name = "SearchItem.CellType"
args4(1).Value = 0
args4(2).Name = "SearchItem.RowDirection"
args4(2).Value = true
args4(3).Name = "SearchItem.AllTables"
args4(3).Value = false
args4(4).Name = "SearchItem.SearchFiltered"
args4(4).Value = false
args4(5).Name = "SearchItem.Backward"
args4(5).Value = false
args4(6).Name = "SearchItem.Pattern"
args4(6).Value = false
args4(7).Name = "SearchItem.Content"
args4(7).Value = false
args4(8).Name = "SearchItem.AsianOptions"
args4(8).Value = false
args4(9).Name = "SearchItem.AlgorithmType"
args4(9).Value = 0
args4(10).Name = "SearchItem.SearchFlags"
args4(10).Value = 65536
args4(11).Name = "SearchItem.SearchString"
args4(11).Value = "."
args4(12).Name = "SearchItem.ReplaceString"
args4(12).Value = "<period>"
args4(13).Name = "SearchItem.Locale"
args4(13).Value = 255
args4(14).Name = "SearchItem.ChangedChars"
args4(14).Value = 2
args4(15).Name = "SearchItem.DeletedChars"
args4(15).Value = 2
args4(16).Name = "SearchItem.InsertedChars"
args4(16).Value = 2
args4(17).Name = "SearchItem.TransliterateFlags"
args4(17).Value = 1073743104
args4(18).Name = "SearchItem.Command"
args4(18).Value = 3
args4(19).Name = "SearchItem.SearchFormatted"
args4(19).Value = false
args4(20).Name = "SearchItem.AlgorithmType2"
args4(20).Value = 1
args4(21).Name = "Quiet"
args4(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args4())

rem ----------------------------------------------------------------------
dim args5(21) as new com.sun.star.beans.PropertyValue
args5(0).Name = "SearchItem.StyleFamily"
args5(0).Value = 2
args5(1).Name = "SearchItem.CellType"
args5(1).Value = 0
args5(2).Name = "SearchItem.RowDirection"
args5(2).Value = true
args5(3).Name = "SearchItem.AllTables"
args5(3).Value = false
args5(4).Name = "SearchItem.SearchFiltered"
args5(4).Value = false
args5(5).Name = "SearchItem.Backward"
args5(5).Value = false
args5(6).Name = "SearchItem.Pattern"
args5(6).Value = false
args5(7).Name = "SearchItem.Content"
args5(7).Value = false
args5(8).Name = "SearchItem.AsianOptions"
args5(8).Value = false
args5(9).Name = "SearchItem.AlgorithmType"
args5(9).Value = 0
args5(10).Name = "SearchItem.SearchFlags"
args5(10).Value = 65536
args5(11).Name = "SearchItem.SearchString"
args5(11).Value = "<period><nl>"
args5(12).Name = "SearchItem.ReplaceString"
args5(12).Value = "<ppara>"
args5(13).Name = "SearchItem.Locale"
args5(13).Value = 255
args5(14).Name = "SearchItem.ChangedChars"
args5(14).Value = 2
args5(15).Name = "SearchItem.DeletedChars"
args5(15).Value = 2
args5(16).Name = "SearchItem.InsertedChars"
args5(16).Value = 2
args5(17).Name = "SearchItem.TransliterateFlags"
args5(17).Value = 1073743104
args5(18).Name = "SearchItem.Command"
args5(18).Value = 3
args5(19).Name = "SearchItem.SearchFormatted"
args5(19).Value = false
args5(20).Name = "SearchItem.AlgorithmType2"
args5(20).Value = 1
args5(21).Name = "Quiet"
args5(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args5())

rem ----------------------------------------------------------------------
dim args6(21) as new com.sun.star.beans.PropertyValue
args6(0).Name = "SearchItem.StyleFamily"
args6(0).Value = 2
args6(1).Name = "SearchItem.CellType"
args6(1).Value = 0
args6(2).Name = "SearchItem.RowDirection"
args6(2).Value = true
args6(3).Name = "SearchItem.AllTables"
args6(3).Value = false
args6(4).Name = "SearchItem.SearchFiltered"
args6(4).Value = false
args6(5).Name = "SearchItem.Backward"
args6(5).Value = false
args6(6).Name = "SearchItem.Pattern"
args6(6).Value = false
args6(7).Name = "SearchItem.Content"
args6(7).Value = false
args6(8).Name = "SearchItem.AsianOptions"
args6(8).Value = false
args6(9).Name = "SearchItem.AlgorithmType"
args6(9).Value = 0
args6(10).Name = "SearchItem.SearchFlags"
args6(10).Value = 65536
args6(11).Name = "SearchItem.SearchString"
args6(11).Value = "?"
args6(12).Name = "SearchItem.ReplaceString"
args6(12).Value = "<ques>"
args6(13).Name = "SearchItem.Locale"
args6(13).Value = 255
args6(14).Name = "SearchItem.ChangedChars"
args6(14).Value = 2
args6(15).Name = "SearchItem.DeletedChars"
args6(15).Value = 2
args6(16).Name = "SearchItem.InsertedChars"
args6(16).Value = 2
args6(17).Name = "SearchItem.TransliterateFlags"
args6(17).Value = 1073743104
args6(18).Name = "SearchItem.Command"
args6(18).Value = 3
args6(19).Name = "SearchItem.SearchFormatted"
args6(19).Value = false
args6(20).Name = "SearchItem.AlgorithmType2"
args6(20).Value = 1
args6(21).Name = "Quiet"
args6(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args6())

rem ----------------------------------------------------------------------
rem The recorder left two empty dispatches here, where the "<ques><nl>" and
rem "!<nl>" passes belong.

rem ----------------------------------------------------------------------
dim args9(21) as new com.sun.star.beans.PropertyValue
args9(0).Name = "SearchItem.StyleFamily"
args9(0).Value = 2
args9(1).Name = "SearchItem.CellType"
args9(1).Value = 0
args9(2).Name = "SearchItem.RowDirection"
args9(2).Value = true
args9(3).Name = "SearchItem.AllTables"
args9(3).Value = false
args9(4).Name = "SearchItem.SearchFiltered"
args9(4).Value = false
args9(5).Name = "SearchItem.Backward"
args9(5).Value = false
args9(6).Name = "SearchItem.Pattern"
args9(6).Value = false
args9(7).Name = "SearchItem.Content"
args9(7).Value = false
args9(8).Name = "SearchItem.AsianOptions"
args9(8).Value = false
args9(9).Name = "SearchItem.AlgorithmType"
args9(9).Value = 1
args9(10).Name = "SearchItem.SearchFlags"
args9(10).Value = 65536
args9(11).Name = "SearchItem.SearchString"
args9(11).Value = "<ppara>"
args9(12).Name = "SearchItem.ReplaceString"
args9(12).Value = ".\n"
args9(13).Name = "SearchItem.Locale"
args9(13).Value = 255
args9(14).Name = "SearchItem.ChangedChars"
args9(14).Value = 2
args9(15).Name = "SearchItem.DeletedChars"
args9(15).Value = 2
args9(16).Name = "SearchItem.InsertedChars"
args9(16).Value = 2
args9(17).Name = "SearchItem.TransliterateFlags"
args9(17).Value = 1073743104
args9(18).Name = "SearchItem.Command"
args9(18).Value = 3
args9(19).Name = "SearchItem.SearchFormatted"
args9(19).Value = false
args9(20).Name = "SearchItem.AlgorithmType2"
args9(20).Value = 2
args9(21).Name = "Quiet"
args9(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args9())

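rem ----------------------------------------------------------------------
rem The recorder left only empty dispatches for the "?" and "!" passes, so
rem they are redone here by reusing earlier argument arrays.
rem args5 has regular expressions off; args9 has them on.
args5(11).Value = "<ques><nl>"
args5(12).Value = "<qpara>"
dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args5())
args5(11).Value = "!<nl>"
args5(12).Value = "<ipara>"
dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args5())
args9(11).Value = "<qpara>"
args9(12).Value = "?\n"
dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args9())
args9(11).Value = "<ipara>"
args9(12).Value = "!\n"
dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args9())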

rem ----------------------------------------------------------------------
dim args12(21) as new com.sun.star.beans.PropertyValue
args12(0).Name = "SearchItem.StyleFamily"
args12(0).Value = 2
args12(1).Name = "SearchItem.CellType"
args12(1).Value = 0
args12(2).Name = "SearchItem.RowDirection"
args12(2).Value = true
args12(3).Name = "SearchItem.AllTables"
args12(3).Value = false
args12(4).Name = "SearchItem.SearchFiltered"
args12(4).Value = false
args12(5).Name = "SearchItem.Backward"
args12(5).Value = false
args12(6).Name = "SearchItem.Pattern"
args12(6).Value = false
args12(7).Name = "SearchItem.Content"
args12(7).Value = false
args12(8).Name = "SearchItem.AsianOptions"
args12(8).Value = false
args12(9).Name = "SearchItem.AlgorithmType"
args12(9).Value = 1
args12(10).Name = "SearchItem.SearchFlags"
args12(10).Value = 65536
args12(11).Name = "SearchItem.SearchString"
args12(11).Value = "<nl>"
args12(12).Name = "SearchItem.ReplaceString"
args12(12).Value = " "
args12(13).Name = "SearchItem.Locale"
args12(13).Value = 255
args12(14).Name = "SearchItem.ChangedChars"
args12(14).Value = 2
args12(15).Name = "SearchItem.DeletedChars"
args12(15).Value = 2
args12(16).Name = "SearchItem.InsertedChars"
args12(16).Value = 2
args12(17).Name = "SearchItem.TransliterateFlags"
args12(17).Value = 1073743104
args12(18).Name = "SearchItem.Command"
args12(18).Value = 3
args12(19).Name = "SearchItem.SearchFormatted"
args12(19).Value = false
args12(20).Name = "SearchItem.AlgorithmType2"
args12(20).Value = 2
args12(21).Name = "Quiet"
args12(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args12())

rem ----------------------------------------------------------------------
dim args13(21) as new com.sun.star.beans.PropertyValue
args13(0).Name = "SearchItem.StyleFamily"
args13(0).Value = 2
args13(1).Name = "SearchItem.CellType"
args13(1).Value = 0
args13(2).Name = "SearchItem.RowDirection"
args13(2).Value = true
args13(3).Name = "SearchItem.AllTables"
args13(3).Value = false
args13(4).Name = "SearchItem.SearchFiltered"
args13(4).Value = false
args13(5).Name = "SearchItem.Backward"
args13(5).Value = false
args13(6).Name = "SearchItem.Pattern"
args13(6).Value = false
args13(7).Name = "SearchItem.Content"
args13(7).Value = false
args13(8).Name = "SearchItem.AsianOptions"
args13(8).Value = false
args13(9).Name = "SearchItem.AlgorithmType"
args13(9).Value = 1
args13(10).Name = "SearchItem.SearchFlags"
args13(10).Value = 65536
args13(11).Name = "SearchItem.SearchString"
args13(11).Value = "<period>"
args13(12).Name = "SearchItem.ReplaceString"
args13(12).Value = "."
args13(13).Name = "SearchItem.Locale"
args13(13).Value = 255
args13(14).Name = "SearchItem.ChangedChars"
args13(14).Value = 2
args13(15).Name = "SearchItem.DeletedChars"
args13(15).Value = 2
args13(16).Name = "SearchItem.InsertedChars"
args13(16).Value = 2
args13(17).Name = "SearchItem.TransliterateFlags"
args13(17).Value = 1073743104
args13(18).Name = "SearchItem.Command"
args13(18).Value = 3
args13(19).Name = "SearchItem.SearchFormatted"
args13(19).Value = false
args13(20).Name = "SearchItem.AlgorithmType2"
args13(20).Value = 2
args13(21).Name = "Quiet"
args13(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args13())

rem ----------------------------------------------------------------------
dim args14(21) as new com.sun.star.beans.PropertyValue
args14(0).Name = "SearchItem.StyleFamily"
args14(0).Value = 2
args14(1).Name = "SearchItem.CellType"
args14(1).Value = 0
args14(2).Name = "SearchItem.RowDirection"
args14(2).Value = true
args14(3).Name = "SearchItem.AllTables"
args14(3).Value = false
args14(4).Name = "SearchItem.SearchFiltered"
args14(4).Value = false
args14(5).Name = "SearchItem.Backward"
args14(5).Value = false
args14(6).Name = "SearchItem.Pattern"
args14(6).Value = false
args14(7).Name = "SearchItem.Content"
args14(7).Value = false
args14(8).Name = "SearchItem.AsianOptions"
args14(8).Value = false
args14(9).Name = "SearchItem.AlgorithmType"
args14(9).Value = 1
args14(10).Name = "SearchItem.SearchFlags"
args14(10).Value = 65536
args14(11).Name = "SearchItem.SearchString"
args14(11).Value = "<ques>"
args14(12).Name = "SearchItem.ReplaceString"
args14(12).Value = "?"
args14(13).Name = "SearchItem.Locale"
args14(13).Value = 255
args14(14).Name = "SearchItem.ChangedChars"
args14(14).Value = 2
args14(15).Name = "SearchItem.DeletedChars"
args14(15).Value = 2
args14(16).Name = "SearchItem.InsertedChars"
args14(16).Value = 2
args14(17).Name = "SearchItem.TransliterateFlags"
args14(17).Value = 1073743104
args14(18).Name = "SearchItem.Command"
args14(18).Value = 3
args14(19).Name = "SearchItem.SearchFormatted"
args14(19).Value = false
args14(20).Name = "SearchItem.AlgorithmType2"
args14(20).Value = 2
args14(21).Name = "Quiet"
args14(21).Value = true

dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args14())

end sub


***Installing the Macro into your menu
Note: the macro will be called removeOCRlinefeeds

I would recommend you put this macro on your menu. (I am using LibreOffice in Spanish, so the menu names may be a little different.)
1. Main menu option - Tools->Personalize (Customize in the English version)
ON LEFT SIDE
2. Click on the "Category" dropdown and select "Macros"
3. My Macros->Standard->Module1->removeOCRlinefeeds (highlight or click it)
ON RIGHT SIDE
1. Click the gear icon under Destination
2. At the bottom, click Insert and then Add submenu (name it "Macros")
3. Back at the top left-hand side, under Destination, click the dropdown and find the "Macros" menu at the bottom of the list. Click on that.
4. There are two arrows between the columns, one pointing left and one pointing right. Make sure removeOCRlinefeeds is selected in the left column, click the arrow pointing to the right, and then accept.
5. In OpenOffice or LibreOffice, look at the main menu. There should be a menu option named "Macros" at the far right. Click on it, and your macro "removeOCRlinefeeds" should be there. Now copy in some dummy text with the typical OCR line feeds at the end of each line, and run the macro to check it.
Last edited by davidcoxmex on Mon Sep 21, 2020 1:59 am, edited 1 time in total.
LibreOffice Version: 6.4.4.2 (x64) on Windows 10
davidcoxmex
 
Posts: 2
Joined: Tue Sep 08, 2020 3:36 pm

Re: Removing newline to correct OCR formatting problems

Post by Villeroy » Tue Sep 08, 2020 8:29 pm

Try to replace regex \n with nothing.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04, no OpenOffice, LibreOffice 6.4
Villeroy
Volunteer
 
Posts: 28650
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Removing newline to correct OCR formatting problems

Post by esperantisto » Wed Sep 09, 2020 4:49 pm

Sorry, but you are kinda reinventing the wheel. Try OOoFBTools. This extension is designed to handle OCR'd texts. There, select "Join broken lines/paragraphs" for automatic action or "Process ends of lines/paragraphs" for manual correction.
AOO 4.2.0 / LibO 6.x/7.x / Win 7 / openSUSE Linux Leap 15.1 (64-bit)
esperantisto
Volunteer
 
Posts: 524
Joined: Mon Oct 08, 2007 1:31 am

Re: Removing newline to correct OCR formatting problems

Post by RoryOF » Wed Sep 09, 2020 7:03 pm

I can do this with three or four passes of Find and Replace.
Apache OpenOffice 4.1.7 on Xubuntu 20.04.1 (mostly 64 bit version) and very infrequently on Win2K/XP
RoryOF
Moderator
 
Posts: 31540
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: [Info] Removing newline to correct OCR formatting

Post by davidcoxmex » Thu Sep 17, 2020 5:40 pm

Thanks. I downloaded the tool and will investigate how it works. Out of the gate, it looks like the interface is in Russian; I will see if it can be changed to English. But thanks nonetheless. I am correcting OCR texts, and I am noticing that apart from the line feed at the end of the lines, some have hyphenation problems: "complex" comes out as "com plex". So my macro addresses what I am finding, and I am correcting those situations: prefixes such as com, con, un, des, etc., and suffixes such as able, tion, etc. I would like a tool to do this for me. But whatever.
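
For the split-word problem, a rough sketch of one possible approach in Python: rejoin a prefix and the following word only when the joined result is a known word. Everything here is an assumption for illustration, including the function name rejoin_split_words and the tiny word list; a real run would need a full dictionary, and suffixes such as able or tion would need a second, similar pass.

```python
import re

# Hypothetical mini word list; a real run would load a full dictionary.
KNOWN_WORDS = {"complex", "unable", "connection"}
PREFIXES = ("com", "con", "un", "des")


def rejoin_split_words(text, known_words=KNOWN_WORDS, prefixes=PREFIXES):
    """Join 'prefix word' into one word when the result is a known word."""
    pattern = re.compile(
        r"\b(" + "|".join(prefixes) + r") (\w+)", re.IGNORECASE
    )

    def fix(match):
        joined = match.group(1) + match.group(2)
        # Leave the text alone unless the joined form is in the word list.
        return joined if joined.lower() in known_words else match.group(0)

    return pattern.sub(fix, text)


print(rejoin_split_words("a com plex idea, un able, com puter"))
```

Checking against a word list keeps the pass from gluing together two words that merely happen to start with one of the prefixes.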
LibreOffice Version: 6.4.4.2 (x64) on Windows 10
davidcoxmex
 
Posts: 2
Joined: Tue Sep 08, 2020 3:36 pm

