Tidying OCRed text

Post by RoryOF » Fri Jun 26, 2020 10:27 am

Having OCRed a book and sorted most of the formatting problems with a few Find and Replace passes, I am left with one remaining problem: the transition from one page to another frequently leaves the first line of the next page as a new paragraph starting with a lowercase character. These I can find using OO's Find and Replace and a regular expression:

Find ^[:lower:], Match case checked, More options, Regular Expressions checked

Is there any way, using either Find and Replace or AltSearch, that I can Replace with
<space><found lower char>

that is, omitting the paragraph mark, replacing it with a space and the found lowercase character.

I can and have done such replacements by hand in the past; I ask out of curiosity.

Rory
Apache OpenOffice 4.1.10 on Xubuntu 20.04.2 (mostly 64 bit version) and very infrequently on Win2K/XP
RoryOF
Moderator
 
Posts: 32527
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Tidying OCRed text

Post by esperantisto » Fri Jun 26, 2020 10:35 am

Take a look at OOoFBTools and its Join broken lines/paragraphs feature.

P. S. Not an answer to your question, though.
AOO 4.2.0 / LibO 6.x/7.x / Win 7 / openSUSE Linux Leap 15.1 (64-bit)
esperantisto
Volunteer
 
Posts: 545
Joined: Mon Oct 08, 2007 1:31 am

Re: Tidying OCRed text

Post by Villeroy » Fri Jun 26, 2020 10:43 am

With match case and regex turned on:
Search: ^([:lower:])
Replace: _$1
where _ is a literal space
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
Villeroy
Volunteer
 
Posts: 29701
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Tidying OCRed text

Post by RoryOF » Fri Jun 26, 2020 10:59 am

@Villeroy: that replaces with the space and the found character, but leaves the paragraph mark in position. The () brackets gave me a reference to the found [:lower:] character, which I was lacking.
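For anyone wanting to see the same behaviour outside Writer, here is a rough Python analogue. It is only a sketch under assumptions: each paragraph is modelled as a newline-separated line, and [a-z] stands in for [:lower:], so it covers ASCII letters only. Python's re module is not Writer's regex engine, so treat it as an illustration, not a description of what Writer does internally.

```python
import re

# Assumption: each Writer paragraph is modelled as one newline-separated line.
text = "The end of page one\ncontinues here.\nNew paragraph."

# Rough analogue of Search ^([:lower:]) / Replace <space>$1.
# [a-z] stands in for [:lower:] and covers ASCII letters only.
result = re.sub(r"^([a-z])", r" \1", text, flags=re.MULTILINE)

# The space is inserted, but the "\n" (the paragraph mark) is still there.
print(repr(result))
```

The anchor ^ matches a position, not the break itself, so the replacement has nothing to remove.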

@esperantisto: I'll look at OOoFBTools later - going out to do Friday things now.

I can see a way of doing this with three F&R passes (I think); I'll come back with that later.

Re: Tidying OCRed text

Post by RoryOF » Fri Jun 26, 2020 11:12 am

Subject to testing on a large file, here is a method:

Find: $  Replace: %%%%  More options: Regular expressions checked. Replace All.

Find: %%%%([:lower:])  Replace: <space>$1  More options: Regular expressions checked, Match case checked. Replace All. (Match case checked is important!)

Find: %%%%  Replace: \n  More options: Regular expressions checked. Replace All.

%%%% is some character or sequence of characters that does not occur in the text. <space> is a literal space character.

 Edit: Tested on a 115K word file. Seems to work correctly subject to checking on proofreading and final layout. 
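The three passes can also be sketched in Python, for anyone processing the OCR output as plain text before it reaches Writer. This is a minimal sketch under assumptions: paragraphs are newline-separated, [a-z] stands in for [:lower:] (ASCII only), and the names MARKER, join_page_breaks and join_one_pass are made up for illustration.

```python
import re

MARKER = "%%%%"  # assumed not to occur anywhere in the text


def join_page_breaks(text: str) -> str:
    """Mimic the three Find & Replace passes on newline-separated text."""
    # Pass 1: replace every paragraph break with the marker.
    text = text.replace("\n", MARKER)
    # Pass 2: marker followed by a lowercase letter becomes space + letter.
    text = re.sub(re.escape(MARKER) + r"([a-z])", r" \1", text)
    # Pass 3: turn the remaining markers back into paragraph breaks.
    return text.replace(MARKER, "\n")


def join_one_pass(text: str) -> str:
    """Same result in one pass: drop a break only if a lowercase letter follows."""
    return re.sub(r"\n(?=[a-z])", " ", text)


sample = "The sentence runs over the\npage break here.\nA new paragraph starts."
print(join_page_breaks(sample))
```

In Python the lookahead version makes the marker unnecessary, because the pattern can match the line break directly. The marker detour is presumably needed in Writer because the paragraph mark has to be turned into ordinary text before a pattern can match it together with the following character.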