I have a TessBaseAPI() object holding a recognition result. I want to extract the words together with their bounding boxes, but I can't get it working.

val Text = tesseract.getUTF8Text()

gives me the text.

val Words = tesseract.getWords().boxRects

gives me bounding boxes that I can loop through, but they don't line up with the text from getUTF8Text().

Looping through the items returned by tesseract.getWords() and converting each one's data to a string gives me gibberish.

val Words = tesseract.getWords()
for (i in Words) {
    Log.i(TAG, i.data.toString())
}

I found a really bad workaround: call getHOCRText and run regexes over the generated markup to pull out the text and the boxes.

val result = tesseract.getHOCRText(0)

val BoxPattern = Pattern.compile("(?<=title='bbox ).*?(?=; x_wconf)")
val BoxMatch = BoxPattern.matcher(result)
while(BoxMatch.find()) {
    Log.i(TAG, BoxMatch.group().toString())
}

val TextPattern = Pattern.compile("(?<='>).*?(?=<\\/span>)")
val TextMatch = TextPattern.matcher(result)
while(TextMatch.find()) {
    Log.i(TAG, TextMatch.group().toString())
}
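
The two separate regex passes above yield boxes and text as unrelated streams. As a rough sketch only, the same workaround can capture both in a single match per word so that each box stays paired with its text. The `WordBox` class, the `parseHocrWords` helper, and the assumed span layout (`class='ocrx_word'` with a `title='bbox l t r b; ...'` attribute, inferred from the patterns above) are illustrative assumptions, not part of tess-two:

```kotlin
// Hypothetical holder for one recognized word and its bounding box.
data class WordBox(val text: String, val left: Int, val top: Int, val right: Int, val bottom: Int)

// Assumed hOCR word layout:
//   <span class='ocrx_word' ... title='bbox L T R B; x_wconf N'>text</span>
// Captures the four bbox integers and the span's inner content in one match.
val WORD_SPAN = Regex(
    """<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+);[^>]*>(.*?)</span>"""
)

// Strips any nested tags (for example <strong> or <em>) from the inner content.
val NESTED_TAG = Regex("<[^>]+>")

fun parseHocrWords(hocr: String): List<WordBox> =
    WORD_SPAN.findAll(hocr).map { match ->
        val (left, top, right, bottom, inner) = match.destructured
        WordBox(
            text = NESTED_TAG.replace(inner, ""),
            left = left.toInt(),
            top = top.toInt(),
            right = right.toInt(),
            bottom = bottom.toInt()
        )
    }.toList()
```

Calling `parseHocrWords(tesseract.getHOCRText(0))` would then return one entry per word. HTML entities in the text are left undecoded, and the pattern will silently miss words if the markup differs from the assumed layout.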

So, how can I properly extract the text and boxRects from tess-two?

I solved it!

// As before
val tesseract = TessBaseAPI()
tesseract.init("/storage/emulated/0/com.ubft/", "eng")
tesseract.setImage(bm)

// Read utF8Text first; otherwise getResultIterator() returns null
tesseract.utF8Text

// Initiate an iterator
val iterator = tesseract.getResultIterator()

iterator.begin()
do {
    val text = iterator.getUTF8Text(TessBaseAPI.PageIteratorLevel.RIL_TEXTLINE)
    val boundingBox = iterator.getBoundingRect(TessBaseAPI.PageIteratorLevel.RIL_TEXTLINE)

    // Do what you want with the result...

} while (iterator.next(TessBaseAPI.PageIteratorLevel.RIL_TEXTLINE))

iterator.delete()

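The TessBaseAPI.PageIteratorLevel argument selects the kind of text structure returned: RIL_BLOCK, RIL_PARA, RIL_TEXTLINE, RIL_WORD or RIL_SYMBOL (block, paragraph, line, word or single character). For the word-level boxes asked about in the question, use RIL_WORD.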
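
For word-level output specifically, the same loop can be wrapped in a small helper. This is a minimal sketch, assuming `tesseract` has already been initialized and given an image as in the answer; the `extractWords` name and the `Pair<String, Rect>` return shape are my own choices, not part of tess-two:

```kotlin
import android.graphics.Rect
import com.googlecode.tesseract.android.TessBaseAPI

// Hypothetical helper: collects every recognized word with its bounding box.
fun extractWords(tesseract: TessBaseAPI): List<Pair<String, Rect>> {
    val level = TessBaseAPI.PageIteratorLevel.RIL_WORD
    val words = mutableListOf<Pair<String, Rect>>()

    // Read the text first, as in the answer; otherwise the iterator may be null.
    tesseract.utF8Text

    val iterator = tesseract.resultIterator ?: return words
    iterator.begin()
    do {
        val text = iterator.getUTF8Text(level)
        val box = iterator.getBoundingRect(level)
        // Skip positions where no text was produced.
        if (!text.isNullOrEmpty()) {
            words.add(text to box)
        }
    } while (iterator.next(level))

    // Release the native iterator once done.
    iterator.delete()
    return words
}
```

Because text and box are read at the same iterator position, they stay paired, which avoids the mismatch between getWords().boxRects and getUTF8Text() described in the question.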
