Character encoding issues using OCR (namely: docx2txt, odt2txt)

Posted: 19 Dec 2024, 11:32
by MarcinP
First of all, thank you for the great work! It's a splendid piece of software.

To the point. I followed these instructions during installation:
https://documentation.opencats.org/inst ... -on-ubuntu

I don't know anything about coding, and I can't work out the encoding problem that occurs when uploading files to a candidate profile, although I've had some success.
Mind you, all the OCR tools work just fine on the server when I check them. Their encoding is fine (UTF-8) and all characters are correctly recognized. The database is set to utf8mb4.

With the default installation I got garbled results for all MIME types, but I managed to get it right for pdftotext.

I edited DocumentToText.php:
Code:
            case DOCUMENT_TYPE_PDF:
                if (PDFTOTEXT_PATH == '')
                {
                    $this->_setError('The PDF format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;
                $command = '"'. PDFTOTEXT_PATH . '" -layout ' . $escapedFilename . ' -';
                break;

            case DOCUMENT_TYPE_HTML:
                if (HTML2TEXT_PATH == '')
                {
                    $this->_setError('The HTML format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;

                if (SystemUtility::isWindows())
                {
                    $command = 'TYPE ' . $escapedFilename . ' | "'. HTML2TEXT_PATH . '" -nobs ';
                }
                else
                {
                    $command = '"'. HTML2TEXT_PATH . '" -nobs ' . $escapedFilename;
                }
                break;
I simply changed all 'ISO-8859-1' to 'UTF-8'.
After that all PDFs began to be recognized perfectly.
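
For anyone wondering why that one-word change matters, here is a minimal sketch. It assumes pdftotext already emits UTF-8 (as the server-side checks suggest) and that the 'ISO-8859-1' label caused those bytes to be treated as Latin-1 at some later step. The helper `relabelAsLatin1` is hypothetical and not part of OpenCATS; it only reproduces the symptom.

```php
<?php
// Hypothetical helper (not part of OpenCATS): treats the input bytes as
// ISO-8859-1 and re-encodes them as UTF-8. This is what happens when text
// that is already UTF-8 gets mislabeled as ISO-8859-1 and then "converted".
function relabelAsLatin1(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Sample UTF-8 text with non-ASCII letters, written with escapes so the
// example does not depend on how this file itself is saved.
$original = "\u{0141}\u{00F3}d\u{017A}";
$garbled  = relabelAsLatin1($original);

// Each original byte has been turned into a separate character, so the
// character count of the garbled text equals the byte count of the input.
echo strlen($original), "\n";
echo mb_strlen($garbled, 'UTF-8'), "\n";

// Undoing the wrong step restores the original bytes.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $original);
```

If the real cause is this kind of mislabeling, telling the code that the input is already UTF-8 removes the extra conversion, which would match what I saw with PDFs.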

Now I can't seem to get it to work for DOCX and ODT.

The code for those:
Code:
            case DOCUMENT_TYPE_ODT:
                $this->_rawOutput = $this->odt2text($filename);
                if ( $this->_rawOutput == null )
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;

            case DOCUMENT_TYPE_DOCX:
                $this->_rawOutput = $this->docx2text($fileName);
                if ($this->_rawOutput == null)
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;
There's no obvious place for me to put 'UTF-8'.

And I tried to do something like this:
Code:
            case DOCUMENT_TYPE_DOCX:

                $nativeEncoding = 'UTF-8';
                $convertEncoding = false; 

                $this->_rawOutput = $this->docx2text($fileName);
                if ($this->_rawOutput == null)
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;
But that changes nothing.
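
A possible explanation, judging only from the snippets quoted above: the DOCX and ODT cases `return true` from inside the `case`, so any `$nativeEncoding` or `$convertEncoding` assigned there is never read, whereas the PDF and HTML cases end in `break` and carry on to whatever code comes after the switch. The sketch below is a simplified, hypothetical stand-in for that control flow (the function name and return strings are invented for illustration), not the actual OpenCATS method.

```php
<?php
// Simplified stand-in for the switch quoted above. Names and return
// values are illustrative only and do not exist in OpenCATS.
function convertSketch(string $type): string
{
    $nativeEncoding = 'unset';

    switch ($type) {
        case 'pdf':
            $nativeEncoding = 'UTF-8';
            break;                      // continues below the switch

        case 'docx':
            $nativeEncoding = 'UTF-8';  // assigned here...
            return 'returned early';    // ...but never read: we leave at once
    }

    // Only cases that end in `break` reach code that uses the setting.
    return 'used encoding ' . $nativeEncoding;
}

echo convertSketch('pdf'), "\n";   // used encoding UTF-8
echo convertSketch('docx'), "\n";  // returned early
```

If that reading is right, any encoding handling for DOCX and ODT would have to be applied to `$this->_rawOutput` itself, before the `explode()` call.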

I could really use some help, if at all possible. I've run out of ideas and don't think I can solve this on my own.

I'll keep researching it, but I'd be eternally grateful if you could guide me a bit.

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt)

Posted: 19 Dec 2024, 11:57
by MarcinP
I've just tried the chat to no avail:
Code:
case DOCUMENT_TYPE_DOCX:
    $this->_rawOutput = mb_convert_encoding($this->docx2text($fileName), 'UTF-8', 'auto');
    if ($this->_rawOutput == null) {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;

case DOCUMENT_TYPE_ODT:
    $this->_rawOutput = mb_convert_encoding($this->odt2text($filename), 'UTF-8', 'auto');
    if ($this->_rawOutput == null) {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;

That does nothing.
I'll keep looking.
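
One way to narrow this down is to look at the raw bytes coming back from the extractor before anything else touches them. If the string is already valid UTF-8, a conversion to UTF-8 has nothing to change, which might be why the attempt above made no difference; the damage would then be happening somewhere else. The sketch below is a diagnostic rather than a fix: `describeEncoding` is a hypothetical helper, and the two sample strings merely stand in for whatever `docx2text()` or `odt2text()` returns.

```php
<?php
// Hypothetical diagnostic helper (not part of OpenCATS): reports whether
// a string is valid UTF-8 and shows its first bytes as hex.
function describeEncoding(string $text): string
{
    $label = mb_check_encoding($text, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';

    return $label . ' [' . bin2hex(substr($text, 0, 8)) . ']';
}

// The same four-letter word in two encodings.
$utf8   = "\u{017C}\u{00F3}\u{0142}w";  // UTF-8 bytes
$latin2 = "\xBF\xF3\xB3w";              // ISO-8859-2 bytes

echo describeEncoding($utf8), "\n";     // valid UTF-8 [c5bcc3b3c58277]
echo describeEncoding($latin2), "\n";   // not valid UTF-8 [bff3b377]
```

Logging the result of such a check right after the extractor call would show whether the text is already broken at that point or only later, for example when it is stored or displayed.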

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt)

Posted: 27 Feb 2025, 08:48
by clogstweed
Though I had some success, I can't manage to sort out the encoding when uploading data to the candidate profile. Mind you, when I check all the OCR tools on the server, they all run well.

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt)

Posted: 12 Jun 2025, 10:40
by MancyHenry
I'm having trouble with the encoding while uploading data to the candidate profile. However, all the OCR tools on the server are functioning correctly when I check them.

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt)

Posted: 26 Jun 2025, 11:19
by Lindberged
I've tried to do some updates with less success, but of course I'll continue.