Character encoding issues using OCR (namely: docx2txt, odt2txt)

Character encoding issues using OCR (namely: docx2txt, odt2txt) #5919

By MarcinP - 19 Dec 2024, 11:32

First of all, thank you for the great work! It's a splendid piece of software.

To the point. I followed these instructions during installation:
https://documentation.opencats.org/inst ... -on-ubuntu

I don't know anything about coding, and I can't work out the encoding when uploading files to a candidate profile, although I've had some success.
Mind you, all the OCR tools work just fine on the server when I check them. The encoding is fine (UTF-8) and all characters are correctly recognized. The database is set to utf8mb4.

With the default installation I got garbled results for all MIME types, but I managed to get it right for pdftotext.

I edited DocumentToText.php:

Code:

            case DOCUMENT_TYPE_PDF:
                if (PDFTOTEXT_PATH == '')
                {
                    $this->_setError('The PDF format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;
                $command = '"'. PDFTOTEXT_PATH . '" -layout ' . $escapedFilename . ' -';
                break;

            case DOCUMENT_TYPE_HTML:
                if (HTML2TEXT_PATH == '')
                {
                    $this->_setError('The HTML format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;

                if (SystemUtility::isWindows())
                {
                    $command = 'TYPE ' . $escapedFilename . ' | "'. HTML2TEXT_PATH . '" -nobs ';
                }
                else
                {
                    $command = '"'. HTML2TEXT_PATH . '" -nobs ' . $escapedFilename;
                }
                break;

The code above is the original; I simply changed every 'ISO-8859-1' to 'UTF-8'.
After that, all PDFs were recognized perfectly.
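
A minimal sketch of the kind of garbling that a wrong source-encoding label can produce. This is not OpenCATS code: it assumes the PHP mbstring extension is available, the sample word is only an illustration, and whether DocumentToText.php performs exactly this conversion with $nativeEncoding is an assumption. When UTF-8 text is labelled as ISO-8859-1 and converted to UTF-8 again, each non-ASCII character turns into two wrong ones:

```php
<?php
// "Łódź" written with code point escapes, so the bytes are UTF-8
// regardless of how this file itself is saved.
$original = "\u{141}\u{F3}d\u{17A}";

// Mislabel the UTF-8 bytes as ISO-8859-1 and convert them to UTF-8.
// Every byte >= 0x80 is re-encoded as a two-byte sequence.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reversing the mistaken step restores the original bytes.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n"; // hex dump of the correct UTF-8 bytes
echo bin2hex($garbled), "\n";  // hex dump of the double-encoded bytes
var_dump($restored === $original);
```

With the label set to 'UTF-8' no such re-encoding would take place, which would be consistent with the PDF result described above.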

Now I can't seem to get it to work for DOCX and ODT.

The code for those:

Code:

            case DOCUMENT_TYPE_ODT:
                $this->_rawOutput = $this->odt2text($filename);
                if ( $this->_rawOutput == null )
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;

            case DOCUMENT_TYPE_DOCX:
                $this->_rawOutput = $this->docx2text($fileName);
                if ($this->_rawOutput == null)
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;

There's no obvious place for me to put 'UTF-8'.

So I tried something like this:

Code:

            case DOCUMENT_TYPE_DOCX:

                $nativeEncoding = 'UTF-8';
                $convertEncoding = false;

                $this->_rawOutput = $this->docx2text($fileName);
                if ($this->_rawOutput == null)
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;

But that changes nothing.
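
One possible explanation, judging only from the snippets above: the PDF and HTML branches end in break, so whatever code follows the switch can still read $nativeEncoding, while the DOCX and ODT branches return true straight away. A variable assigned just before an early return is never read, and a local variable is not visible inside docx2text() either. The sketch below uses a hypothetical function, convertSketch(), rather than the real class:

```php
<?php
/**
 * Hypothetical stand-in for the switch in DocumentToText.php.
 * It only demonstrates control flow; it converts nothing.
 */
function convertSketch(string $type): string
{
    $nativeEncoding = 'ISO-8859-1';

    switch ($type) {
        case 'pdf':
            $nativeEncoding = 'UTF-8';
            break; // leaves the switch, so the shared code below runs

        case 'docx':
            $nativeEncoding = 'UTF-8'; // assigned, but never read
            return 'docx: returned early';
    }

    // Shared code after the switch: the only place the variable is used.
    return "shared code ran with $nativeEncoding";
}

echo convertSketch('pdf'), "\n";
echo convertSketch('docx'), "\n";
```

If that reading is right, any encoding fix for DOCX and ODT would have to happen inside docx2text() and odt2text(), or on their return value.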

I could really use some help, if at all possible. I've run out of ideas and don't think I can manage this on my own.

I'll keep researching it, but I'd be eternally grateful if you could guide me a bit.

Last edited by MarcinP on 19 Dec 2024, 13:12, edited 1 time in total.

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt) #5920

By MarcinP - 19 Dec 2024, 11:57

I've just tried the chat to no avail:

Code:

case DOCUMENT_TYPE_DOCX:
    $this->_rawOutput = mb_convert_encoding($this->docx2text($fileName), 'UTF-8', 'auto');
    if ($this->_rawOutput == null) {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;

case DOCUMENT_TYPE_ODT:
    $this->_rawOutput = mb_convert_encoding($this->odt2text($filename), 'UTF-8', 'auto');
    if ($this->_rawOutput == null) {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;

That does nothing.
I'll keep looking.
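
Since mb_convert_encoding() with 'auto' has no visible effect, it may help to find out what the converter output actually contains before converting anything. The helper below is a hypothetical diagnostic, not part of OpenCATS; it assumes the mbstring extension and uses a simple heuristic to tell plain UTF-8 from UTF-8 that has been encoded twice:

```php
<?php
/**
 * Hypothetical diagnostic helper (not part of OpenCATS).
 * Classifies a string as 'ascii', 'utf-8', 'double-encoded utf-8'
 * or 'not utf-8'.
 */
function describeEncoding(string $text): string
{
    if (mb_check_encoding($text, 'ASCII')) {
        return 'ascii';
    }
    if (!mb_check_encoding($text, 'UTF-8')) {
        return 'not utf-8';
    }

    // Heuristic: undo one UTF-8 encoding step. If the result is still
    // valid, non-ASCII UTF-8, the text was probably encoded twice.
    $decodedOnce = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    if (mb_check_encoding($decodedOnce, 'UTF-8')
        && !mb_check_encoding($decodedOnce, 'ASCII')) {
        return 'double-encoded utf-8';
    }
    return 'utf-8';
}

$sample = "\u{141}\u{F3}d\u{17A}"; // "Łódź" as UTF-8 bytes
echo describeEncoding($sample), "\n";
echo describeEncoding(mb_convert_encoding($sample, 'UTF-8', 'ISO-8859-1')), "\n";
echo describeEncoding("\xA3\xF3d\xBC"), "\n"; // the same word as ISO-8859-2 bytes
```

Calling it on the return value of docx2text() or odt2text() and logging the result, together with bin2hex() of a short excerpt, would show whether the text is already damaged when it leaves the converter or only at a later step.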

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt) #6032

By clogstweed - 27 Feb 2025, 08:48

Though I had some success, I can't sort out the encoding when uploading data to a candidate profile. Mind you, when I check the OCR tools on the server, they all run well.

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt) #6203

By MancyHenry - 12 Jun 2025, 10:40

I'm having trouble with the encoding while uploading data to the candidate profile. However, all the OCR tools on the server are functioning correctly when I check them.

Re: Character encoding issues using OCR (namely: docx2txt, odt2txt) #6284

By Lindberged - 26 Jun 2025, 11:19

I've tried to do some updates with less success, but of course I'll continue.