#5919
First of all, thank you for the great work! It's a splendid piece of software.

To the point. I followed these instructions during installation:
https://documentation.opencats.org/inst ... -on-ubuntu

I don't know anything about coding, and I can't get the text encoding right when uploading files to a candidate profile, although I've had a bit of success.
Mind you, all the OCR tools work just fine on the server when I check them. The encoding is fine (UTF-8) and all characters are correctly recognized. The database is set to utf8mb4.

With the default installation I got garbled results for all MIME types, but I managed to get it right for pdftotext.

I edited DocumentToText.php:
Code:
            case DOCUMENT_TYPE_PDF:
                if (PDFTOTEXT_PATH == '')
                {
                    $this->_setError('The PDF format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;
                $command = '"'. PDFTOTEXT_PATH . '" -layout ' . $escapedFilename . ' -';
                break;

            case DOCUMENT_TYPE_HTML:
                if (HTML2TEXT_PATH == '')
                {
                    $this->_setError('The HTML format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;

                if (SystemUtility::isWindows())
                {
                    $command = 'TYPE ' . $escapedFilename . ' | "'. HTML2TEXT_PATH . '" -nobs ';
                }
                else
                {
                    $command = '"'. HTML2TEXT_PATH . '" -nobs ' . $escapedFilename;
                }
                break;
I simply changed all 'ISO-8859-1' to 'UTF-8'.
After that, all PDFs were recognized perfectly.
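
For anyone wondering why the declared source encoding can matter this much, here is a minimal sketch of one way this kind of garbling arises. It is an illustration only and makes no claim about what DocumentToText.php does internally. It assumes the mbstring extension is available, and the sample bytes are made up for the example.

```php
<?php
// Illustration only: a made-up sample, not taken from OpenCATS.
$utf8Text = "\xC5\xBC"; // the letter "ż" (U+017C) encoded as UTF-8

// Wrong assumption: treat the UTF-8 bytes as ISO-8859-1 and convert to UTF-8.
// Each of the two bytes is re-encoded as its own character, so the letter
// turns into two unrelated characters.
$garbled = mb_convert_encoding($utf8Text, 'UTF-8', 'ISO-8859-1');

// Right assumption: the source is already UTF-8, so the bytes stay the same.
$intact = mb_convert_encoding($utf8Text, 'UTF-8', 'UTF-8');

echo bin2hex($garbled), PHP_EOL; // four bytes instead of two
echo bin2hex($intact), PHP_EOL;  // the original two bytes
```

If the real cause is the same, that would be consistent with the PDF output becoming readable once the tool's output is declared as UTF-8.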

Now I can't seem to get it to work for DOCX and ODT.

The code for those:
Code:
case DOCUMENT_TYPE_ODT:
                $this->_rawOutput = $this->odt2text($filename);
                if ( $this->_rawOutput == null )
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;

            case DOCUMENT_TYPE_DOCX:
                $this->_rawOutput = $this->docx2text($fileName);
                if ($this->_rawOutput == null)
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;
There's no obvious place for me to put 'UTF-8'.

So I tried something like this:
Code:
 case DOCUMENT_TYPE_DOCX:

                $nativeEncoding = 'UTF-8';
                $convertEncoding = false; 

                $this->_rawOutput = $this->docx2text($fileName);
                if ($this->_rawOutput == null)
                {
                    return false;
                }
                $this->_linesArray = explode("\n", $this->_rawOutput);
                $this->_linesString = $this->_rawOutput;

                return true;
                break;
But that changes nothing.
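
One possible explanation, judging only from the snippets in this post: in the DOCX branch the two variables are assigned but never handed to docx2text(), and the branch returns before any later code could read them. The sketch below shows that pattern with made-up names. convertSketch() and extractSketch() are hypothetical stand-ins, not OpenCATS functions, and the shared code after the switch is an assumption about how the PDF branch ends up using its settings.

```php
<?php
// Hypothetical stand-in for an extractor such as docx2text(). It only sees
// its own argument, never the caller's local variables.
function extractSketch(string $fileName): string
{
    return 'raw text from ' . $fileName;
}

// Hypothetical stand-in for the switch shown in the post.
function convertSketch(string $fileName, string $type): string
{
    switch ($type) {
        case 'docx':
            $nativeEncoding = 'UTF-8';  // assigned, but never read in this branch
            $convertEncoding = false;   // same here
            return extractSketch($fileName); // leaves the function immediately

        default:
            $nativeEncoding = 'ISO-8859-1';
            break;
    }

    // Only branches that end in break reach this point (assumed shared code).
    return 'converted using ' . $nativeEncoding;
}

echo convertSketch('cv.docx', 'docx'), PHP_EOL;
echo convertSketch('cv.pdf', 'pdf'), PHP_EOL;
```

If that reading is right, any encoding handling for DOCX and ODT would have to happen on the string that docx2text() or odt2text() returns, or inside those functions.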

I could really use some help, if at all possible. I've run out of ideas and don't think I can solve this on my own.

I'll keep researching it, but I'd be eternally grateful if you could guide me a bit.
Last edited by MarcinP on 19 Dec 2024, 13:12, edited 1 time in total.
#5920
I've just tried the chat to no avail:
Code:
case DOCUMENT_TYPE_DOCX:
    $this->_rawOutput = mb_convert_encoding($this->docx2text($fileName), 'UTF-8', 'auto');
    if ($this->_rawOutput == null) {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;

case DOCUMENT_TYPE_ODT:
    $this->_rawOutput = mb_convert_encoding($this->odt2text($filename), 'UTF-8', 'auto');
    if ($this->_rawOutput == null) {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;

That does nothing.
I'll keep looking.
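
A possible next step is to check what bytes the extractor actually returns. If the text is already valid UTF-8, converting it to UTF-8 again would be expected to leave it unchanged, which may be why the attempt above has no visible effect. The helper below is a small diagnostic sketch. describeEncoding() is a hypothetical name, the sample bytes are made up, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: reports whether a string is valid UTF-8 and shows
// its first bytes in hex so the raw encoding can be inspected.
function describeEncoding(string $text): string
{
    $valid = mb_check_encoding($text, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return $valid . ', first bytes: ' . bin2hex(substr($text, 0, 8));
}

$utf8   = "\xC5\xBC"; // the letter "ż" encoded as UTF-8
$latin2 = "\xBF";     // the same letter in ISO-8859-2, which is not valid UTF-8

echo describeEncoding($utf8), PHP_EOL;
echo describeEncoding($latin2), PHP_EOL;
```

Running the same check on the value returned by docx2text() or odt2text(), for example by writing the result to a log file, would show whether the text is already garbled at that point or only later, when it is stored or displayed.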