- 19 Dec 2024, 11:32
#5919
First of all, thank you for the great work! It's a splendid piece of software.
To the point. I followed these instructions during installation:
https://documentation.opencats.org/inst ... -on-ubuntu
I don't know anything about coding, and I can't work out the encoding problem that occurs when uploading files to a candidate profile, although I've had some success.
Mind you, all the OCR tools work just fine on the server when I check them. The encoding is fine (UTF-8) and all characters are correctly recognized. The database is set to utf8mb4.
With the default installation I got unreadable results for all MIME types, but I managed to get it right for pdftotext.
I edited DocumentToText.php:
Code:
case DOCUMENT_TYPE_PDF:
    if (PDFTOTEXT_PATH == '')
    {
        $this->_setError('The PDF format has not been configured.');
        return false;
    }
    $nativeEncoding = 'ISO-8859-1';
    $convertEncoding = false;
    $command = '"'. PDFTOTEXT_PATH . '" -layout ' . $escapedFilename . ' -';
    break;
case DOCUMENT_TYPE_HTML:
    if (HTML2TEXT_PATH == '')
    {
        $this->_setError('The HTML format has not been configured.');
        return false;
    }
    $nativeEncoding = 'ISO-8859-1';
    $convertEncoding = false;
    if (SystemUtility::isWindows())
    {
        $command = 'TYPE ' . $escapedFilename . ' | "'. HTML2TEXT_PATH . '" -nobs ';
    }
    else
    {
        $command = '"'. HTML2TEXT_PATH . '" -nobs ' . $escapedFilename;
    }
    break;
I simply changed all 'ISO-8859-1' to 'UTF-8'.
After that all PDFs began to be recognized perfectly.
Now I can't seem to get it to work for DOCX and ODT.
The code for those:
Code:
case DOCUMENT_TYPE_ODT:
    $this->_rawOutput = $this->odt2text($filename);
    if ( $this->_rawOutput == null )
    {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;
    break;
case DOCUMENT_TYPE_DOCX:
    $this->_rawOutput = $this->docx2text($fileName);
    if ($this->_rawOutput == null)
    {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;
    break;
There's no obvious place for me to put 'UTF-8'.
So I tried something like this:
Code:
case DOCUMENT_TYPE_DOCX:
    $nativeEncoding = 'UTF-8';
    $convertEncoding = false;
    $this->_rawOutput = $this->docx2text($fileName);
    if ($this->_rawOutput == null)
    {
        return false;
    }
    $this->_linesArray = explode("\n", $this->_rawOutput);
    $this->_linesString = $this->_rawOutput;
    return true;
    break;
But that changes nothing.
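Going only by the snippet above, the DOCX branch hits `return true;` before any later code could read `$nativeEncoding`, which may be why setting it there has no visible effect. One way to narrow things down would be to check whether the text coming back from the extractor is already valid UTF-8. The following is a minimal sketch and not part of OpenCATS: the helper name `describeEncoding` is hypothetical, and it assumes PHP 7 or later with the mbstring extension enabled.

```php
<?php
// Hypothetical diagnostic helper (not an OpenCATS function).
// Reports basic facts about the bytes in a string so that
// already-valid UTF-8 can be told apart from a single-byte encoding.
function describeEncoding(string $text): array
{
    return [
        // True when the bytes form a valid UTF-8 sequence.
        'valid_utf8' => mb_check_encoding($text, 'UTF-8'),
        // Byte count versus character count; the two differ
        // when multibyte characters are present.
        'bytes' => strlen($text),
        'chars' => mb_strlen($text, 'UTF-8'),
        // Hex dump of the first 16 bytes for manual inspection.
        'head_hex' => bin2hex(substr($text, 0, 16)),
    ];
}

// The same word encoded two ways: "café" as UTF-8 and as ISO-8859-1.
$asUtf8   = "caf\xC3\xA9";
$asLatin1 = "caf\xE9";

var_dump(describeEncoding($asUtf8)['valid_utf8']);
var_dump(describeEncoding($asLatin1)['valid_utf8']);
```

If the extractor output passes this check but the text is still garbled on screen, the problem would more likely sit in a later step (storage or display) than in the DOCX or ODT extraction itself.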
I could really use some help, if at all possible. I've run out of ideas and don't think I can manage this on my own.
I'll keep researching it, but I would be eternally grateful if you could guide me a bit.
Last edited by MarcinP on 19 Dec 2024, 13:12, edited 1 time in total.