Character encoding issues using OCR (namely: docx2txt, odt2txt)
Posted: 19 Dec 2024, 11:32
First of all, thank you for the great work! It's a splendid piece of software.
To the point. I followed these instructions during installation:
https://documentation.opencats.org/inst ... -on-ubuntu
I don't know anything about coding, and I can't seem to get the encoding right when uploading files to a candidate profile, although I have had some success.
Mind you, all the OCR tools work just fine on the server when I check them. The encoding is fine (UTF-8) and all characters are correctly recognized. The database is set to utf8mb4.
With the default installation I got unreadable results for all MIME types, but I managed to get it right for pdftotext.
I edited DocumentToText.php:
Code:
    case DOCUMENT_TYPE_PDF:
        if (PDFTOTEXT_PATH == '')
        {
            $this->_setError('The PDF format has not been configured.');
            return false;
        }
        $nativeEncoding = 'ISO-8859-1';
        $convertEncoding = false;
        $command = '"'. PDFTOTEXT_PATH . '" -layout ' . $escapedFilename . ' -';
        break;
    case DOCUMENT_TYPE_HTML:
        if (HTML2TEXT_PATH == '')
        {
            $this->_setError('The HTML format has not been configured.');
            return false;
        }
        $nativeEncoding = 'ISO-8859-1';
        $convertEncoding = false;
        if (SystemUtility::isWindows())
        {
            $command = 'TYPE ' . $escapedFilename . ' | "'. HTML2TEXT_PATH . '" -nobs ';
        }
        else
        {
            $command = '"'. HTML2TEXT_PATH . '" -nobs ' . $escapedFilename;
        }
        break;
I simply changed all 'ISO-8859-1' to 'UTF-8'.
After that all PDFs began to be recognized perfectly.
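A minimal sketch of why the encoding label matters, assuming the extracted text is UTF-8 and is converted somewhere along the way as if it were ISO-8859-1. Whether OpenCATS does exactly this internally is an assumption, and `relabelAndConvert` is a hypothetical helper, not part of DocumentToText.php. It needs PHP's mbstring extension.

```php
<?php
// Hypothetical helper: convert $text to UTF-8, trusting $claimedEncoding
// as the encoding the bytes are currently in.
function relabelAndConvert(string $text, string $claimedEncoding): string
{
    return mb_convert_encoding($text, 'UTF-8', $claimedEncoding);
}

// "café" as UTF-8 bytes: the "é" is the two bytes C3 A9.
$utf8Text = "caf\xC3\xA9";

// Wrong label: each byte of the two-byte "é" is treated as a separate
// Latin-1 character and re-encoded, which yields "cafÃ©".
$garbled = relabelAndConvert($utf8Text, 'ISO-8859-1');

// Correct label: UTF-8 to UTF-8 leaves valid text unchanged.
$intact = relabelAndConvert($utf8Text, 'UTF-8');

echo bin2hex($garbled), "\n";      // 636166c383c2a9
var_dump($intact === $utf8Text);   // bool(true)
```

If the real code path behaves like the wrong-label case, that would be consistent with PDFs coming out right once the label was changed to 'UTF-8'.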
Now I can't seem to get it to work for DOCX and ODT.
The code for those:
Code:
    case DOCUMENT_TYPE_ODT:
        $this->_rawOutput = $this->odt2text($filename);
        if ( $this->_rawOutput == null )
        {
            return false;
        }
        $this->_linesArray = explode("\n", $this->_rawOutput);
        $this->_linesString = $this->_rawOutput;
        return true;
        break;
    case DOCUMENT_TYPE_DOCX:
        $this->_rawOutput = $this->docx2text($fileName);
        if ($this->_rawOutput == null)
        {
            return false;
        }
        $this->_linesArray = explode("\n", $this->_rawOutput);
        $this->_linesString = $this->_rawOutput;
        return true;
        break;
There's no obvious place for me to put 'UTF-8'.
And I tried to do something like this:
Code:
    case DOCUMENT_TYPE_DOCX:
        $nativeEncoding = 'UTF-8';
        $convertEncoding = false;
        $this->_rawOutput = $this->docx2text($fileName);
        if ($this->_rawOutput == null)
        {
            return false;
        }
        $this->_linesArray = explode("\n", $this->_rawOutput);
        $this->_linesString = $this->_rawOutput;
        return true;
        break;
But that changes nothing.
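One possible reason the DOCX change has no effect, sketched under assumptions: the DOCX and ODT cases return inside the `case`, so any code after the `switch` that would read `$nativeEncoding` is never reached for those types. The stand-alone sketch below mimics that control flow. `convertSketch`, `normalizeToUtf8`, and the type strings are hypothetical and not taken from OpenCATS, and normalizing the text right before the early return is only something to try, not a confirmed fix. It needs PHP's mbstring extension.

```php
<?php
// Hypothetical helper: return $text as UTF-8. Text that is already valid
// UTF-8 is passed through; anything else is assumed to be $fallback-encoded.
function normalizeToUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// Hypothetical stand-in for the switch in the post: the 'docx' branch
// returns early, so the code after the switch never runs for it.
function convertSketch(string $type, string $rawOutput): array
{
    switch ($type) {
        case 'docx':
            $nativeEncoding = 'UTF-8'; // assigned, but nothing reads it before the return
            return ['text' => normalizeToUtf8($rawOutput), 'postSwitchRan' => false];
        default:
            $nativeEncoding = 'UTF-8';
            break;
    }
    // Only reached by branches that end in "break" rather than "return".
    $text = mb_convert_encoding($rawOutput, 'UTF-8', $nativeEncoding);
    return ['text' => $text, 'postSwitchRan' => true];
}

$docx  = convertSketch('docx', "caf\xC3\xA9"); // valid UTF-8 input
$pdf   = convertSketch('pdf', "caf\xC3\xA9");
$latin = normalizeToUtf8("caf\xE9");           // Latin-1 bytes for "café"

var_dump($docx['postSwitchRan']);    // bool(false)
var_dump($pdf['postSwitchRan']);     // bool(true)
var_dump($latin === "caf\xC3\xA9");  // bool(true)
```

If this is what happens in the real file, the place to look would be inside docx2text() and odt2text() themselves, or right after they return, rather than the `$nativeEncoding` variable.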
I could really use some help, if at all possible. I've run out of ideas and don't think I can make it on my own.
I'll keep researching it, but I will be eternally grateful if you could guide me a bit.