General discussion of OpenCATS

Moderators: RussH, cptr13

Forum rules: Just remember to play nicely once you walk through the door. You can disagree with us, or any other commenters in this forum, but keep comments directed to the topic at hand.
By zoomiest
#1135
I saw the reference that RussH made in the other forum here, about .docx formats.

So, I wanted to open this up to the group to see if anyone has found a solution for the .docx format. So far, I am just saving .docx files as .doc before uploading them, with a sign on my portal listing the acceptable file formats.

What is everybody else doing?
By RussH
#1137
Okay, had a quick look through the code and seems relatively simple -

you'd need to update config.php
http://subversion.assembla.com/svn/open ... config.php

and the Document to Text code that calls the conversion apps;
http://subversion.assembla.com/svn/open ... ToText.php

but that may be enough to get it working! Definitely one for inclusion in 0.9.2 :-)
By zoomiest
#1138
RussH,
Thanks.

Also, I don't have the skills to make the adjustments to the config.php file, or to the document conversion library, without screwing things up... How is progress on the new 0.9.2 version?

Should I wait for it, or should I be asking for help from some PHP developers to implement docx2txt?
By jpjanze
#1169
Hi Mabdalla, I have been trying to get this to work. Seems fairly straightforward. Problem is I am not a coder at heart. Can usually futz my way through to get things working but am stuck with this one.

I have docx2txt installed and working (linux). I know because I can call the perl script from command line and it converts to text (can get it to dump to file or dump to standard out).

I have edited config.php to define docx2txt

I have edited lib/DocumentToText.php so that the DOCX section actually does something. The problem I am having at this point is that when I try to upload a docx to a profile I get the same error message: "Unable to load your resume contents. Your resume will still be uploaded and attached to your application."

My bet is I have done something incorrectly, but can you point me to a likely place to start looking? Or how would I troubleshoot this? I don't know if this is logged somewhere that I can look at.

These are my additions to the DocumentToText file:
Code: Select all
case DOCUMENT_TYPE_DOCX:
                if (DOCX2TXT_PATH == '')
                {
                    $this->_setError('The DOCX format has not been configured, which is required for the DOCX format.');
                    return false;
                }
                $nativeEncoding = 'ISO-8859-1';
                $command = '"'. DOCX2TXT_PATH . '" '.$escapedFilename . ' -';
                break;
(If I can get this working I will document it for others)

Any and all help appreciated.

Cheers!
By mabdalla
#1216
OMG! I was just browsing the forum and found this question, which has been waiting for me since November! I'm so sorry, I only saw it just now. I had put a watch on this thread, but I did not learn about this question until now.

If you haven't solved this problem yet, let me know and I'll try to help out.

Regards,

-MA
By mabdalla
#1269
First you have to make sure you have Perl installed. Test it by typing perl at the prompt. If it's not installed, install it.

Then install docx2txt (from here http://docx2txt.sourceforge.net/) and put it in ~/catsbin/docx2txt/docx2txt.pl

Edit your config.php file to add the following (line 56 only):
Code: Select all
     53 define('ANTIWORD_PATH', "/usr/bin/antiword");
     54 define('ANTIWORD_MAP', '8859-1.txt');
     55
     56 define('DOCX2TXT_PATH','/path/to/cats/catsbin/docx2txt/docx2txt.pl');
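Before moving on, it can help to confirm that the path you just defined really points at the script. Here is a minimal sketch of such a check. The function name checkConverterScript is my own invention for illustration and is not part of OpenCATS, and the path in the example is the same placeholder used above.

```php
<?php
// Hypothetical helper, not part of OpenCATS: lists problems with a
// configured converter script path before CATS tries to use it.
function checkConverterScript($scriptPath)
{
    $problems = array();

    if ($scriptPath === '') {
        $problems[] = 'path is empty';
    } elseif (!is_file($scriptPath)) {
        $problems[] = 'no file at ' . $scriptPath;
    } elseif (!is_readable($scriptPath)) {
        $problems[] = 'file is not readable by the current user';
    }

    return $problems;
}

// Example: in a real install you would pass DOCX2TXT_PATH from config.php.
$path = '/path/to/cats/catsbin/docx2txt/docx2txt.pl';
foreach (checkConverterScript($path) as $problem) {
    echo 'DOCX2TXT_PATH: ' . $problem . "\n";
}
```

Keep in mind that the check runs as whoever starts PHP. A script that is readable from your own shell may still be unreadable for the web server user, so it is worth running the check under that account too.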
Then edit lib/DocumentToText.php (for example, "vi ./lib/DocumentToText.php").

Add this case after the DOCUMENT_TYPE_RTF case. It should look like this, but do not rely on my file line numbers, because we have edited this file heavily. I do not know what the original line numbers were.
Code: Select all
 
    177             case DOCUMENT_TYPE_DOCX:
    178                 if (DOCX2TXT_PATH == ''){
    179                     $this->_setError('The DOCX format has not been configured.');
    180                     return false;
    181                 }
    182
    183                 $nativeEncoding = 'UTF-8';
    184                 $command = 'perl '. DOCX2TXT_PATH . ' ' . $escapedFilename . ' -';
    185                 break;
This should do it. If you face trouble, let me know.
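If the upload still fails with "Unable to load your resume contents", one way to see what the converter itself is complaining about is to run the same command from a small standalone PHP script and print everything it returns. This is only a sketch under my own assumptions: runAndCapture is a made-up helper name, the sample file path is hypothetical, and it assumes a Linux shell where "2>&1" merges the error output into the normal output.

```php
<?php
// Hypothetical debugging helper, not part of OpenCATS: runs a shell command
// and returns its exit code together with the combined stdout and stderr.
function runAndCapture($command)
{
    $output = array();
    $exitCode = -1;

    // "2>&1" sends error messages to the same stream as normal output,
    // so Perl or docx2txt errors end up in $output as well.
    exec($command . ' 2>&1', $output, $exitCode);

    return array('exit' => $exitCode, 'output' => implode("\n", $output));
}

// Placeholder paths; replace them with your DOCX2TXT_PATH and a real file.
$script = '/path/to/cats/catsbin/docx2txt/docx2txt.pl';
$file = '/tmp/sample.docx';

$command = 'perl ' . escapeshellarg($script) . ' ' . escapeshellarg($file) . ' -';
$result = runAndCapture($command);

echo 'exit code: ' . $result['exit'] . "\n";
echo $result['output'] . "\n";
```

A non-zero exit code together with a message such as a missing file or a missing unzip program usually points at the real problem. Running the script as the web server user matters here too, since that account can have a different PATH and different permissions than your login.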

Regards,

-MA
By gmiller
#1286
We modified our CATS 0.8.0 to handle WordPerfect, Works, Docx, and ODT documents. On CentOS 5.x, we had to download and compile libwps to handle Works. We downloaded and compiled doctotext from silvercoders.com (GPL) to handle docx conversion. WordPerfect conversion is handled by libwpd (EPEL repo?). ODT conversion is handled by odt2txt (RPMForge repo).

We're currently working on teaching CATS how to handle (OCR) image-based PDF files. When we get that code ready, I will post it too.

Switch statement in DocumentToText.php:
Code: Select all
        /* Use different methods to extract text depending on the type of document. */
        switch ($documentType)
        {
            case DOCUMENT_TYPE_DOC:
                if (ANTIWORD_PATH == '')
                {
                    $this->_setError('The DOC format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $command = '"'. ANTIWORD_PATH . '" -m ' . ANTIWORD_MAP . ' '
                    . $escapedFilename;
                break;

            case DOCUMENT_TYPE_PDF:
                if (PDFTOTEXT_PATH == '')
                {
                    $this->_setError('The PDF format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;
                $command = '"'. PDFTOTEXT_PATH . '" -layout ' . $escapedFilename . ' -';
                break;

            case DOCUMENT_TYPE_HTML:
                if (HTML2TEXT_PATH == '')
                {
                    $this->_setError('The HTML format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;
                $command = '"'. HTML2TEXT_PATH . '" -nobs ' . $escapedFilename;
                break;

            case DOCUMENT_TYPE_TEXT:
                return $this->_readTextFile($fileName);
                break;

            case DOCUMENT_TYPE_RTF:
                if (HTML2TEXT_PATH == '')
                {
                    $this->_setError('The HTML format has not been configured, which is required for the RTF format.');
                    return false;
                }

                if (UNRTF_PATH == '')
                {
                    $this->_setError('The RTF format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $convertEncoding = false;
                $command = '"'. UNRTF_PATH . '" '.$escapedFilename.' | "'. HTML2TEXT_PATH . '" -nobs ';
                break;

            case DOCUMENT_TYPE_ODT:
                if (ODT2TXT_PATH == '')
                {
                    $this->_setError('The ODT format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $command = '"'. ODT2TXT_PATH . '" ' . $escapedFilename;
                break;

            case DOCUMENT_TYPE_DOCX:
                if (DOCTOTEXT_PATH == '')
                {
                    $this->_setError('The DOCX format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $command = '"'. DOCTOTEXT_PATH . '" ' . $escapedFilename;
                break;

            case DOCUMENT_TYPE_WPD:
                if (WPD2TEXT_PATH == '')
                {
                    $this->_setError('The WPD format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $command = '"'. WPD2TEXT_PATH . '" ' . $escapedFilename;
                break;

            case DOCUMENT_TYPE_WPS:
                if (WPS2TEXT_PATH == '')
                {
                    $this->_setError('The WPS format has not been configured.');
                    return false;
                }

                $nativeEncoding = 'ISO-8859-1';
                $command = '"'. WPS2TEXT_PATH . '" ' . $escapedFilename;
                break;

            case DOCUMENT_TYPE_UNKNOWN:
            default:
                $this->_setError('This file format is unknown and is not yet supported by CATS.');
                return false;
                break;
        }
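The snippets in this thread all rely on $escapedFilename, but none of them show how it is built, and resume file names often contain spaces or apostrophes. As a sketch only, here is one way to assemble a converter command with every argument quoted by PHP's escapeshellarg(). The helper name buildConverterCommand is hypothetical and this is not how OpenCATS itself builds the string.

```php
<?php
// Hypothetical helper, not OpenCATS code: builds a converter command line
// in which the program path and every argument are quoted for the shell.
function buildConverterCommand($converterPath, $fileName, array $extraArgs = array())
{
    $parts = array(escapeshellarg($converterPath));

    // Options such as "-m 8859-1.txt" go before the file name.
    foreach ($extraArgs as $arg) {
        $parts[] = escapeshellarg($arg);
    }

    $parts[] = escapeshellarg($fileName);

    return implode(' ', $parts);
}

// A file name with a space and an apostrophe still becomes one argument.
echo buildConverterCommand('/usr/bin/odt2txt', "/tmp/John O'Brien resume.odt") . "\n";
echo buildConverterCommand('/usr/bin/antiword', '/tmp/resume.doc', array('-m', '8859-1.txt')) . "\n";
```

Quoting the converter path as well means an install directory containing spaces does not split the command in two.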
Defines in FileUtility.php:
Code: Select all
define('DOCUMENT_TYPE_UNKNOWN', 0);
define('DOCUMENT_TYPE_PDF',     100);
define('DOCUMENT_TYPE_DOC',     200);
define('DOCUMENT_TYPE_RTF',     300);
define('DOCUMENT_TYPE_DOCX',    400);
define('DOCUMENT_TYPE_HTML',    500);
define('DOCUMENT_TYPE_ODT',     600);
define('DOCUMENT_TYPE_TEXT',    700);
define('DOCUMENT_TYPE_WPD',     800);
define('DOCUMENT_TYPE_WPS',     900);
getDocumentType function in FileUtility.php:
Code: Select all
    public static function getDocumentType($filename, $contentType = false)
    {
        $fileExtension = self::getFileExtension($filename);

        if ($contentType === 'text/plain' || $fileExtension == 'txt')
        {
            return DOCUMENT_TYPE_TEXT;
        }

        if ($contentType == 'application/rtf' || $contentType == 'text/rtf' ||
            $contentType == 'text/richtext' || $fileExtension == 'rtf')
        {
            return DOCUMENT_TYPE_RTF;
        }

        if ($contentType == 'application/msword' || $fileExtension == 'doc')
        {
            return DOCUMENT_TYPE_DOC;
        }

        if ($contentType == 'application/vnd.ms-word.document.12' ||
            $fileExtension == 'docx')
        {
            return DOCUMENT_TYPE_DOCX;
        }

        if ($contentType == 'application/pdf' || $fileExtension == 'pdf')
        {
            return DOCUMENT_TYPE_PDF;
        }

        if ($contentType === 'text/html' || $fileExtension == 'html' ||
            $fileExtension == 'htm')
        {
            return DOCUMENT_TYPE_HTML;
        }

        if ($contentType === 'application/vnd.oasis.opendocument.text' ||
            $contentType === 'application/x-vnd.oasis.opendocument.text' ||
            $fileExtension == 'odt')
        {
            return DOCUMENT_TYPE_ODT;
        }

        if ($contentType === 'application/wordperfect' || $fileExtension == 'wpd')
        {
            return DOCUMENT_TYPE_WPD;
        }

        if ($contentType === 'application/vnd.ms-works' ||
            $contentType === 'application/x-msworks-wp' ||
            $contentType === 'zz-application/zz-winassoc-wps' ||
            $fileExtension == 'wps')
        {
            return DOCUMENT_TYPE_WPS;
        }

        return DOCUMENT_TYPE_UNKNOWN;
    }
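One thing to watch in the DOCX branch: the registered media type for .docx files is application/vnd.openxmlformats-officedocument.wordprocessingml.document, which the function above does not test for, so a .docx upload is only recognised there through its file extension. The following standalone sketch accepts either content type. The function name looksLikeDocx is hypothetical and it is not a drop-in replacement for getDocumentType.

```php
<?php
// Hypothetical standalone check, not the OpenCATS function: treats a file as
// DOCX if either its extension or its reported content type says so.
function looksLikeDocx($filename, $contentType = false)
{
    $docxContentTypes = array(
        // Registered media type for Office Open XML word-processing files.
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        // Content type tested by the getDocumentType code in this post.
        'application/vnd.ms-word.document.12',
    );

    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    return $extension === 'docx'
        || in_array($contentType, $docxContentTypes, true);
}

var_dump(looksLikeDocx('resume.docx'));
var_dump(looksLikeDocx('resume.doc', 'application/msword'));
```

Checking the content type as well as the extension helps when a browser or upload form passes along a file whose name has lost its .docx suffix.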
Parser settings in Config.php:
Code: Select all
/* Text parser settings. Remember to use double backslashes (\\) to represent
 * one backslash (\). On Windows, installing in C:\antiword\ is
 * recommended, in which case you should set ANTIWORD_PATH (below) to
 * 'C:\\antiword\\antiword.exe'. Windows Antiword will have problems locating
 * mapping files if you install it anywhere but C:\antiword\.
 */
define('ANTIWORD_PATH', "/usr/bin/antiword");
define('ANTIWORD_MAP', '8859-1.txt');

/* XPDF / pdftotext settings. Remember to use double backslashes (\\) to represent
 * one backslash (\).
 * http://www.foolabs.com/xpdf/
 */
define('PDFTOTEXT_PATH', "/usr/bin/pdftotext");

/* html2text settings. Remember to use double backslashes (\\) to represent
 * one backslash (\). 'html2text' can be found at:
 * http://www.mbayer.de/html2text/
 */
define('HTML2TEXT_PATH', "/usr/bin/html2text");

/* UnRTF settings. Remember to use double backslashes (\\) to represent
 * one backslash (\). 'unrtf' can be found at:
 * http://www.gnu.org/software/unrtf/unrtf.html
 */
define('UNRTF_PATH', "/usr/bin/unrtf");

/* ODT2TXT settings. Remember to use double backslashes (\\) to represent
 * one backslash (\). 'odt2txt' can be found at:
 * http://stosberg.net/odt2txt/
 */
define('ODT2TXT_PATH', "/usr/bin/odt2txt");

/* DOCTOTEXT settings. Remember to use double backslashes (\\) to represent
 * one backslash (\). 'doctotext' can be found at:
 * http://sourceforge.net/projects/doctotext/
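 */
/* Placeholder path (an assumption); point this at your doctotext binary. */
define('DOCTOTEXT_PATH', "/path/to/doctotext");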

/* WPD2TEXT settings. Remember to use double backslashes (\\) to represent
 * one backslash (\). 'wpd2text' can be found at:
 * http://libwpd.sourceforge.net/
 */
define('WPD2TEXT_PATH', "/usr/bin/wpd2text");

/* WPS2TEXT settings. Remember to use double backslashes (\\) to represent
 * one backslash (\). 'wps2text' can be found at:
 * http://libwps.sourceforge.net/
 */
define('WPS2TEXT_PATH', "/usr/bin/wps2text");
By RussH
#1346
Hi gmiller, this looks great - care to contribute to the subversion repo?
