By mabdalla
These instructions apply to a SUSE Linux installation of CATS where the CATS package has been installed in the directory /srv/www/htdocs/cats. If your installation differs, you will need to make appropriate adjustments to the paths referenced in this documentation.



Basic Requirements
  • Installed and operational CATS system
  • Functional cron
  • LSB-compatible init.d for run control


What it Does
The Sphinx package consists of two primary parts: the indexer, which creates the search indexes and is run periodically to rebuild them, and the searchd daemon, which answers the queries issued through the sphinxapi.php library.

The CATS integration design calls for a primary index, cats, that is rebuilt once per day via cron.daily. This daily rebuild picks up all candidates/resumes in the database and completely reindexes the resume text, key skills, and each candidate’s first and last names. It also resets the sph_counter table to a new high-water mark.

A second delta index, catsdelta, handles additions to the database during the business day via a cron script that rebuilds only the delta index, based on the high-water mark set at the prior run of the primary index. It is envisioned that this script would run every 20 or 30 minutes during the business day to keep the delta index up to date with recent additions to the database.
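The high-water mark itself is just a single row in a small sph_counter table (created later in these instructions). If you ever want to see where the main/delta boundary currently sits, something like the following works (substitute your own CATS database credentials for the placeholders):
Code: Select all
# Show the attachment_id recorded by the last full index run;
# the catsdelta index only picks up attachments above this value.
mysql -u <catsuser> -p<catspass> cats \
    -e "SELECT counter_id, max_doc_id FROM sph_counter WHERE counter_id = 1;"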

Installation Instructions
Install Sphinx
  • Download, build, and install Sphinx 0.9.7-RC2. A successful install places the following files on the system:
Code: Select all
/usr/local/bin/indexer
/usr/local/bin/searchd
/usr/local/man/man8/searchd.8.gz
/usr/local/etc/sphinx.conf.dist
/usr/share/doc/packages/sphinx-0.9.7-rc2
/usr/share/doc/packages/sphinx-0.9.7-rc2/COPYING
/usr/share/doc/packages/sphinx-0.9.7-rc2/doc
/usr/share/doc/packages/sphinx-0.9.7-rc2/doc/mk.cmd
/usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.css
/usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.html
/usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.txt
/usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.xml
/usr/share/doc/packages/sphinx-0.9.7-rc2/INSTALL
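If you are building from the source tarball, it is a standard autoconf build, along these lines (a sketch; adjust the tarball name and location to whatever you downloaded from the Sphinx site):
Code: Select all
# Build and install Sphinx 0.9.7-RC2 from source (run as root)
tar xzf sphinx-0.9.7-rc2.tar.gz
cd sphinx-0.9.7-rc2
./configure --prefix=/usr/local
make
make install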
Copy sphinxapi.php
  • Copy the api/sphinxapi.php file to /srv/www/htdocs/cats/lib/sphinxapi.php
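For example (assuming the Sphinx source tree was unpacked in the current directory; adjust the path to wherever you built Sphinx):
Code: Select all
cp sphinx-0.9.7-rc2/api/sphinxapi.php /srv/www/htdocs/cats/lib/sphinxapi.php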
Create Indexer Cron Script
  • Create the following cron script as /etc/cron.daily/indexer to run the indexer on a daily basis.
Code: Select all
#!/bin/sh
/usr/local/bin/indexer --all --rotate --config /srv/www/htdocs/cats/modules/search/sphinx.conf
  • chown root:root /etc/cron.daily/indexer
  • chmod 700 /etc/cron.daily/indexer

Create Searchd init.d Script
  • Create an /etc/init.d/searchd script for the searchd daemon. The example below works well for a SUSE installation; you may need to alter it for non-SUSE distributions.
Code: Select all
#! /bin/sh
# Copyright (c) 1995-2004 SUSE Linux AG, Nuernberg, Germany.
# All rights reserved.
#
# Author: Kurt Garloff
# Please send feedback to http://www.suse.de/feedback/
#
# /etc/init.d/searchd
#   and its symbolic link
# /(usr/)sbin/rcsearchd
#
#    This program is free software; you can redistribute it and/or modify 
#    it under the terms of the GNU General Public License as published by 
#    the Free Software Foundation; either version 2 of the License, or 
#    (at your option) any later version. 
# 
#    This program is distributed in the hope that it will be useful, 
#    but WITHOUT ANY WARRANTY; without even the implied warranty of 
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
#    GNU General Public License for more details. 
# 
#    You should have received a copy of the GNU General Public License 
#    along with this program; if not, write to the Free Software 
#    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
#
### BEGIN INIT INFO
# Provides:          searchd for sphinx
# Required-Start:    $syslog $remote_fs mysql
# Should-Start: $time ypbind sendmail
# Required-Stop:     $syslog $remote_fs
# Should-Stop: $time ypbind sendmail
# Default-Start:     3 5
# Default-Stop:      0 1 2 6
# Short-Description: searchd daemon for sphinx search
# Description:       Starts the Sphinx searchd daemon
### END INIT INFO
# 

# Check for missing binaries (stale symlinks should not happen)
# Note: Special treatment of stop for LSB conformance
LOGFILE=/var/log/searchd.log
SEARCHD=/usr/local/bin/searchd

test -x $SEARCHD || { echo "$SEARCHD not installed"; 
	if [ "$1" = "stop" ]; then exit 0;
	else exit 5; fi; }


# Source LSB init functions
. /etc/rc.status

# Reset status of this service
rc_reset


case "$1" in
    start)
	echo -n "Starting $SEARCHD "
	## Start daemon with startproc(8). If this fails
	## the return value is set appropriately by startproc.
	startproc -l $LOGFILE $SEARCHD --config /srv/www/htdocs/cats/modules/search/sphinx.conf

	# Remember status and be verbose
	rc_status -v
	;;
    stop)
	echo -n "Shutting down $SEARCHD "
	## Stop daemon with killproc(8) and if this fails
	## killproc sets the return value according to LSB.

	killproc -TERM $SEARCHD

	# Remember status and be verbose
	rc_status -v
	;;
    try-restart|condrestart)
	## Do a restart only if the service was active before.
	## Note: try-restart is now part of LSB (as of 1.9).
	## RH has a similar command named condrestart.
	if test "$1" = "condrestart"; then
		echo "${attn} Use try-restart ${done}(LSB)${attn} rather than condrestart ${warn}(RH)${norm}"
	fi
	$0 status
	if test $? = 0; then
		$0 restart
	else
		rc_reset	# Not running is not a failure.
	fi
	# Remember status and be quiet
	rc_status
	;;
    restart)
	## Stop the service and regardless of whether it was
	## running or not, start it again.
	$0 stop
	$0 start

	# Remember status and be quiet
	rc_status
	;;
    force-reload)
	## Signal the daemon to reload its config. Most daemons
	## do this on signal 1 (SIGHUP).
	## If it does not support it, restart.

	echo -n "Reload service $SEARCHD "
	## if it supports it:
	killproc -HUP $SEARCHD
	rc_status -v

	## Otherwise:
	#$0 try-restart
	#rc_status
	;;
    reload)
	## Like force-reload, but if daemon does not support
	## signaling, do nothing (!)

	# If it supports signaling:
	echo -n "Reload service $SEARCHD "
	killproc -HUP $SEARCHD
	rc_status -v
	
	## Otherwise if it does not support reload:
	#rc_failed 3
	#rc_status -v
	;;
    status)
	echo -n "Checking for service $SEARCHD "
	## Check status with checkproc(8), if process is running
	## checkproc will return with exit status 0.

	# Return value is slightly different for the status command:
	# 0 - service up and running
	# 1 - service dead, but /var/run/  pid  file exists
	# 2 - service dead, but /var/lock/ lock file exists
	# 3 - service not running (unused)
	# 4 - service status unknown :-(
	# 5--199 reserved (5--99 LSB, 100--149 distro, 150--199 appl.)
	
	# NOTE: checkproc returns LSB compliant status values.
	checkproc $SEARCHD
	# NOTE: rc_status knows that we called this init script with
	# "status" option and adapts its messages accordingly.
	rc_status -v
	;;
    *)
	echo "Usage: $0 {start|stop|status|try-restart|restart|force-reload|reload|probe}"
	exit 1
	;;
esac
rc_exit
  • Save the searchd init.d script to /etc/init.d/searchd, make it executable, then install it as a service using insserv searchd.
  • Create a run control softlink for the init.d script:
Code: Select all
ln -s /etc/init.d/searchd /usr/sbin/rcsearchd
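Making the script executable and registering it as a service (the first bullet above) might look like this; insserv is SUSE-specific, so substitute your distribution's equivalent (chkconfig, update-rc.d) if needed:
Code: Select all
chmod 755 /etc/init.d/searchd
insserv searchd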
Create sphinx.conf
  • Create the sphinx.conf configuration file and save it to /srv/www/htdocs/cats/modules/search. You will need to create the ‘search’ directory since it doesn’t exist yet. Be sure to specify your correct <catsuser> and <catspass> on the configuration lines where indicated. Also be sure to create the index file directory you specify under the index path if it doesn’t exist (/srv/www/htdocs/cats/modules/search/index).
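Both the search module directory and the index directory it contains can be created in one step, for example:
Code: Select all
mkdir -p /srv/www/htdocs/cats/modules/search/index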
Code: Select all
#
# sphinx configuration file for CATS
#

#############################################################################
## data source definition
#############################################################################

source catsdb
{
	type				= mysql
	strip_html			= 0
	index_html_attrs	=

	# some straightforward parameters for 'mysql' source type
	sql_host			= localhost
	sql_user			= <catsuser>
	sql_pass			= <catspass>
	sql_db			= cats
	sql_port			= 3306	# optional, default is 3306

	sql_query_pre		= REPLACE INTO sph_counter SELECT 1, MAX(attachment_id) from attachment
	sql_query			= \
		SELECT attachment_id, data_item_id, UNIX_TIMESTAMP(attachment.date_created) AS date_added, title, text, \
		last_name, first_name, notes, key_skills \
		FROM attachment left join candidate on data_item_id = candidate_id \
		where resume = 1 and attachment.site_id = 1 and data_item_type = 100 \
		and attachment_id <= (SELECT max_doc_id from sph_counter where counter_id = 1)

	sql_group_column	= data_item_id
	sql_date_column		= date_added
	sql_query_post		=
	sql_query_info		= SELECT * FROM attachment WHERE attachment_id=$id

}

source delta : catsdb
{
    sql_query_pre=
    sql_query           = \
        SELECT attachment_id, data_item_id, UNIX_TIMESTAMP(attachment.date_created) AS date_added, title, text, \
        last_name, first_name, notes, key_skills \
        FROM attachment left join candidate on data_item_id = candidate_id \
        where resume = 1 and attachment.site_id = 1 and data_item_type = 100 \
        and attachment_id > (SELECT max_doc_id from sph_counter where counter_id = 1)
    
}

#############################################################################
## index definition
#############################################################################

index cats
{
	source			= catsdb

	# this is path and index file name without extension
	#
	# indexer will append different extensions to this path to
	# generate names for both permanent and temporary index files
	#
	# .tmp* files are temporary and can be safely removed
	# if indexer fails to remove them automatically
	#
	# .sp* files are fulltext index data files. specifically,
	# .spa contains attribute values attached to each document id
	# .spd contains doclists and hitlists
	# .sph contains index header (schema and other settings)
	# .spi contains wordlists
	#
	# MUST be defined
	path			= /srv/www/htdocs/cats/modules/search/index/cats

	docinfo			= extern
	morphology			= none
	stopwords			=
	min_word_len		= 1
	charset_type		= sbcs
}

index catsdelta : cats
{
    source          = delta
    path            = /srv/www/htdocs/cats/modules/search/index/cats_delta

}

#############################################################################
## indexer settings
#############################################################################

indexer
{
	mem_limit			= 32M
}

#############################################################################
## searchd settings
#############################################################################

searchd
{

	address				= 127.0.0.1
	port				= 3312
	log				= /var/log/searchd.log
	query_log			= /var/log/query.log
	read_timeout		= 5
	max_children		= 30
	pid_file			= /var/run/searchd.pid
	# default is 1000 (just like with Google)
	max_matches			= 1000
}

# --eof--
Add sph_counter to CATS database
  • Create the sph_counter table in the CATS database.
Code: Select all
# in the MySQL client
USE cats;
CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);

Try Creating Your Index
  • Index your CATS database by running the indexer from the command line.
Code: Select all
helphand:~ # /usr/local/bin/indexer --all --config /srv/www/htdocs/cats/modules/search/sphinx.conf
Sphinx 0.9.7-RC2
Copyright (c) 2001-2006, Andrew Aksyonoff

using config file '/srv/www/htdocs/cats/modules/search/sphinx.conf'...
indexing index 'cats'...
collected 4668 docs, 20.7 MB
sorted 2.1 Mhits, 100.0% done
total 4668 docs, 20663522 bytes
total 3.324 sec, 6216481.50 bytes/sec, 1404.34 docs/sec
helphand:~ #      
Start the Searchd Daemon
  • Assuming the indexer ran without errors, your install is in good shape, so start the searchd daemon.
Code: Select all
helphand:~ # rcsearchd start
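To confirm the daemon actually came up, you can check its status and look at the log file it writes (/var/log/searchd.log, as configured in both the init script and sphinx.conf):
Code: Select all
helphand:~ # rcsearchd status
helphand:~ # tail /var/log/searchd.log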
Test Search from the Command Line
  • Now test the search from the command line by searching for a resume keyword.
Code: Select all
helphand:~ # search --config /srv/www/htdocs/cats/modules/search/sphinx.conf controller    
     [lots of returned output snipped for brevity]



        date_created=2006-10-11 13:38:19
        date_modified=2006-10-11 13:38:19

words:
1. 'controller': 402 documents, 776 hits
helphand:~ #
Set Up Cron for Regular Updates
  • Assuming the search returned the expected results, you are almost finished with the Sphinx install. You simply need to add a crontab entry that runs the delta indexer periodically throughout the business day, picking up any new candidate resumes added to the database during the day. Create the following file as /etc/cron.d/cats:
Code: Select all
# use /bin/sh to run commands, no matter what /etc/passwd says
SHELL=/bin/sh
# mail any output to `root', no matter whose crontab this is
MAILTO=root
PATH=/usr/local/bin
#

# Business Days, Business Hours

   20,50 7-17 * * Mon,Tue,Wed,Thu,Fri        root  $PATH/indexer --rotate --config /srv/www/htdocs/cats/modules/search/sphinx.conf catsdelta >>/dev/null
  • chmod 600 /etc/cron.d/cats
  • You’re done!
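As a quick final check, you can run the delta pass once by hand; this is the same command the cron entry will run:
Code: Select all
/usr/local/bin/indexer --rotate --config /srv/www/htdocs/cats/modules/search/sphinx.conf catsdelta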

If you have any questions about this procedure, or would like to try it out on other unix/linux flavors, drop me a line at mabdalla [at] meait.com. I'd be happy to help.