Configuring indexer.conf options
The indexing configuration file for web search is: /usr/local/mnogosearch/etc/indexer.conf
./indexer --help
indexer program options
Usage: indexer OPTIONS configfile :
Crawler options:
- -a : Reindex all documents, even if they have not expired (can be limited using -t, -u, -s, -c, -y and -f options)
- -m : Reindex expired documents, even if they have not been modified (can be limited using -t, -u, -c, -s, -y and -f options)
- -e : Visit the oldest ('most expired') documents first
- -o : Visit documents with less depth (hops value) first
- -r : Do not try to reduce the load on remote servers by randomising the crawler queue order (faster, but less polite)
- -n # : Visit only # documents and exit
- -c # : Visit only # seconds and exit
- -q : Quick startup (do not add Server URLs);
- -qq : even quicker
- -b : Block starting more than one instance of the indexer program
- -i : Insert new URLs (URLs to insert must be given using -u or -f)
- -p # : Sleep # seconds after downloading every URL
- -w : Do not ask for confirmation when clearing documents from the database (e.g.: indexer -Cw)
- -N # : Run # crawler threads
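As a rough illustration of how the crawler options above can be combined, here is a minimal sketch. The `run` helper and the `INDEXER` and `INDEXER_RUN` variables are hypothetical conveniences introduced for this example, not part of mnoGoSearch; by default the script only prints the commands it would run. The numeric values are arbitrary.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless INDEXER_RUN=1 is set.
INDEXER="${INDEXER:-./indexer}"   # assumed location of the indexer binary
run() {
    if [ "${INDEXER_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Polite incremental crawl: refuse to start a second instance (-b),
# visit the most expired documents first (-e), stop after 500 documents (-n)
# and sleep 1 second after each download (-p).
run "$INDEXER" -b -e -n 500 -p 1

# Time-boxed crawl: 4 crawler threads (-N), stop after 600 seconds (-c).
run "$INDEXER" -N 4 -c 600
```

Setting `INDEXER_RUN=1` in the environment makes the same script execute the commands instead of printing them.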
Subsection control options (can be combined):
- -s name : Limit indexer to documents matching status (HTTP Status code)
- -t name : Limit indexer to documents matching tag
- -g name : Limit indexer to documents matching category
- -y name : Limit indexer to documents matching content-type
- -L name : Limit indexer to documents matching language
- -u name : Limit indexer to documents with URLs matching pattern (supports SQL LIKE wildcards '%' and '_')
- --seed=name : Limit indexer to documents with the given seed (0-255)
- -D name : Work with the n-th database only (i.e. with the n-th DBAddr)
- -f name : Read URLs to be visited/inserted/deleted from file (with -a or -C option, supports SQL LIKE wildcard '%%'; has no effect when combined with -m option)
- -f - : Use stdin instead of a file as a URL list
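The subsection options above restrict a command to part of the database. The sketch below shows a few plausible combinations; the `run` helper, the `INDEXER` and `INDEXER_RUN` variables and the example.com URLs are assumptions made for illustration only. In its default dry-run mode the script prints the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless INDEXER_RUN=1 is set.
INDEXER="${INDEXER:-./indexer}"   # assumed location of the indexer binary
run() {
    if [ "${INDEXER_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Reindex everything under one URL prefix ('%' is the SQL LIKE wildcard).
run "$INDEXER" -a -u 'http://www.example.com/docs/%'

# Reindex only the documents recorded with HTTP status 404.
run "$INDEXER" -a -s 404

# Insert new URLs read from standard input (-i needs -u or -f; "-f -" means stdin).
run "$INDEXER" -i -f - <<'EOF'
http://www.example.com/new-page.html
EOF
```

Quoting the `-u` pattern keeps the shell from interpreting any special characters before indexer sees them.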
Logging options:
- -l : Do not log to stdout/stderr
- -v # : Verbose level (0-5)
Misc. options:
- -F name : Print compile configuration and exit (e.g.: indexer -F '*')
- -h, --help : Print help page and exit; -hh print more help
- -? : Print help page and exit;
- -?? : print more help
- -d name : Use the given configuration file instead of indexer.conf. This option is useful when running indexer as an interpreter, e.g.: #!/usr/local/sbin/indexer -d
- -j name : Set current time for statistics (use with -S), format: YYYY-MM[-DD[ HH[:MM[:SS]]]] or time offset, e.g. 1d12h (see Period in indexer.conf)
- --set=name : Set variable
Commands (can be used with subsection control options):
- --crawl : Crawl (default command)
- --index : Create search index
- --wordstat : Create statistics for misspelled word suggestions
- --rewriteurl : Rewrite URL data into the current search index
- --rewritelimits : Recreate all Limit, UserScore, UserOrder data
- -C, --delete : Delete documents from the database
- -S, --statistics : Print statistics and exit
- -I, --referers : Print referers and exit
- -R : Crawl then calculate popularity rank
Other commands:
- --create : Create SQL table structure and exit
- --drop : Drop SQL table structure and exit
- -Q, --sqlmon : Run interactive SQL monitor
- --exec=name : Execute SQL query
- --checkconf : Check configuration file for good syntax
- --hashspell : Create hash files for the active Ispell dictionaries
- --dumpspell : Dump Ispell data for use with SQLWordForms
- --dumpdata : Dump collected data using SQL statements
- --restoredata : Load previously dumped data (give a filename using -f)
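The commands listed above are typically chained into a short workflow. The following is only a sketch of one plausible sequence, not a prescribed procedure: the ordering is an assumption, the `run` helper and the `INDEXER`, `CONF` and `INDEXER_RUN` variables are hypothetical, and the configuration path is the one given at the top of this page. The script prints the commands by default instead of running them.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless INDEXER_RUN=1 is set.
INDEXER="${INDEXER:-./indexer}"                          # assumed binary location
CONF="${CONF:-/usr/local/mnogosearch/etc/indexer.conf}"  # path from this page
run() {
    if [ "${INDEXER_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

run "$INDEXER" --checkconf "$CONF"   # 1. check the configuration file syntax
run "$INDEXER" --create "$CONF"      # 2. create the SQL tables (first run only)
run "$INDEXER" --crawl "$CONF"       # 3. crawl (the default command)
run "$INDEXER" --index "$CONF"       # 4. create the search index
run "$INDEXER" -S "$CONF"            # 5. print statistics and exit
```

Step 2 is assumed to be needed only once, before the first crawl; later runs would repeat steps 3 to 5.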
Configuring the indexer.conf file
content-type
- UseRemoteContentType yes/no
- AddType [String|Regex] [Case|NoCase] <mime type> <arg> [<arg>…]
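A minimal sketch of how these two commands might appear in indexer.conf follows. The MIME types and file patterns are illustrative choices, and the comments reflect an assumed reading of the commands (that AddType maps file name patterns to MIME types, and that UseRemoteContentType decides whether the server's Content-Type header is preferred), so check them against the comments in your own indexer.conf before relying on them.

```
# Assumption: "yes" prefers the Content-Type header sent by the remote server.
UseRemoteContentType yes

# Assumption: map file name patterns to MIME types for documents
# whose type is not supplied by the server.
AddType text/html   *.html *.htm
AddType text/plain  *.txt
```

After editing the file, running indexer with the checkconf command listed above should confirm that the syntax is accepted.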