# ----- Example 6 - Spider using "prog" feature ------- # # Please see the swish-e documentation for # information on configuration directives. # Documentation is included with the swish-e # distribution, and also can be found on-line # at http://swish-e.org # # # This example demonstrates how to use the # new (as of 2.2) "prog" document source feature # to spider a webserver. # # The "prog" document source feature allows # an external program to feed documents to # swish, one after another. This allows you # to index documents from any source (e.g. web, DBMS) # and to filter and adjust the content before swish # indexes the content. # # This example uses the provided spider.pl program # to spider a remote web server. This spider offers # more features than the "http" spider method shown # in example7.config. # # ** Please don't test with this exact config ** # spider your own web server # # Indexing (spidering) is started with the following # command issued from the "conf" directory: # # swish-e -S prog -c example6.config # # Note: You should have the current Bundle::LWP bundle # of perl modules installed. This was tested with: # libwww-perl-5.53 # Run "perldoc spider.pl" in the prog-bin directory for # more information. # # ** Do not spider a web server without permission ** # #--------------------------------------------------- # Include our site-wide configuration settings: IncludeConfigFile example4.config # Specify the program to run IndexDir ../prog-bin/spider.pl # When running under the "prog" document source method you can # pass a list of parameters to the program (specified with -i or IndexDir). # If a parameter is passed to spider.pl, it will use that as the configuration # file. # As a special case, the word "default" followed by URL(s). # In this case the spider will use default settings to spider the provided URLs. SwishProgParameters default http://swish-e.org # Note: the default used by spider.pl is SwishSpiderConfig.pl. # See prog-bin/SwishSpiderConfig.pl for examples # that include filtering PDF and MS Word documents. # Tell swish that about how to parse the content DefaultContents HTML IndexContents HTML .htm .html IndexContents TXT .txt .conf # Just to make it interesting, let's modify the URL that get's indexed: # replace http://swish-e.org/ => http:/localhost/ ReplaceRules replace swish-e.org localhost # end of example