<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML
><HEAD
><TITLE
>Segmenters for Chinese, Thai and Japanese languages
  </TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REL="HOME"
TITLE="mnoGoSearch 3.3.8 reference manual"
HREF="index.html"><LINK
REL="UP"
TITLE="Multiple languages support"
HREF="msearch-international.html"><LINK
REL="PREVIOUS"
TITLE="Search pages with multi-lingual interface"
HREF="msearch-multilang.html"><LINK
REL="NEXT"
TITLE="Indexing multilingual servers"
HREF="msearch-vary.html"><LINK
REL="STYLESHEET"
TYPE="text/css"
HREF="mnogo.css"><META
NAME="Description"
CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META
NAME="Keywords"
CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD
><BODY
CLASS="sect1"
BGCOLOR="#EEEEEE"
TEXT="#000000"
LINK="#000080"
VLINK="#800080"
ALINK="#FF0000"
><!--#include virtual="body-before.html"--><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
><SPAN
CLASS="application"
>mnoGoSearch</SPAN
> 3.3.8 reference manual: Full-featured search engine software</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="msearch-multilang.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
>Chapter 9. Multiple languages support</TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="msearch-vary.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="sect1"
><H1
CLASS="sect1"
><A
NAME="cjk"
>Segmenters for Chinese, Thai and Japanese languages</A
></H1
><P
>Unlike Western languages, texts in the Asian languages
  Chinese, Thai and Japanese may not have spaces between words in a phrase.
  Thus, when indexing documents in these languages,
  a search engine needs to know how to
  split phrases into separate words, and it also
  needs to know the word boundaries when running a search query.
  <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> can find Asian word
  boundaries with the help of so-called <TT
CLASS="literal"
>segmenters</TT
>.
  </P
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="ja-segment"
>Japanese phrase segmenter
      <A
NAME="AEN4098"
></A
></A
></H2
><P
>&#13;    <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> can use the
    <SPAN
CLASS="application"
><A
HREF="http://chasen.aist-nara.ac.jp/"
TARGET="_top"
>ChaSen</A
></SPAN
> and
    <SPAN
CLASS="application"
><A
HREF="http://mecab.sourceforge.net/"
TARGET="_top"
>MeCab</A
></SPAN
>
    Japanese morphological analysis systems to break phrases into words.
    </P
><P
>To build <SPAN
CLASS="application"
>mnoGoSearch</SPAN
>
    with Japanese phrase segmenting, pass
    either the <CODE
CLASS="option"
>--with-chasen</CODE
> or the <CODE
CLASS="option"
>--with-mecab</CODE
>
    command-line switch to <SPAN
CLASS="application"
>configure</SPAN
>.
    </P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="zh-segment"
>Chinese phrase segmenter
      <A
NAME="AEN4114"
></A
></A
></H2
><P
>&#13;      <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> uses
      frequency dictionaries for Chinese phrase segmenting.
      Segmenting is implemented using the
      <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>dynamic programming method</I
></SPAN
>
      to maximize the cumulative frequency of the separate
      words produced from a phrase.
    </P
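><P
>As a rough illustration of the dynamic programming method, the
      following Python sketch picks the split whose summed word
      frequencies are largest. It is not the
      <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> implementation: the function name
      <TT
CLASS="literal"
>segment</TT
>, the in-memory frequency mapping, the
      <TT
CLASS="literal"
>max_word_len</TT
> limit and the zero score given to unknown
      single characters are all assumptions made for the example.
      Latin letters stand in for ideograms.
    </P
><PRE
CLASS="programlisting"
>
```python
# Hypothetical sample frequencies; a real dictionary file is far larger.
SAMPLE_FREQ = {"A": 10, "B": 10, "C": 5, "D": 5, "AB": 50, "BC": 30, "CD": 40}


def segment(phrase, freq, max_word_len=4):
    """Split phrase into words so that the sum of word frequencies is maximal."""
    n = len(phrase)
    # best[i] is the best cumulative frequency for phrase[:i];
    # back[i] is the start index of the last word in that best split.
    best = [0] + [None] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = phrase[start:end]
            if word in freq:
                score = freq[word]
            elif end - start == 1:
                score = 0  # assumption: unknown single characters score zero
            else:
                continue  # longer strings must be dictionary words
            total = best[start] + score
            if best[end] is None or total > best[end]:
                best[end] = total
                back[end] = start
    # Walk the back-pointers from the end of the phrase to recover the words.
    words = []
    end = n
    while end != 0:
        start = back[end]
        words.append(phrase[start:end])
        end = start
    words.reverse()
    return words


print(segment("ABCD", SAMPLE_FREQ))
```
</PRE
><P
>With these sample frequencies the phrase
      <TT
CLASS="literal"
>ABCD</TT
> is split into
      <TT
CLASS="literal"
>AB</TT
> and
      <TT
CLASS="literal"
>CD</TT
> (cumulative frequency 90) rather than into
      <TT
CLASS="literal"
>A</TT
>,
      <TT
CLASS="literal"
>BC</TT
>,
      <TT
CLASS="literal"
>D</TT
> (cumulative frequency 45).
    </P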
><P
>The <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> distribution
      includes two Chinese dictionaries:
      <TT
CLASS="filename"
>mandarin.freq</TT
>,
      a <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>Simplified Chinese</I
></SPAN
> dictionary,
      and <TT
CLASS="filename"
>TraditionalChinese.freq</TT
>,
      a <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>Traditional Chinese</I
></SPAN
> dictionary.
    </P
><DIV
CLASS="note"
><BLOCKQUOTE
CLASS="note"
><P
><B
>Note: </B
>
       When building <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> from source
       for use with the Chinese language, remember to add
       <CODE
CLASS="option"
>--with-extra-charsets=big5,gb2312</CODE
> when
       running <SPAN
CLASS="application"
>configure</SPAN
>.
      </P
></BLOCKQUOTE
></DIV
><P
>&#13;    Use the <B
CLASS="command"
><A
HREF="msearch-cmdref-loadchineselist.html"
>LoadChineseList</A
></B
>
    command to enable Chinese phrase segmenting, with this format:
<PRE
CLASS="programlisting"
>&#13;LoadChineseList [charset filename]
</PRE
>
    You can optionally specify the character set name and the
    file name of a dictionary.
    <DIV
CLASS="note"
><BLOCKQUOTE
CLASS="note"
><P
><B
>Note: </B
>
      <A
HREF="msearch-cmdref-loadchineselist.html"
>LoadChineseList</A
> will load the dictionary
      for <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>Simplified Chinese</I
></SPAN
> by default, that is, using
      the <TT
CLASS="literal"
>GB2312</TT
> character set and the file
      <TT
CLASS="filename"
>mandarin.freq</TT
>. You may still find it
      convenient to specify the default values explicitly:
<PRE
CLASS="programlisting"
>&#13;LoadChineseList gb2312 mandarin.freq
</PRE
>
      </P
></BLOCKQUOTE
></DIV
>
    To enable <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>Traditional Chinese</I
></SPAN
> segmenting,
    use this command:
<PRE
CLASS="programlisting"
>&#13;LoadChineseList big5 TraditionalChinese.freq
</PRE
>
    </P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="th-segment"
>Thai phrase segmenter
      <A
NAME="AEN4146"
></A
></A
></H2
><P
>Thai segmenting uses the same method as
      Chinese segmenting, with the help of
      a Thai frequency dictionary, <TT
CLASS="filename"
>thai.freq</TT
>,
      which is included in the <SPAN
CLASS="application"
>mnoGoSearch</SPAN
>
      distribution.
    </P
><P
>&#13;    Use the <B
CLASS="command"
><A
HREF="msearch-cmdref-loadthailist.html"
>LoadThaiList</A
></B
>
    command to enable Thai phrase segmenting, with this format:
<PRE
CLASS="programlisting"
>&#13;LoadThaiList [charset dictionaryfilename]
</PRE
>
    <DIV
CLASS="note"
><BLOCKQUOTE
CLASS="note"
><P
><B
>Note: </B
>
      The <TT
CLASS="literal"
>TIS-620</TT
> character set and the file
      <TT
CLASS="filename"
>thai.freq</TT
> are used by default. That is,
      using <B
CLASS="command"
><A
HREF="msearch-cmdref-loadthailist.html"
>LoadThaiList</A
></B
>
      without any arguments is effectively the same as this command:
<PRE
CLASS="programlisting"
>&#13;LoadThaiList tis-620 thai.freq
</PRE
>
      </P
></BLOCKQUOTE
></DIV
>
    </P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="cjk-segment"
>The <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
> phrase segmenter
      <A
NAME="AEN4166"
></A
></A
></H2
><P
>Starting with version <TT
CLASS="literal"
>3.3.8</TT
>,
      <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> also supports
      a special universal segmenter which is suitable
      for Japanese, Traditional Chinese and Simplified Chinese.
      The universal <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
> segmenter does not use
      dictionaries and does not require external libraries.
    </P
><P
>&#13;    You can enable the <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
> segmenter by
    adding this command to both
    <TT
CLASS="filename"
>indexer.conf</TT
> and <TT
CLASS="filename"
>search.htm</TT
>:
<PRE
CLASS="programlisting"
>&#13;Segmenter cjk
</PRE
>
    </P
><P
>&#13;    The <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
> segmenter considers all
    ideogram characters from the Unicode blocks
    <TT
CLASS="literal"
>CJK Ideographs Extension A (U+3400 - U+4DB5)</TT
>
    and  <TT
CLASS="literal"
>CJK Ideographs (U+4E00 - U+9FA5)</TT
>
    as separate words. 
    When indexing a document using the <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
>
    segmenter, <SPAN
CLASS="application"
>mnoGoSearch</SPAN
>
    stores information about every ideogram character separately.
    </P
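><P
>The indexing rule can be sketched in Python as follows. Only
    the two code point ranges come from the text above. The names
    <TT
CLASS="literal"
>is_cjk_ideograph</TT
> and
    <TT
CLASS="literal"
>index_words</TT
>, and the way other characters are grouped into
    words, are assumptions made for illustration and do not describe the
    actual <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> code.
    </P
><PRE
CLASS="programlisting"
>
```python
# Ranges quoted in the text above (inclusive code points).
CJK_RANGES = ((0x3400, 0x4DB5), (0x4E00, 0x9FA5))


def is_cjk_ideograph(ch):
    """Return True if ch lies in one of the two Unicode ranges."""
    code = ord(ch)
    return any(code in range(low, high + 1) for low, high in CJK_RANGES)


def index_words(text):
    """Split text into words, making every ideogram a word of its own."""
    words = []
    run = ""  # current run of non-ideographic letters and digits
    for ch in text:
        if is_cjk_ideograph(ch):
            if run:
                words.append(run)
                run = ""
            words.append(ch)
        elif ch.isalnum():
            run += ch
        elif run:
            # Assumption: anything else (spaces, punctuation) ends a word.
            words.append(run)
            run = ""
    if run:
        words.append(run)
    return words


# The escapes are four ideograms from the U+4E00 - U+9FA5 block.
print(index_words("mnoGoSearch \u641c\u7d22\u5f15\u64ce"))
```
</PRE
><P
>In this sketch the Latin word stays whole, while each of the four
    ideograms becomes a separate entry in the resulting list.
    </P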
><P
>&#13;    At search time, the search query you type
    is preprocessed by the <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
> segmenter,
    which inserts delimiters between the ideograms.
    </P
><P
>&#13;    If you pass the <TT
CLASS="literal"
>m=phrase</TT
> 
    query string parameter to <SPAN
CLASS="application"
>search.cgi</SPAN
>
    (which means <TT
CLASS="literal"
>exact phrase search</TT
>),
    the <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
> segmenter uses the 
    dash character as a delimiter, and the space character
    otherwise (that is, in the <TT
CLASS="literal"
>all words</TT
>
    and <TT
CLASS="literal"
>any of the words</TT
> search modes).
    </P
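><P
>A minimal Python sketch of this delimiter rule is shown below.
    The function names and the decision to insert a delimiter only
    between two neighbouring ideograms are assumptions made for the
    example; the real preprocessing inside
    <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> may differ in detail.
    </P
><PRE
CLASS="programlisting"
>
```python
def is_cjk_ideograph(ch):
    """Return True for the two ideograph ranges used by the segmenter."""
    code = ord(ch)
    return code in range(0x3400, 0x4DB6) or code in range(0x4E00, 0x9FA6)


def preprocess_query(query, mode="all"):
    """Insert a delimiter between adjacent ideograms of a query.

    mode mirrors the m= query string parameter: "phrase" selects the
    dash, any other value selects the space.
    """
    delimiter = "-" if mode == "phrase" else " "
    parts = []
    previous = ""
    for ch in query:
        # Only a pair of neighbouring ideograms gets a delimiter (assumption).
        if previous and is_cjk_ideograph(previous) and is_cjk_ideograph(ch):
            parts.append(delimiter)
        parts.append(ch)
        previous = ch
    return "".join(parts)


query = "\u641c\u7d22\u5f15\u64ce"  # four ideograms
print(preprocess_query(query, "all"))     # space-separated ideograms
print(preprocess_query(query, "phrase"))  # dash-separated ideograms
```
</PRE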
><P
>&#13;    Imagine you type the query ``<KBD
CLASS="userinput"
>ABCD</KBD
>'',
    where <TT
CLASS="literal"
>A</TT
>, <TT
CLASS="literal"
>B</TT
>,
    <TT
CLASS="literal"
>C</TT
>, <TT
CLASS="literal"
>D</TT
> are
    some ideographic characters. When
    the <TT
CLASS="literal"
>exact phrase search</TT
> mode is
    <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>not</I
></SPAN
> active, your query will be
    preprocessed by the <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
>
    segmenter to ``<TT
CLASS="literal"
>A B C D</TT
>'' and 
    the four individual "words" will be searched for. Note that
    <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> ranks documents
    with a smaller distance between the query words higher than
    documents having the same words in different parts
    of the document, so if you have some documents containing
    the exact phrase <TT
CLASS="literal"
>ABCD</TT
>,
    it is very likely that they will be 
    displayed in the top <TT
CLASS="literal"
>10</TT
> results.
    </P
><DIV
CLASS="note"
><BLOCKQUOTE
CLASS="note"
><P
><B
>Note: </B
>
      You can try different values for the
      <A
HREF="msearch-cmdref-worddistanceweight.html"
>WordDistanceWeight</A
> command
      to see how distances between the query words
      in the found documents affect their final score.
      </P
></BLOCKQUOTE
></DIV
><P
>&#13;    Now imagine you type the same query ``<KBD
CLASS="userinput"
>ABCD</KBD
>''
    with the <TT
CLASS="literal"
>exact phrase search</TT
>
    mode <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>enabled</I
></SPAN
>. The query will be
    preprocessed by the <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
>
    segmenter to ``<TT
CLASS="literal"
>A-B-C-D</TT
>''.
    The dash character forces automatic phrase
    search (see <A
HREF="msearch-doingsearch.html#search-phrase"
>the Section called <I
>Phrase search
    <A
NAME="AEN4873"
></A
></I
> in Chapter 10</A
> for details
    on automatic phrase search), so
    only documents with an exact phrase match will be found.
    </P
><DIV
CLASS="note"
><BLOCKQUOTE
CLASS="note"
><P
><B
>Note: </B
>
    You can also use the ordinary <SPAN
CLASS="application"
>mnoGoSearch</SPAN
>
    query syntax with quotes to enable phrase searches without having
    to pass the <TT
CLASS="literal"
>m=phrase</TT
> query string variable
    (<TT
CLASS="literal"
>exact phrase search</TT
> mode).
    For example, if you type ``<TT
CLASS="literal"
>"AB" "CD"</TT
>'',
    then the documents having the ideogram <TT
CLASS="literal"
>A</TT
>
    immediately followed by the ideogram <TT
CLASS="literal"
>B</TT
>,
    and at the same time, the ideogram <TT
CLASS="literal"
>C</TT
>
    immediately followed by the ideogram <TT
CLASS="literal"
>D</TT
>
    will be found. The relative positions of the phrases
    <TT
CLASS="literal"
>AB</TT
> and <TT
CLASS="literal"
>CD</TT
> do
    not affect the result set; they affect only the
    result ordering.
    </P
></BLOCKQUOTE
></DIV
><P
>Although the <ACRONYM
CLASS="acronym"
>CJK</ACRONYM
> phrase segmenter
    is not aware of the real word boundaries, tests made
    by native speakers indicated that in many
    cases it works better and more predictably
    than the
    <SPAN
CLASS="application"
>MeCab</SPAN
>-based,
    <SPAN
CLASS="application"
>ChaSen</SPAN
>-based,
    and frequency-based segmenters.
    </P
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="msearch-multilang.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="msearch-vary.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Search pages with multi-lingual interface
    <A
NAME="AEN3955"
></A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="msearch-international.html"
ACCESSKEY="U"
>Up</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Indexing multilingual servers</TD
></TR
></TABLE
></DIV
><!--#include virtual="body-after.html"--></BODY
></HTML
>