Sophie

Sophie

distrib > Mandriva > 2010.0 > i586 > media > contrib-release > by-pkgid > 6a7802c6c3139de476e21dfe96b215ad > files > 28

brilltagger-1.14-10mdv2010.0.i586.rpm

/*  Copyright @ 1993 MIT and the University of Pennsylvania*/
/* Written by Eric Brill */
 brill@goldilocks.lcs.mit.edu
 (After July 1 1994, my email address will be brill@blaze.cs.jhu.edu)
 Feel free to contact me with any questions, comments, . . .
===============================================================
THIS SOFTWARE IS PROVIDED "AS IS", AND M.I.T. MAKES NO REPRESENTATIONS 
OR WARRANTIES, EXPRESS OR IMPLIED.  By way of example, but not 
limitation, M.I.T. MAKES NO REPRESENTATIONS OR WARRANTIES OF 
MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF 
THE LICENSED SOFTWARE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY 
PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
===============================================================
You MUST read the copyright file included with the tagger (COPYRIGHT)
before doing anything else. 
===============================================================

To compile the programs, type (in the tagger base directory):

 make

or first edit the Makefile to suit your needs.

===============================================================

(If you have altered the file structure of the tagger after untarring
the programs, then you will have to adjust the instructions
accordingly). 

First cd into the directory Bin_and_Data/.

To execute the program, type:

tagger LEXICON YOUR-CORPUS BIGRAMS LEXICALRULEFULE CONTEXTUALRULEFILE 

where YOUR-CORPUS is the file name of the corpus you wish to have
tagged, and the other files are all provided with the tagger.

Options (which are typed after the file names) are:

-h             :: help

-w wordlist    :: provide an extra set of words beyond those in LEXICON.
	          For more information about this, read README.LONG.

-i filename    :: writes intermediate results from start state tagger
		  into filename

-s number      :: processes the corpus to be tagged "number" lines at
		  a time.  This should be specified if memory problems
	          result from trying to process too large a corpus at
		  once.  For more information about this, read
		  README.LONG. 

-S 	       :: use start state tagger only.

-F             :: use final state tagger only.  In this case,
	         YOUR-CORPUS is a tagged corpus, whose taggings will
	         be changed according to the final-state-tagger
	         contextual rules.  YOUR-CORPUS should be a tagged
	         corpus ONLY when using this option.
			
	
	
The tagger writes to standard output.

========================================================================


IMPORTANT: Tokenization and vocabulary follow the Penn Treebank (since
this is what the tagger was trained on).  Punctuation must be removed
from words, etc.  Example:

We  're going today , are you ?
`` I 'm hungry , '' he said .

This could be fixed by adding lexical entries for we're, I'm, ", etc.

Also, one sentence per line is assumed.

===================================================================