======================================================================
Copyright (c) 1993 Massachusetts Institute of Technology and The
University of Pennsylvania.  All rights reserved.
Written by Eric Brill (brill@goldilocks.lcs.mit.edu)
(After July 1 1994, my email address will be brill@blaze.cs.jhu.edu)
Feel free to contact me with any questions, comments, . . .
========================================================================
THIS SOFTWARE IS PROVIDED "AS IS", AND M.I.T. MAKES NO REPRESENTATIONS 
OR WARRANTIES, EXPRESS OR IMPLIED.  By way of example, but not 
limitation, M.I.T. MAKES NO REPRESENTATIONS OR WARRANTIES OF 
MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF 
THE LICENSED SOFTWARE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY 
PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
========================================================================

The rule-based part of speech tagger works by first assigning each
word its most likely tag, and then changing word taggings based on
contextual cues.  There are two stages in training: 

(1) Rules are learned to predict the most likely tag for unknown
words.  (Example: if a word ends in "ed", it is probably a past tense
verb).  These rules operate on word types.  If the outcome of applying
these rules is that a word should be tagged with a particular tag,
this holds for all occurrences of the word in the corpus.  

(2) Rules are learned to use contextual cues to improve tagging accuracy.
(Example: change the tag of a word from verb to noun if the previous
word is tagged as a determiner).  These rules operate on individual
word tokens.

=================================================================
IMPORTANT: The tagger does no tokenization.  In all text files used
for training (both tagged and untagged text), the text should be
pretokenized, with one sentence per line.  Tokenization includes
separating punctuation from the words it is attached to, so:

    John (I think) said: "Who are you?"  He then gave me $10.

would become:

John ( I think ) said : " Who are you ? "
He then gave me $ 10 .
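
If you need to produce such tokenization yourself, a rough sketch of a
Perl filter is given below.  (This script is NOT part of this
distribution and its name is made up; it assumes the input already has
one sentence per line and handles only the simple cases shown above.)

    # tokenize-sketch.prl (hypothetical) -- separate punctuation from words.
    # Does no sentence splitting; input must already be one sentence per line.
    while (<STDIN>) {
        chop;
        s/([,;:?!()"])/ $1 /g;          # set off punctuation and quote marks
        s/(\$)([0-9])/$1 $2/g;          # split "$10" into "$ 10"
        s/([^ \t])(\.)[ \t]*$/$1 $2/;   # detach a sentence-final period
        s/[ \t]+/ /g;                   # squeeze whitespace
        s/^ //; s/ $//;
        print "$_\n";
    }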

Tagged text is of the form: word/tag.  Below is a sample tagged
sentence from the Penn Treebank Brown Corpus:

This/DT is/VBZ not/RB a/DT program/NN of/IN socialized/VBN medicine/NN ./. 

==================================================================

For the discussion below, we assume there already exists a large
manually tagged corpus from which to train.  If instead you are
starting from scratch (which is more likely the case), you can
iteratively build the tagger as follows: first manually tag a small
amount of your text corpus.  Then train the tagger as described below.
Then use the trained tagger to tag more text.  Then manually correct
the output of the tagger (which should be considerably faster than
manually tagging from scratch).  Then retrain the tagger on all of the
available manually tagged text.  And so on, until either the tagger
appears to be sufficiently accurate, or you run out of patience.

Given a manually tagged corpus, randomly split it, sentence by
sentence, into two corpora of equal size.  (This can be done using the
program Utilities/divide-in-two-rand.prl, provided with this release).
The first of these tagged corpora will be used for learning rules to
predict the most likely tag for unknown words, and the second will be
used for learning contextual rules.  
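
If you would rather not use that script, the split itself is simple.
The following rough sketch (NOT part of this distribution; the file
names are placeholders) sends each sentence of the tagged corpus, read
from standard input, to one of two files at random:

    # divide-sketch.prl (hypothetical) -- random sentence-by-sentence split.
    srand(time ^ $$);
    open(ONE, "> HALF-1") || die "cannot write HALF-1: $!\n";
    open(TWO, "> HALF-2") || die "cannot write HALF-2: $!\n";
    while (<STDIN>) {
        if (rand() < 0.5) { print ONE $_; } else { print TWO $_; }
    }

Note that this gives two halves that are only approximately equal in
size; the provided script should be preferred.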

There is a trade-off between accuracy and training time.  The larger
the training subcorpora are, the more accurate tagging will be.
However, training on a very large corpus can take a lot of time.  On
my Sparc-10, training the unknown word module on 250,000 words of
manually tagged text takes about 3 days, and training the contextual
cues module on another 500,000 word corpus takes about 1 day (with
high thresholds: see below).  Training on smaller corpora is much much
faster.  If training time matters, you might want to examine more
efficient implementations or consider parallelizing the programs.  Or,
set the threshold for rule learning to a higher number.  Rule scores
seem to be Zipf-like.  Changing the threshold from 2 to 3 (meaning
that training halts when no rule can be found whose score is >= 3)
usually results in 1/2 the number of learned rules, with only a
marginal decrease in performance.

=======================================================================

LEARNING RULES TO PREDICT THE MOST LIKELY TAG FOR UNKNOWN WORDS:

This learning module uses the first half of the manually tagged
corpus, as well as all available untagged text.  The untagged text
will include the entire manually tagged text (both halves), with tags
removed (the utility Utilities/tagged-to-untagged.prl can be used for
this purpose), as well as any additional text that is available (of
the same type).  In the remainder of this section, we will refer to
the first half of the manually tagged corpus as TAGGED-CORPUS, and the
collection of all untagged text as UNTAGGED-CORPUS.


Before learning can proceed, a number of files must be created.  

(1) BIGWORDLIST is a list of all words occurring in UNTAGGED-CORPUS,
sorted by decreasing frequency.  To create this list, give the
command:

cat UNTAGGED-CORPUS | Utilities/wordlist-make.prl | sort +1 -rn | \
 awk '{ print $1}' > BIGWORDLIST
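
(Note: the +1 field-selection syntax is obsolete and is rejected by
some modern versions of sort; in that case, sort -rn -k2 here, and
sort -rn -k3 in the command under (2) below, should be equivalent.)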


(2) SMALLWORDTAGLIST is a list of entries of the form (word tag count),
giving the number of times each word was tagged with each tag in
TAGGED-CORPUS.  For
example:

, , 1150
the DT 1005
. . 997
of IN 620
to TO 517
a DT 468
in IN 432

To create this list, give the command:

cat TAGGED-CORPUS | Utilities/word-tag-count.prl \
	| sort +2 -rn > SMALLWORDTAGLIST 

(3) BIGBIGRAMLIST is a list of all pairs of adjacent words (bigrams)
occurring in UNTAGGED-CORPUS.  For example:

large portion
software inventories
's surprise

To create this file, give the command:

cat UNTAGGED-CORPUS | Utilities/bigram-generate.prl | \
	awk '{ print $1,$2}' > BIGBIGRAMLIST

If space is a problem, you can throw out bigrams that occur only once
or twice by replacing the awk command with:

	awk '{ if ($3>2) print $1,$2}'


At this point, you are ready to run the rule learner.  Unfortunately,
the current version of this program is in Perl, and is very slow.
(All other programs are in C).  To run the program, give the command
(this learning program is located in the Learner_Code directory):

unknown-lexical-learn.prl BIGWORDLIST SMALLWORDTAGLIST BIGBIGRAMLIST \
	 300 LEXRULEOUTFILE

where BIGWORDLIST, SMALLWORDTAGLIST and BIGBIGRAMLIST are the files
created above, and LEXRULEOUTFILE is the name of the file the learned
rules will be written to.  The number 300 is used for efficiency, telling
the learner to only use bigram contexts where one of the two words is
among the 300 most frequent words.  There is a trade-off of space/time
for training versus accuracy.  If you really care, you can experiment
with different numbers, increasing the number for potentially higher
accuracy, and decreasing it if the learner is too slow or is eating up
too much memory.  

The program can be run to completion or killed when "enough" rules are
learned.  The program terminates when the score of the best found rule
drops below a threshold.  This threshold can be altered in the source
code.  Look in unknown-lexical-learn.prl for the line:

$THRESHOLD = 3;

The table below gives rough guidelines for appropriate thresholds for
different sizes of manually tagged training corpora.  Use a lower
threshold if you are willing to let the program run longer for better
performance, and a higher one if you want to sacrifice performance for
quicker training.

        SIZE (words)          THRESHOLD
        <50K                      3
        50K-100K                  4
        100K-200K                 5
        200K-400K                 6
        >400K                     8


After training, a list of rules will have been learned, which can be
used to predict the most likely tag for unknown words.  The real
numbers output in the last field of each rule are the scores of the
rules.  They only serve as place-holders in tagging, and can be
replaced with any character string (no numbers are actually used in
tagging).  To give you an idea of the meanings of the rules, here are
some example rules with glosses:

NN s fhassuf 1 NNS 3022.2549507603848724
  Change the tag from NN to NNS if the word has the suffix "s".
ed hassuf 2 VBN 1203.5435442040788985
  Change the tag to VBN if the suffix is "ed" (regardless of current tag)
  (note the difference between fhassuf and hassuf)
ly addsuf 2 JJ 578.41553589321256368
  Change the tag to JJ if adding the suffix "ly" results in a word.
- char JJ 425.28095238095232844
  Change the tag to JJ if the character "-" appears in the word.
NN to fgoodright VB 224.95212906910632
  Change the tag from NN to VB if the word is ever seen immediately to
	the right of the word "to".
un deletepref 2 JJ 67.245614035087726279
  Change the tag to JJ if deleting the prefix "un" results in a word.


The way this works in tagging is that each unknown word is assigned a
tag as an initial guess, and then each learned rule is applied, in
order.  The method for assigning the initial guess is hardcoded in 
both unknown-lexical-learn.prl and start-state-tagger.c.  The
algorithm currently used is:

	If the word starts with a capital letter, assign the tag
	"NNP", otherwise assign the tag "NN".

If you want to change this, you must do so in two programs:
   (1) unknown-lexical-learn.prl 
   (2) start-state-tagger.c 

(If you are also using unknown-lexical-learn-continue.prl, the change
must be made there as well).  Look for the comment "START STATE
ALGORITHM" in these programs for information on how to change this.
IMPORTANT: make sure both of these programs have the same start state
tagging algorithm!!
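
For reference, the default start-state guess described above amounts to
something like the following (a sketch only, not the actual code from
either program):

    # Sketch of the default start-state rule: words starting with a
    # capital letter are guessed to be proper nouns (NNP), everything
    # else a common noun (NN).
    sub start_state_tag {
        local($word) = @_;
        return ($word =~ /^[A-Z]/) ? "NNP" : "NN";
    }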


There is also a program called unknown-lexical-learn-continue.prl,
which enables you to continue training given a partially completed
list of learned rules.  If the rules that have been learned
are in the file OLDLEXRULEFILE, and you want the new rules to be
written to LEXRULEOUTFILE, then call:

unknown-lexical-learn-continue.prl BIGWORDLIST SMALLWORDTAGLIST \
	BIGBIGRAMLIST 300 LEXRULEOUTFILE OLDLEXRULEFILE

When training is completed, a full set of rules can be obtained by
concatenating LEXRULEOUTFILE to the end of OLDLEXRULEFILE.

======================================================================

LEARNING CONTEXTUAL CUES:

The next stage of learning involves learning rules to improve accuracy
based on contextual cues, after tagging all words with their most
likely tag.  Before learning, a couple of extra files must be created.
In the descriptions below, TAGGED-CORPUS refers to the same tagged
corpus used above (the first half of the entire tagged corpus),
TAGGED-CORPUS-2 refers to the second half (as yet unused), and
TAGGED-CORPUS-ENTIRE refers to the entire manually tagged corpus.

First, we must create a file TRAINING.LEXICON.  A lexicon for the
tagger contains lines of the form:

WORD  TAG1 TAG2 . . . TAGN

where TAG1 is the most likely tag for WORD in the training corpus, and
TAG2 through TAGN are other possible tags (in no particular order).
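
For example, a (hypothetical) entry such as:

can MD NN VB

would mean that "can" was most often tagged MD in the training corpus,
and was also seen tagged NN and VB.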

Give the command (make-restricted-lexicon.prl is in the Utilities/
directory):

cat TAGGED-CORPUS | make-restricted-lexicon.prl > TRAINING.LEXICON 


If you have divided your corpus differently, the file
TRAINING.LEXICON should be derived from all manually tagged text you
have available, excluding the portion TAGGED-CORPUS-2 that will be
used to train the contextual rules.

Next, run:

cat TAGGED-CORPUS-ENTIRE | Utilities/make-restricted-lexicon.prl > \
	FINAL.LEXICON 

FINAL.LEXICON will then be the lexicon you use to tag text once all
training is completed.

After this, execute:

cat TAGGED-CORPUS-2 | tagged-to-untagged.prl > UNTAGGED-CORPUS-2

Then run the tagger (the tagger executable is in the Bin_and_Data
directory, and must be run from that directory):

tagger TRAINING.LEXICON UNTAGGED-CORPUS-2 BIGBIGRAMLIST \
   LEXRULEOUTFILE /dev/null -w BIGWORDLIST  \
   -i DUMMY-TAGGED-CORPUS > /dev/null

(Or, you can give the following command, which is equivalent:
  tagger TRAINING.LEXICON UNTAGGED-CORPUS-2 BIGBIGRAMLIST \
     LEXRULEOUTFILE /dev/null -w BIGWORDLIST -S > DUMMY-TAGGED-CORPUS )

The contextual-rule learner compares the initial-state tagging in
DUMMY-TAGGED-CORPUS against the manually assigned tags in
TAGGED-CORPUS-2 and learns rules that correct the discrepancies.  To
learn contextual rules, execute (the executable contextual-rule-learn
is in the Bin_and_Data directory):

contextual-rule-learn TAGGED-CORPUS-2 DUMMY-TAGGED-CORPUS \
	CONTEXT-RULEFILE TRAINING.LEXICON

where CONTEXT-RULEFILE is the file rules will be written to.

Helpful hint:  If something goes wrong, and the program terminates
before completing the training, you do not have to restart.  You can
simply make a new DUMMY-TAGGED-CORPUS using the tagger and the set of
already learned rules (presumably by supplying the partially written
CONTEXT-RULEFILE in place of /dev/null in the tagger command above),
and then continue training from there.

The table below gives rough guidelines for appropriate thresholds for
different sizes of manually tagged training corpora.  Use a lower
threshold if you are willing to let the program run longer for better
performance, and a higher one if you want to sacrifice performance for
quicker training.

        SIZE (words)          THRESHOLD
        <50K                      2
        50K-100K                  3
        100K-200K                 5
        200K-400K                 8
        >400K                    15

=================================================================

For a more thorough description of the learning programs, see:

\bibitem[Bri92] E.~Brill.
\newblock A simple rule-based part of speech tagger.
\newblock In {\em Proceedings of the Third Conference on Applied Natural
  Language Processing, ACL}, Trento, Italy, 1992.

\bibitem[Bri93] E.~Brill.
\newblock {\em A Corpus-Based Approach to Language Learning}.
\newblock PhD thesis, Department of Computer and Information Science,
  University of Pennsylvania, 1993.

\bibitem[Bri94] E.~Brill.
\newblock Some advances in rule-based part of speech tagging.
\newblock In {\em Proceedings of the Twelfth National Conference on
  Artificial Intelligence (AAAI-94)}, Seattle, WA, 1994.

(these papers are available from anonymous ftp to
lightning.lcs.mit.edu in pub/BRILL/Papers.  The last paper is included
with this distribution, in the Docs/ directory.  After July 1, 1994,
contact me at brill@blaze.cs.jhu.edu for information about how to
obtain these papers).