<html><head><title>GP</title>

<link rev="made" href="mailto:january@bioinformatics.org">
</head>
<body bgcolor="#FFFFFF" link="#FF0000">

<hr>

<h1>GP</h1>
<h2>GP</h2>
<h2>April 2000</h2>


    
<h2>NAME</h2>
    <strong>GP</strong> - utilities to manipulate DNA / RNA / protein sequences
<p>Copyright (C) 2000 January Weiner III &lt;january@bioinformatics.org&gt;
<p><strong>GP</strong> homepage: 
<p><code>http://www.bioinformatics.org/genpak/</code>
<p><h2>LICENSE</h2>
    
<p><strong>GP</strong> is GPL'ed. Please read the file LICENSE.TXT for details.
<p><h2>DESCRIPTION</h2>
    
<p><strong>GP</strong> is a set of small utilities written in ANSI C to manipulate DNA
sequences in a Unix fashion, fit for combining within shell and CGI
scripts. Some example CGI scripts are provided in the cgi directory. I
wrote these utilities for myself and found them very useful for my work;
they are fast and quite reliable, and playing with large numbers of
sequences is much more convenient than with standard GUI tools. Feel free
to mail me bug reports and suggestions.
<p>The sequences are usually in FASTA format: the first line is the
sequence name starting with "&gt;", and the sequence follows on the next lines. The
programs also accept gzipped sequence files (that is, if zlib support was
defined at compile time, which is the default).
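<p>As a rough illustration of the input format described above (this is not
part of GP itself, which is written in C), here is a minimal Python sketch that
reads FASTA records from a plain or gzipped file. The function names
<code>open_maybe_gzip</code> and <code>read_fasta</code> are made up for this
example, and treating the text after "&gt;" as the sequence name is an assumption.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing it on the fly if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # the two-byte gzip signature
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

<p>A file holding several sequences simply yields several pairs, one per
header line.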
<p>Upon installation, <strong>GP</strong> creates a directory where it stores all its
data. By default, this is the /usr/lib/genpak directory. If one of the
programs cannot find a file given as an argument, it looks for it
in this directory, and exits with an error only if it is not there
either. You can put some shared files into this directory; note that they
will not be erased upon uninstallation or reinstallation. However, in the
latter case they might well get overwritten if you replaced the original
<strong>GP</strong> files with your own.
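<p>The lookup order described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not GP's actual C code; the function
name <code>resolve_input</code> and its <code>data_dir</code> parameter are
hypothetical.

```python
import os

DATA_DIR = "/usr/lib/genpak"  # default data directory named in the text


def resolve_input(filename, data_dir=DATA_DIR):
    """Return the path to use for filename, or raise FileNotFoundError."""
    # First choice: the file exactly as the user named it.
    if os.path.isfile(filename):
        return filename
    # Fallback: the same name inside the shared data directory.
    candidate = os.path.join(data_dir, filename)
    if os.path.isfile(candidate):
        return candidate
    raise FileNotFoundError(f"{filename} not found here or in {data_dir}")
```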
<p>All programs share some common options:
<p><dl>
<p><li > <strong>-h</strong> prints out a quick summary of options
<p><li > <strong>-H</strong> output in HTML mode
<p>Some programs can print nicely formatted tables or produce some other HTML
		specialized output. All programs collect warning messages and display them as
		the last thing before exiting. Do not use this option if you intend to feed
		other programs with the standard output.
<p><li > <strong>-v</strong> prints version information
<p><li > <strong>-q</strong> suppresses all error messages ("quiet")
<p><li > <strong>-d</strong> prints out debugging information
<p></dl>
<p>Most programs also accept standard input (that was one of the main reasons
why I wrote these utilities anyway), and by default write the results to
standard output. This way, you have several ways of running the
programs:
<p><code>cat sequence.fasta | program &gt; program.output</code>
<p><code>some_other_program | program | yet_another_program</code>
<p><code>program input.file output.file</code>
<p><code>program</code>
<p>In the last case, you have to type in or paste any data the program expects to
find on the standard input, and the program writes the processed data
directly to the screen.
<p>In most cases, you can use multiple sequences stored in one file in
FASTA format. Programs that require a sequence file keep working until
all the sequences that can be retrieved from the input (a file or standard
input) are processed.
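<p>To make the filter convention concrete, here is a small Python sketch of a
program that reads from a named file or from standard input and keeps going
until the input is exhausted. It is an illustration only, not how the GP
programs are implemented; <code>open_input</code> and
<code>process_stream</code> are invented names, and handling of an output
file argument is left out.

```python
import sys


def open_input(args, stdin=None):
    """Return the named file if an argument is given, else standard input."""
    if args:
        return open(args[0])
    return stdin if stdin is not None else sys.stdin


def process_stream(source, sink, transform):
    """Copy FASTA text from source to sink, transforming each sequence line.

    Header lines (starting with ">") are passed through unchanged.
    Returns the number of sequences seen.
    """
    count = 0
    for line in source:
        if line.startswith(">"):
            count += 1
            sink.write(line)
        else:
            sink.write(transform(line.rstrip("\n")) + "\n")
    return count
```

<p>Calling <code>process_stream(open_input(sys.argv[1:]), sys.stdout, str.upper)</code>
would then behave like a filter that can sit anywhere in a shell pipeline.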
<p><h2>LIST OF PROGRAMS</h2>
    
<p><dl>
<li > <a href="README.html#gp_qs">gp_qs</a>
<li > <a href="README.html#gp_getseq">gp_getseq</a>
<li > <a href="README.html#gp_gc">gp_gc</a>
<li > <a href="README.html#gp_map">gp_map</a>
<li > <a href="README.html#gp_tm">gp_tm</a>
<li > <a href="README.html#gp_matrix">gp_matrix</a>
<li > <a href="README.html#gp_mkmtx">gp_mkmtx</a>
<li > <a href="README.html#gp_shift">gp_shift</a>
<li > <a href="README.html#gp_randseq">gp_randseq</a>
<li > <a href="README.html#gp_cusage">gp_cusage</a>
<li > <a href="README.html#gp_seq2prot">gp_seq2prot</a>
<li > <a href="README.html#gp_findorf">gp_findorf</a>
<li > <a href="README.html#gp_slen">gp_slen</a>
<li > <a href="README.html#gp_dimer">gp_dimer</a>
<li > <a href="README.html#gp_trimer">gp_trimer</a>
<li > <a href="README.html#gp_pattern">gp_pattern</a>
<li > <a href="README.html#gp_acc">gp_acc</a>
<li > <a href="README.html#gp_scan">gp_scan</a>
<li > <a href="README.html#gp_pars">gp_pars</a>
</dl>
<p>Here are the short program descriptions. Take a look at their <a href="README.html#manuals">respective
manual pages</a> or HTML documentation for more information.
<p><dl>
<li > <strong>gp_qs</strong> <a name="gp_qs"></a>
<p>quickly finds a sequence within a larger sequence and prints out the
	positions. Sometimes you just don't need BLAST -- for example, when you
	only want to know where exactly your primer binds in a given
	sequence. You can either type the sequence directly as a command
	line argument, like
<p><code>gp_qs ACTGACTG [sequence filename]</code>
<p>or give the name of a file containing it instead.
<p><li > <strong>gp_getseq</strong> <a name="gp_getseq"></a>
<p>quickly retrieves a sequence fragment. Usage is simple:
	<code>gp_getseq Position1 Position2 [sequence filename]</code>
	Position1 is the number of the first base to be
	retrieved, and Position2 is the last base to be retrieved.
	Note that if Position2 &lt; Position1, the retrieved sequence is
	complementary to the fragment Position1...Position2. 
<p><li > <strong>gp_gc</strong> <a name="gp_gc"></a>
<p>Prints out the GC content of a given sequence or sequences. Can
	also compute the mean and SE for a larger number of sequences.
<p><li > <strong>gp_map</strong> <a name="gp_map"></a>
<p><strong>gp_map</strong> automatically generates graphical gene maps. You provide a simple
	input -- a list of genes, their positions, maybe some parameters -- and the
	program outputs a PNG image showing the gene map. If the <strong>-H</strong> option is
	specified, an IMAP file is created in addition: this allows the creation of
	clickable graphical maps on the fly.
<p><li > <strong>gp_tm</strong> <a name="gp_tm"></a>
<p>Prints out the Tm (melting temperature) of a given sequence. Three algorithms can be used: the exact
	nearest neighbor algorithm, the approximate GC content algorithm, and the
	evil and false 4*[GC] + 2*[AT] algorithm.
<p><li > <strong>gp_matrix</strong> <a name="gp_matrix"></a>
<p><strong>Matrix</strong> is a program to look for promoters in a set of sequence
	files, using the Staden matrix (see: Hertz, G. and Stormo, G.D.
	1996. Escherichia coli promoter sequences: analysis and prediction.
	Meth. Enzym. 273). Basically, you have a matrix file containing
	scores and penalties for nucleotides at different positions in the
	supposed -35 and -10 boxes, as well as in the +1 region of a given
	sequence (see the file "matryca" in the data/ directory, which is
	the same as the E. coli matrix published in Hertz et al.). 
<p>The program loads sequences from the sequence file, and then scans
	them using all possible combinations of gap lengths between the +1,
	-10 and -35 boxes
	and at all possible positions in the sequence, so as to find the
	combination that gives the highest score for the sequence. It then
	prints a formatted output in the following form:
<p><code>#score sequence...[-35 core]...[-10 core]...[start]...</code>
<p>The '|' characters denote the boundaries of matrix'ed fragments.
<p>In the "data" directory you will find the original Staden <em>E.
	coli</em> matrix. The <strong>myco.mtx</strong> <em>Mycoplasma pneumoniae</em> matrix and the
	program have been described in Weiner, J. et al. 2000, "Transcription in
	<em>Mycoplasma pneumoniae</em>".
<p><li > <strong>gp_mkmtx</strong> <a name="gp_mkmtx"></a>
<p>creates nucleotide frequency matrices, such as those used by the
	<strong>gp_matrix</strong> program.
<p><li > <strong>gp_shift</strong> <a name="gp_shift"></a>
<p>sometimes you have a list of genes:
<p><pre>

		100000 101000 gene1
		200000 201000 gene2
		400000 391000 gene3
		...
</pre>

<p>...and would like to, for example, print out the promoter regions, that is,
	sequences from -100 to +10 relative to the 5'-end of the genes. <strong>gp_shift</strong>
	is useful for this.
<p><li > <strong>gp_randseq</strong> <a name="gp_randseq"></a>
<p>unless the option -r is set, it prints out random fragments from a
	sequence file. The default fragment length is 100, and you can change
	it with the option -l length. If you set -r, however, completely
	random sequences are generated. You can set their GC content
	with the option -g value. There is also an option -m, which stands
	for "Markov chains", but all it does is ensure that the
	probability of selecting a nucleotide depends on the
	previous nucleotide; these probabilities are also taken from a
	sequence file.
<p><li > <strong>gp_seq2prot</strong> <a name="gp_seq2prot"></a>
<p>Converts a nucleotide sequence to a protein sequence. The sequence is
	expected to start with a start codon: this is mandatory. A missing
	stop codon or a premature end of the input sequence (for example, in the
	middle of a codon) results only in a warning message. 
<p>You can provide your own codon tables; 
	for the format of the codon_file look at data/standard.cdn and
	data/myco.cdn. Basically, you do not need to provide the whole table;
	it is enough to point out the differences. To see a codon file,
	type <code>gp_seq2prot -p</code>.
<p><li > <strong>gp_findorf</strong> <a name="gp_findorf"></a>
<p>Prints out all ORFs that are contained in a sequence.
	<strong>gp_findorf</strong> always looks for the longest ORF within the given
	limit. See also the notes for <strong>gp_seq2prot</strong>.
<p><li > <strong>gp_cusage</strong> <a name="gp_cusage"></a>
<p>Prints out the codon usage of sequence(s). Same options as in the
	case of gp_seq2prot; actually -- this *is* nearly the same program. I
	just like to have them separate. 
<p><li > <strong>gp_slen</strong> <a name="gp_slen"></a>
<p>Sequence length. Sometimes useful. Can also compute the mean and SE of a set of
	sequences.
<p><li > <strong>gp_dimer</strong> <a name="gp_dimer"></a>
<p>records frequencies of nucleotide pairs: AA, AC, AG...TT. This is sometimes
	useful for characterizing a sequence. You can also record frequencies of
	nucleotide pairs <em>separated</em> by a given number of nucleotides, to check, for
	example, how often an 'A' comes five nucleotides downstream of a 'T'. Believe
	it or not, it <em>is</em> useful.
<p><li > <strong>gp_trimer</strong> <a name="gp_trimer"></a>
<p>record frequencies of nucleotide trimers: AAA, AAC, AAG...TTT.
<p><li > <strong>gp_pattern</strong> <a name="gp_pattern"></a>
<p>records frequencies of patterns of a given length. Note that the number of
	possible patterns increases exponentially with the pattern length: for a
	tetramer, for example, there are 4^4 = 256 possible patterns.
<p><li > <strong>gp_acc</strong> <a name="gp_acc"></a>
<p>this program can be used to convert a sequence into a set of so-called
	auto-cross-correlation coefficients which can be further analysed by, for
	example, principal component analysis (PCA). If you want to learn more about
	it, read Jonsson et al., 1991, "A multivariate representation and analysis of
	DNA sequence data".
<p><li > <strong>gp_scan</strong> <a name="gp_scan"></a>
<p><strong>gp_scan</strong> is used to further analyse the auto-cross-correlation terms to
	find out more about patterns or regularities in a
	sequence.
<p><li > <strong>gp_pars</strong> <a name="gp_pars"></a>
<p>This program shows that I'm hopeless and don't know anything about
	Un*x tools. All <strong>pars</strong> does is to change the "%0D%0A" string into
	a newline character, because I couldn't find a way around that
	using <strong>sed</strong>(1). 
<p></dl>
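<p>Several of the simpler quantities mentioned in the list above are easy to
state in code. The Python sketch below is an illustration written for this
document, not the GP implementation: the function names are invented, the Tm
function is only the crude 4*[GC] + 2*[AT] rule (not the nearest neighbor
method), and the meaning of <code>gap</code> (number of bases lying between
the two members of a pair) is an assumption about how "separated by a given
number of nucleotides" is counted.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def rule_of_thumb_tm(seq):
    """Crude melting temperature in degrees C: 4 per G/C plus 2 per A/T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 4 * gc + 2 * at


def dimer_counts(seq, gap=0):
    """Count nucleotide pairs whose members have gap bases between them."""
    seq = seq.upper()
    step = gap + 1  # distance between the two positions of a pair
    return Counter(seq[i] + seq[i + step] for i in range(len(seq) - step))


def possible_patterns(length):
    """All patterns of the given length over A, C, G, T (4**length of them)."""
    return ["".join(p) for p in product("ACGT", repeat=length)]
```

<p>Dividing each count by the total number of pairs turns the counts into
frequencies.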
<p><h2>THANKS</h2>
    
<p>Many thanks go to all good souls from comp.lang.c, whose advice was
	necessary to write all these programs, and to Hinrich W. H.
	G&ouml;hlmann and Steve Brewer for encouraging me in my work.
<p><h2>NOTE FROM AUTHOR</h2>
    
<p>I'm not a programmer, and <strong>GP</strong> is amateur work. Everything
	started because I found myself constantly writing small utilities
	which could do batch jobs for me, instead of using packages like
	<strong>DNA Star</strong>. A graphical user interface is OK, as long as you don't
	have to process, say, 677 sequences -- and 677 is a number which
	occurs often during my work, because it is the number of genes in
	the <em>Mycoplasma pneumoniae</em> genome I am working on. There are
	also many Unix tools, but they are either hard to use, or to
	install, or do not even compile on my Linux boxes. 
<p>Originally, the package name was <strong>Genpak</strong>, but there is some company named
	like that, so I changed most of the names to <strong>GP</strong>.
<p>The programs, I'm sure, have lots of bugs and poor code. For
	example, I never got the Makefile to work properly. So if you can
	help me make <strong>GP</strong> a little better, do so -- and mail me.
<p><a name="manuals"></a>
<p><h2>SEE ALSO</h2>
    
<a href="index.html">Genpak(1)</a> 
<a href="gp_acc.html">gp_acc(1)</a> 
<a href="gp_cusage.html">gp_cusage(1)</a> 
<a href="gp_digest.html">gp_digest(1)</a> 
<a href="gp_dimer.html">gp_dimer(1)</a> 
<a href="gp_findorf.html">gp_findorf(1)</a> 
<a href="gp_gc.html">gp_gc(1)</a> 
<a href="gp_getseq.html">gp_getseq(1)</a> 
<a href="gp_map.html">gp_map(1)</a> 
<a href="gp_matrix.html">gp_matrix(1)</a> 
<a href="gp_mkmtx.html">gp_mkmtx(1)</a> 
<a href="gp_pattern.html">gp_pattern(1)</a> 
<a href="gp_qs.html">gp_qs(1)</a> 
<a href="gp_randseq.html">gp_randseq(1)</a> 
<a href="gp_seq2prot.html">gp_seq2prot(1)</a> 
<a href="gp_slen.html">gp_slen(1)</a> 
<a href="gp_tm.html">gp_tm(1)</a> 
<a href="gp_trimer.html">gp_trimer(1)</a> 
<p><h2>DIAGNOSTICS</h2>
    
<p>All <strong>Genpak</strong> programs complain in situations where you would complain too,
like when they cannot find a sequence you gave them or the sequence is not
valid. 
<p>The <strong>Genpak</strong> programs do not write over existing files. I have found this
feature very useful :-)
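<p>For readers who want the same safety net in their own scripts, refusing to
overwrite an existing file takes one line in Python. This is a sketch of the
behaviour only, not of how the Genpak programs implement it;
<code>write_new_file</code> is a made-up name.

```python
def write_new_file(path, text):
    """Write text to path, refusing to replace a file that already exists."""
    # Mode "x" creates the file and raises FileExistsError if it is present.
    with open(path, "x") as handle:
        handle.write(text)
```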
<p><h2>BUGS</h2>
    
<p>I'm sure there are plenty left, so please mail me if you find them. I tried
to clean up every bug I could find.
<p><h2>AUTHOR</h2>
    
<p>January Weiner III
		<a href="mailto:january@bioinformatics.org">&lt;january@bioinformatics.org&gt;</a>    
</body>
</html>