Documentation read from 05/19/2023 16:24:13 version of /vol/kmer-server-prod/FIGdisk.server.rhel6/dist/releases/dev/common/lib/FigKernelPackages/ANNOserver.pm.

Annotation Support Server Object

This file contains the functions and utilities used by the Annotation Support Server (anno_server.cgi). The methods listed in the sections below are function calls made directly to the server. They all have a signature similar to the following.

    my $results = $annoObject->function_name($args);

where $annoObject is an object created by this module, $args is a parameter structure, and function_name is the Annotation Support Server function name. The output $results is a scalar, generally a hash reference, but sometimes a string or a list reference.

Constructor

Use

    my $annoObject = ANNOserver->new();

to create a new annotation support server function object. The function object is used to invoke the "Primary Methods" listed below. See SAPserver for more information on how to create this object and the options available.

Primary Methods

Functions

metabolic_reconstruction

    my $results = $annoObject->metabolic_reconstruction({
                                -roles => [ [$role1, $id1],
                                            [$role2, $id2],
                                            ... ] });

For each subsystem, this method finds the subsystem variant that contains a maximal subset of the roles in an incoming list, and outputs the ID of the variant and a list of the roles in it.

parameters

The single parameter is a reference to a hash containing the following key fields.

-roles: Reference to a list of 2-tuples, each consisting of the functional role followed by an arbitrary ID of the caller's choosing (e.g., a gene name, a sequence-project gene ID, a protein ID, or any other identifier).

For backward compatibility, instead of a hash reference you may specify a simple reference to a list of 2-tuples.

RETURN

Returns a list of tuples, each containing a variant name (subsystem name, colon, variant code), a role ID, and optionally a caller-provided ID associated with the role. The ability to specify arbitrary IDs to be associated with the roles is normally used to associate arbitrary gene IDs with the roles they are believed to implement.
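
The following is a minimal sketch of preparing the -roles list and grouping the output by variant. The helper names roles_from_assignments and group_by_variant are hypothetical and not part of ANNOserver, and the sketch assumes that the result arrives as a reference to a list of [variant-name, role, caller-id] tuples.

```perl
use strict;
use warnings;

# Build the -roles list from a hash mapping caller-chosen IDs to roles.
# Each entry becomes a 2-tuple [role, id], as the -roles parameter expects.
sub roles_from_assignments {
    my ($assignments) = @_;
    return [ map { [ $assignments->{$_}, $_ ] } sort keys %$assignments ];
}

# Group returned tuples by variant name (subsystem name, colon, variant code).
# Assumption: each tuple is [variant-name, role, caller-id], ID possibly absent.
sub group_by_variant {
    my ($tuples) = @_;
    my %byVariant;
    for my $tuple (@$tuples) {
        my ($variant, $role, $id) = @$tuple;
        push @{ $byVariant{$variant} }, [ $role, $id ];
    }
    return \%byVariant;
}

# Hypothetical usage (needs the ANNOserver module and server access):
#   my $annoObject = ANNOserver->new();
#   my $results = $annoObject->metabolic_reconstruction(
#       { -roles => roles_from_assignments(\%geneToRole) });
#   my $byVariant = group_by_variant($results);
```

Sorting the keys only makes the request order reproducible; nothing in the description above suggests that the server depends on it.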

find_special_proteins

    my $proteinList = $annoObject->find_special_proteins({
                                -contigs => [[$contigID1, $contigNote1, $contigDNA1],
                                             [$contigID2, $contigNote2, $contigDNA2],
                                             ...],
                                -is_init => [$codon1a, $codon1b, ...],
                                -is_alt => [$codon2a, $codon2b, ...],
                                -is_term => [$codon3a, $codon3b, ...],
                                -comment => $commentString,
                                -templates => [[$protID1, $protNote1, $protSeq1],
                                              [$protID2, $protNote2, $protSeq2],
                                              ...]
                        });

This method searches for special proteins in a list of contigs. The method is specifically designed to find selenoproteins and pyrrolysoproteins, but custom protein templates can be specified to allow searching for any type of protein family.

parameter

The parameter is a reference to a hash with the following permissible keys.

-contigs: Reference to a list of contigs. Each contig is represented by a 3-tuple consisting of a contig ID, a comment, and a DNA string.
-is_init (optional): Reference to a list of DNA codons to be used as start codons. The default is ATG and GTG.
-is_alt (optional): Reference to a list of DNA codons to be used as alternative start codons. These are employed if there are no results from the main start codons. The default is TTG.
-is_term (optional): Reference to a list of DNA codons to be used as stop codons. The default is TAA, TAG, and TGA.
-templates (optional): Description of the type of special protein being sought. If the value is pyrrolysoprotein, the method searches for pyrrolysoproteins. If it is selenoprotein, the method searches for selenoproteins. Otherwise, it should be a reference to a list of 3-tuples containing templates for the proteins in the desired family. Each 3-tuple must consist of an ID, a functional role description, and a protein sequence. The default is selenoprotein.
-comment (optional): A string that will be inserted as a comment in each element of the output list. The default is either pyrrolysoprotein or selenoprotein, depending on the template specification.

RETURN

Returns a reference to a list of hashes. Each hash contains the following keys.

location: A location string describing the contig, start, and end location of the protein found.
sequence: The protein sequence found.
reference_id: ID of the relevant template protein sequence.
reference_def: Functional role of the relevant template protein sequence.
comment: Comment from the input parameters.
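
As an illustration of consuming this output, the sketch below renders the returned list of hashes as FASTA text. The helper name special_proteins_to_fasta and the header layout are arbitrary choices for this example, not something defined by the server.

```perl
use strict;
use warnings;

# Render the list of hashes returned by find_special_proteins as FASTA text.
# Header layout (location, reference ID, comment in brackets) is our own choice.
sub special_proteins_to_fasta {
    my ($proteinList) = @_;
    my $fasta = '';
    for my $hit (@$proteinList) {
        $fasta .= sprintf(">%s %s [%s]\n%s\n",
                          @{$hit}{qw(location reference_id comment sequence)});
    }
    return $fasta;
}

# Hypothetical usage (needs the ANNOserver module and server access):
#   my $proteinList = $annoObject->find_special_proteins(
#       { -contigs => \@contigs, -templates => 'pyrrolysoprotein' });
#   print special_proteins_to_fasta($proteinList);
```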

assign_function_to_prot

    my $resultHandle = $annoObject->assign_function_to_prot($args);

For each incoming protein sequence, attempt to assign a function. There are two ways functions can get assigned. The first is based on kmers, and these are normally viewed as the most reliable (at least they give a consistent vocabulary!). If no kmer match is made, you can optionally try to make an assignment based on similarity computations.

The first attempt uses kmer technology. A pass through the sequence locates "signature kmers", and scores are computed. The scores are based on the number of non-overlapping hits, the number of overlapping hits, and the difference in counts between hits against the most probable function's kmer set and the next most probable function's kmer set. Basically, we compute all matching kmers, then split them into sets based on the predictions each would make (each kmer, in effect, predicts a single function). One threshold (scoreThreshold) is the difference between the total number of overlapping hits for the "best function" and the total number for the "next best". hitThreshold is the number of overlapping hits required for the "best function". Similarly, seqHitThreshold is the minimum number of non-overlapping hits.

To add complexity, these thresholds are based on counting 1 for each matched kmer. That is, the scoreThreshold is normally thought of as a difference in the number of occurrences. However, you may wish to "normalize" this threshold by dividing the count by the length of the sequence. This gives scores between 0 and 1, rather than between 0 and the length of the sequence (minus K, if you wish to be pedantic).
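
The threshold logic described above can be sketched as follows. This is only an illustration under stated assumptions, not the server's implementation: the function name passes_thresholds and the input layout (a hash mapping each predicted function to its overlapping and non-overlapping hit counts) are hypothetical, and normalization is assumed to divide the score difference by the sequence length.

```perl
use strict;
use warnings;

# Return the best function if it passes all three thresholds, else nothing.
# $hits maps each predicted function to { overlapping => n, nonOverlapping => m }.
sub passes_thresholds {
    my ($hits, $seqLength, $opts) = @_;
    # Rank functions by overlapping hit count (ties broken by name).
    my @ranked = sort { $hits->{$b}{overlapping} <=> $hits->{$a}{overlapping}
                        || $a cmp $b } keys %$hits;
    return unless @ranked;
    my $best = $hits->{ $ranked[0] };
    my $nextCount = @ranked > 1 ? $hits->{ $ranked[1] }{overlapping} : 0;
    # Score: overlapping hits for the best function minus those for the next best.
    my $score = $best->{overlapping} - $nextCount;
    $score /= $seqLength if $opts->{normalizeScores};
    return if $score < $opts->{scoreThreshold};
    return if $best->{overlapping} < $opts->{hitThreshold};
    return if $best->{nonOverlapping} < $opts->{seqHitThreshold};
    return $ranked[0];
}
```

With normalization switched on, scoreThreshold has to be given on the 0 to 1 scale, which is why the same counts can pass or fail depending on that option.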

args

Reference to a hash containing the parameters. The allowable parameter fields are as follows.

-input: Either (1) an open input handle to a file containing the proteins in FASTA format, or (2) a reference to a list of sequence data entries. Each entry is a triple of strings [sequence-id, comment, protein-sequence-data].
-kmer: Specify the kmer size to use for analysis (valid sizes are 7 - 12).
-assignToAll: If TRUE, then if the standard matching algorithm fails to assign a protein, a similarity-based assignment algorithm will be used instead.
-scoreThreshold N: Require a Kmer score of at least N for a Kmer match to succeed.
-hitThreshold N: Require at least N (possibly overlapping) Kmer hits for a Kmer match to succeed.
-seqHitThreshold N: Require at least N (non-overlapping) Kmer hits for a Kmer match to succeed.
-normalizeScores 0|1: Normalize the scores to the size of the protein.
-detailed 0|1: If true, return a detailed accounting of the kmers hit for each protein.

RETURN

Returns a Result Handle. Call get_next on the result handle to get back a data item. Each item sent back by get_next is a 7-tuple containing the results. Each tuple is of the form

    [ sequence-id, assigned-function, genome-set-name, score, non-overlapping hit-count, overlapping hit-count, detailed-hits]

where detailed-hits is undef unless the -detailed option was used.

If details were requested, the detailed-hit list is a list of tuples, one for each kmer hit. These tuples have the form

    [ offset, oligo, functional-role, genome-set-name]
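
A minimal sketch of draining the result handle is shown below. The helper name collect_assignments is hypothetical. The sketch assumes that get_next returns a false value once the results are exhausted and that an unassigned protein comes back with an undefined function; neither is confirmed by the text above.

```perl
use strict;
use warnings;

# Read every 7-tuple from the result handle and keep the assignments whose
# score is at least $minScore.
sub collect_assignments {
    my ($resultHandle, $minScore) = @_;
    my @kept;
    # Assumption: get_next returns a false value when no items remain.
    while (my $item = $resultHandle->get_next) {
        my ($id, $function, $otu, $score, $nonOverlap, $overlap, $details) = @$item;
        # Assumption: proteins without an assignment have an undefined function.
        next unless defined $function && $score >= $minScore;
        push @kept, { id => $id, function => $function,
                      otu => $otu, score => $score };
    }
    return \@kept;
}

# Hypothetical usage (needs the ANNOserver module and server access):
#   my $resultHandle = $annoObject->assign_function_to_prot(
#       { -input => \@proteins, -kmer => 8 });
#   my $assignments = collect_assignments($resultHandle, 5);
```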

call_genes

    my $result = $annoObject->call_genes($args);

Call the protein-encoding genes for the specified DNA sequences.

args

Reference to a hash containing the parameters. The allowable parameter fields are as follows.

-input: The DNA sequences to be analyzed. This may take one of two forms: (1) a file handle that is open for reading from a file of DNA sequences in FASTA format, or (2) a reference to a list of DNA data entries. Each entry is a triple of strings [sequence-id, comment, dna-sequence-data].
-trainingLocations (optional): Reference to a hash mapping gene IDs to location strings. The location strings in this case are of the form contig_start_end. The locations indicated should be coding regions in the incoming sequences to be analyzed (or in the training contigs if a -trainingContigs parameter is specified). If this parameter is omitted, then the default GLIMMER training algorithm will be used.
-trainingContigs (optional): The contigs in which the -trainingLocations can be found. This may take one of two forms: (1) a file handle that is open for reading from a file of DNA sequences in FASTA format, or (2) a reference to a list of contig data entries. Each entry is a triple of strings [contig-id, comment, contig-sequence-data].
-minContigLen (optional): Shortest-length contig considered to be valid. This is used to prevent attempting to call genes in contigs too short to have any complete ones. The default is 2000.
-geneticCode (optional): The numeric code for the mapping from DNA to amino acids. The default is 11, which is the standard mapping and should be used in almost all cases. A complete list of mapping codes can be found at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi.

RETURN

Returns a 2-tuple consisting of 1) a string containing what would normally be the contents of an entire FASTA file for all the proteins found followed by 2) a reference to a list of genes found. Each gene found will be represented by a 4-tuple containing an ID for the gene, the ID of the contig containing it, the starting offset, and the ending offset.
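
The sketch below shows one way to unpack this result. The helper name summarize_genes is hypothetical. It assumes the 2-tuple arrives as a list reference, and that a start offset greater than the end offset marks a gene on the reverse strand, which is a common convention but is not stated above.

```perl
use strict;
use warnings;

# Turn the gene list from call_genes into per-gene summaries.
sub summarize_genes {
    my ($result) = @_;
    # Assumption: $result is a reference to [fasta-string, gene-list-ref].
    my (undef, $genes) = @$result;
    my @summary;
    for my $gene (@$genes) {
        my ($id, $contig, $start, $end) = @$gene;
        # Assumption: start > end means the gene lies on the reverse strand.
        my $strand = $start <= $end ? '+' : '-';
        push @summary, { id => $id, contig => $contig, strand => $strand,
                         length => abs($end - $start) + 1 };
    }
    return \@summary;
}

# Hypothetical usage (needs the ANNOserver module and server access):
#   my $result = $annoObject->call_genes({ -input => \@contigs });
#   my $summary = summarize_genes($result);
```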

find_rnas

    my $document = $annoObject->find_rnas($args);

Call the RNAs for the specified DNA sequences.

args

Reference to a hash containing the parameters. The allowable parameter fields are as follows.

-input: Open input handle to a file containing the DNA sequences in FASTA format.
-genus: Common name of the genus for this DNA.
-species: Common name of the species for this DNA.
-domain: Domain of this DNA. The default is Bacteria.
-rnas: Type of RNA desired. Allowed values are all, or any combination of tRNA, SSU, LSU, and 5S joined by commas. The default is all.

RETURN

Returns a 2-tuple consisting of 1) a string containing what would normally be the contents of an entire FASTA file for all the RNA genes found followed by 2) a reference to a list of RNA genes found. Each gene found will be represented by a 5-tuple containing an ID for the gene, the ID of the contig containing it, the starting offset, the ending offset, and the name of the RNA found.
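
As a sketch, the -rnas value can be assembled and checked on the client side, and the returned gene list tallied by RNA name. The helper names rnas_option and count_rnas_by_name are hypothetical, and the 2-tuple is assumed to arrive as a list reference.

```perl
use strict;
use warnings;

# RNA types listed as allowed values for the -rnas parameter.
my %ALLOWED_RNA = map { $_ => 1 } qw(tRNA SSU LSU 5S);

# Build the -rnas value: 'all' when no types are given, otherwise the
# requested types joined by commas. Dies on a type not listed above.
sub rnas_option {
    my (@types) = @_;
    return 'all' unless @types;
    my @bad = grep { !$ALLOWED_RNA{$_} } @types;
    die "Unsupported RNA type(s): @bad\n" if @bad;
    return join(',', @types);
}

# Count the RNA genes in a find_rnas result by RNA name (fifth tuple element).
sub count_rnas_by_name {
    my ($document) = @_;
    # Assumption: $document is a reference to [fasta-string, rna-gene-list-ref].
    my (undef, $rnaGenes) = @$document;
    my %count;
    $count{ $_->[4] }++ for @$rnaGenes;
    return \%count;
}
```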

assign_functions_to_dna

    my $result = $annoObject->assign_functions_to_dna($args);

Analyze DNA sequences and output regions that probably belong to FIGfams. The selected regions will be high-probability candidates for protein encoding sequences.

args

Reference to a hash containing the parameters. The allowable parameter fields are as follows.

-input

The sequences to be analyzed. This may take one of two forms:

1. A file handle that is open for reading from a file of DNA sequences in FASTA format, or

2. A reference to a list of sequence data entries. Each entry is a triple of strings [sequence-id, comment, dna-sequence-data].

-kmer

Specify the kmer size to use for analysis (valid sizes are 7 - 12).

-minHits

A number from 1 to 10, indicating the minimum number of matches required to consider a protein as a candidate for assignment to a FIGfam. A higher value indicates a more reliable matching algorithm; the default is 3.

-maxGap

When looking for a match, if two sequence elements match and are closer than this distance, then they will be considered part of a single match. Otherwise, the match will be split. The default is 600.

RETURN

Returns a Result Handle. Call get_next on the result handle to get back a data item. Each item sent back by the result handle is a 2-tuple containing an incoming contig ID and a reference to a list of hit regions. Each hit region is a 5-tuple consisting of the number of matches to the function, the start location, the stop location, the proposed function, and the name of the Genome Set (OTU) from which the gene is likely to have originated.
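
A minimal sketch of flattening these results into one record per hit region follows. The helper name collect_hit_regions is hypothetical, and the loop assumes that get_next returns a false value once the results are exhausted.

```perl
use strict;
use warnings;

# Flatten the per-contig results into a list of hit-region records, keeping
# only regions with at least $minMatches matches.
sub collect_hit_regions {
    my ($resultHandle, $minMatches) = @_;
    my @regions;
    # Assumption: get_next returns a false value when no items remain.
    while (my $item = $resultHandle->get_next) {
        my ($contigID, $hits) = @$item;
        for my $hit (@$hits) {
            my ($matches, $start, $stop, $function, $otu) = @$hit;
            next if $matches < $minMatches;
            push @regions, { contig => $contigID, matches => $matches,
                             start => $start, stop => $stop,
                             function => $function, otu => $otu };
        }
    }
    return \@regions;
}

# Hypothetical usage (needs the ANNOserver module and server access):
#   my $result = $annoObject->assign_functions_to_dna(
#       { -input => \@contigs, -kmer => 8, -minHits => 3, -maxGap => 600 });
#   my $regions = collect_hit_regions($result, 5);
```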