Documentation read from 05/19/2023 16:24:13 version of /vol/kmer-server-prod/FIGdisk.server.rhel6/dist/releases/dev/common/lib/FigKernelPackages/SeedUtils.pm.

SEED Utility Methods

Introduction

This is a simple utility package that performs functions useful for bioinformatics, but that do not require access to the databases.

Several methods deal with gene locations. Location information from the Sapling server is expressed as location strings. A location string consists of a contig ID (which includes the genome ID), an underscore, a starting location, a strand indicator (+ or -), and a length. The first location on the contig is 1.

For example, 100226.1:NC_003888_3766170+612 indicates contig NC_003888 in genome 100226.1 (Streptomyces coelicolor A3(2)) beginning at location 3766170 and proceeding forward on the plus strand for 612 bases.
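
As a minimal sketch of the format described above (not the module's own parser), a location string of this kind can be split with one regular expression. The helper name split_sapling_location is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split "contig_start(+|-)length" into its parts.
# The greedy (.+) keeps any underscores that belong to the contig ID.
sub split_sapling_location {
    my ($loc) = @_;
    if ($loc =~ /^(.+)_(\d+)([+-])(\d+)$/) {
        return ($1, $2, $3, $4);
    }
    return;    # empty list for a malformed string
}

my ($contig, $start, $strand, $len) =
    split_sapling_location('100226.1:NC_003888_3766170+612');
print "$contig starts at $start on the $strand strand for $len bases\n";
```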

Public Methods

abbrev

    my $abbrev = SeedUtils::abbrev($genome_name);

Return an abbreviation of the specified genome name. This method is used to create a reasonably indicative genome name that fits in 10 characters.

genome_name: Genome name to abbreviate.
RETURN: Returns a shortened version of the genome name that is at most 10 characters long.

fields_of

    my @fields = SeedUtils::fields_of($ih);

Extract the fields from a tab-delimited input line. The input line is read from an open file handle.

ih: Open input file handle.
RETURN: Returns a list consisting of the tab-delimited fields found in the record read, or an empty list if we are at end-of-file.

probably_active

    my $activeFlag = SeedUtils::probably_active($vc);

Return TRUE if the specified variant code is most likely for an active variant, else FALSE.

abbrev_set

    my $abbrevH = SeedUtils::abbrev_set($genome_names);

Takes a reference to a list of genome names and returns a hash mapping the names to unique abbreviations. Each abbreviation will be at most 10 characters long.

genome_names: Reference to a list of genome names.
RETURN: Returns a hash mapping full names to unique abbreviations.

bbh_data

    my $bbhList = FIGRules::bbh_data($peg, $cutoff);

Return a list of the bi-directional best hits relevant to the specified PEG.

peg: ID of the feature whose bidirectional best hits are desired.
cutoff: Similarity cutoff. If omitted, 1e-10 is used.
RETURN: Returns a reference to a list of 3-tuples. The first element of each tuple is the best-hit PEG; the second element is the score. A lower score indicates a better match. The third element is the normalized bit score for the pair; it is normalized to the length of the protein.

between

    my $flag = between($x, $y, $z);

Determine whether or not $y is between $x and $z.

x: First edge number.
y: Number to examine.
z: Second edge number.
RETURN: Returns TRUE if the number $y is between the numbers $x and $z. The check is inclusive (that is, if $y is equal to $x or $z the function returns TRUE), and the order of $x and $z does not matter. If $x is lower than $z, then the return is TRUE if $x <= $y <= $z. If $z is lower, then the return is TRUE if $x >= $y >= $z.
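
The inclusive, order-independent check described here can be sketched as follows. This is an illustration only; the name is_between is hypothetical and the code is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical re-implementation of the inclusive, order-independent check.
sub is_between {
    my ($x, $y, $z) = @_;
    my ($lo, $hi) = $x <= $z ? ($x, $z) : ($z, $x);
    return ($lo <= $y && $y <= $hi) ? 1 : 0;
}

print is_between(1, 5, 10), "\n";     # 1: inside the range
print is_between(10, 5, 1), "\n";     # 1: edge order does not matter
print is_between(1, 10, 10), "\n";    # 1: endpoints are included
print is_between(1, 11, 10), "\n";    # 0: outside the range
```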

boundaries_of

    my ($contig, $min, $max, $dir) = boundaries_of($locs);

Return the boundaries of a set of locations. The contig, the leftmost location, and the rightmost location will be returned to the caller. If more than one contig is represented, the method will return an undefined value for the contig (indicating failure).

locs: Reference to a list of location strings. A location string contains a contig ID, an underscore (_), a starting offset, a strand identifier (+ or -), and a length (e.g. 360108.3:NC_10023P_1000+2000 begins at offset 1000 of contig 360108.3:NC_10023P and covers 2000 base pairs on the + strand).
RETURN: Returns a 4-element list. The first element is the contig ID from all the locations, the second is the offset of the leftmost base pair represented in the locations, the third is the offset of the rightmost base pair represented in the locations, and the fourth is the dominant strand.

boundary_loc

    my $singleLoc = SeedUtils::boundary_loc($locations);

Return a single location string (see "Location Strings") that covers the incoming list of locations. NOTE that if the locations listed span more than one contig, this method may return an unexpected result.

This method is useful for converting the output of "fid_locations" in SAP to location strings.

locations: A set of location strings formatted as a comma-separated list or as a reference to a list of location strings.
RETURN: Returns a single location string that covers as best as possible the list of incoming locations.

by_fig_id

    my @sorted_by_fig_id = sort { by_fig_id($a,$b) } @fig_ids;

Compare two feature IDs.

This function is designed to assist in sorting features by ID. The sort is by genome ID followed by feature type and then feature number.

a: First feature ID.
b: Second feature ID.
RETURN: Returns a negative number if the first parameter is smaller, zero if both parameters are equal, and a positive number if the first parameter is greater.
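
A comparator with the ordering described above might look like the sketch below. The name compare_fig_ids is hypothetical, and the sketch assumes IDs shaped like fig|<taxon>.<version>.<type>.<number> with the genome ID compared numerically; the module's own rules may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical comparator: genome ID, then feature type, then feature number.
# Anything that does not look like a FIG ID falls back to a string compare.
sub compare_fig_ids {
    my ($x, $y) = @_;
    my @p = $x =~ /^fig\|(\d+)\.(\d+)\.([^.]+)\.(\d+)$/;
    my @q = $y =~ /^fig\|(\d+)\.(\d+)\.([^.]+)\.(\d+)$/;
    return $x cmp $y unless @p && @q;
    return $p[0] <=> $q[0] || $p[1] <=> $q[1]
        || $p[2] cmp $q[2] || $p[3] <=> $q[3];
}

my @ids = ('fig|100226.1.peg.10', 'fig|100226.1.peg.9',
           'fig|83333.1.rna.2',   'fig|100226.1.rna.1');
my @sorted = sort { compare_fig_ids($a, $b) } @ids;
print join("\n", @sorted), "\n";
```

Comparing the feature number numerically is what places peg.9 ahead of peg.10, which a plain string sort would not do.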

create_fasta_record

    my $fastaString = create_fasta_record($id, $comment, $sequence, $stripped);

Create a FASTA record from the specified DNA or protein sequence. The sequence will be split into 60-character lines, and the record will include an identifier line.

id: ID for the sequence, to be placed at the beginning of the identifier line.
comment (optional): Comment text to place after the ID on the identifier line. If this parameter is empty, undefined, or 0, no comment will be placed.
sequence: Sequence of letters to form into FASTA. For purposes of convenience, whitespace characters in the sequence will be removed automatically.
stripped (optional): If TRUE, then the sequence will be returned unmodified instead of converted to FASTA format. The default is FALSE.
RETURN: Returns the desired sequence in FASTA format.
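
The record layout described above (an identifier line followed by 60-character sequence lines) can be sketched as follows. The function name format_fasta is hypothetical, and the sketch omits the stripped option.

```perl
use strict;
use warnings;

# Hypothetical formatter: ">id comment" header, then 60-character lines.
sub format_fasta {
    my ($id, $comment, $sequence) = @_;
    $sequence =~ s/\s+//g;                       # drop embedded whitespace
    my $header = $comment ? ">$id $comment" : ">$id";
    my @chunks = unpack('(a60)*', $sequence);    # 60 letters per line
    return join("\n", $header, @chunks) . "\n";
}

print format_fasta('seq1', 'demo protein', 'MKT' x 25);
```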

display_id_and_seq

    SeedUtils::display_id_and_seq($id_and_comment, $seqP, $fh);

Display a FASTA ID and sequence to the specified open file. This method is designed to work well with "read_fasta_sequence" and "rev_comp", because it takes as input a string reference rather than a string. If the file handle is omitted it defaults to STDOUT.

The output is formatted into a FASTA record. The first line of the output is preceded by a > symbol, and the sequence is split into 60-character chunks displayed one per line. Thus, this method can be used to produce FASTA files from data gathered by the rest of the system.

id_and_comment: The sequence ID and (optionally) the comment from the sequence's FASTA record. The ID
seqP: Reference to a string containing the sequence. The sequence is automatically formatted into 60-character chunks displayed one per line.
fh: Open file handle to which the ID and sequence should be output. If omitted, \*STDOUT is assumed.

display_seq

    SeedUtils::display_seq(\$seqP, $fh);

Display a fasta sequence to the specified open file. If the file handle is omitted it defaults to STDOUT.

The sequence is split into 60-character chunks displayed one per line for readability.

seqP: Reference to a string containing the sequence.
fh: Open file handle to which the sequence should be output. If omitted, STDOUT is assumed.

extract_seq

    my $seq = SeedUtils::extract_seq($contigs, $loc);

This is just a little utility routine that I have found convenient. It assumes that $contigs is a hash that contains IDs as keys and sequences as values. $loc must be of the form

       Contig_Beg_End

where Contig is the ID of one of the sequences; Beg and End give the coordinates of the sought subsequence. If Beg > End, it is assumed that you want the reverse complement of the subsequence. This routine plucks out the subsequence for you.
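
A minimal sketch of this behavior, assuming 1-based coordinates and plain A/C/G/T sequences, is shown below. The name subsequence is hypothetical and the code is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical sketch of pulling a subsequence out of a hash of contigs.
# Coordinates are assumed 1-based; Beg > End asks for the reverse complement.
sub subsequence {
    my ($contigs, $loc) = @_;
    return unless $loc =~ /^(\S+)_(\d+)_(\d+)$/;
    my ($contig, $beg, $end) = ($1, $2, $3);
    my $dna = $contigs->{$contig};
    return unless defined $dna;
    my ($lo, $hi) = $beg <= $end ? ($beg, $end) : ($end, $beg);
    my $seq = substr($dna, $lo - 1, $hi - $lo + 1);
    if ($beg > $end) {
        $seq = reverse $seq;               # scalar context: reverse the string
        $seq =~ tr/ACGTacgt/TGCAtgca/;     # then complement each base
    }
    return $seq;
}

my %contigs = (c1 => 'AAACCCGGGTTT');
print subsequence(\%contigs, 'c1_2_5'), "\n";    # AACC
print subsequence(\%contigs, 'c1_5_2'), "\n";    # GGTT
```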

file_read

    my $text = SeedUtils::file_read($fileName);

    my @lines = SeedUtils::file_read($fileName);

Read an entire file into memory. In a scalar context, the file is returned as a single text string with line delimiters included. In a list context, the file is returned as a list of lines, each line terminated by a line delimiter. (For a method that automatically strips the line delimiters, use StringUtils::GetFile.)

fileName: Fully-qualified name of the file to read.
RETURN: In a list context, returns a list of the file lines. In a scalar context, returns a string containing all the lines of the file with delimiters included.
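
Context-sensitive returns of this kind are usually built on Perl's wantarray. The sketch below uses a hypothetical slurp function and a temporary file to show the pattern; it is not the module's code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical reader that adapts to the calling context via wantarray.
sub slurp {
    my ($fileName) = @_;
    open(my $ih, '<', $fileName) or die "Cannot open $fileName: $!";
    my @lines = <$ih>;
    close $ih;
    return wantarray ? @lines : join('', @lines);
}

# Create a small scratch file to read back.
my ($oh, $tmpName) = tempfile(UNLINK => 1);
print $oh "first\nsecond\n";
close $oh;

my $text  = slurp($tmpName);    # scalar context: one string
my @lines = slurp($tmpName);    # list context: one entry per line
print length($text), " characters in ", scalar(@lines), " lines\n";
```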

file_head

    my $text = SeedUtils::file_head($fileName, $count);

    my @lines = SeedUtils::file_head($fileName, $count);

Read a portion of a file into memory. In a scalar context, the file portion is returned as a single text string with line delimiters included. In a list context, the file portion is returned as a list of lines, each line terminated by a line delimiter.

fileName: Fully-qualified name of the file to read.
count (optional): Number of lines to read from the file. If omitted, 1 is assumed. If the non-numeric string * is specified, the entire file will be read.
RETURN: In a list context, returns a list of the desired file lines. In a scalar context, returns a string containing the desired lines of the file with delimiters included.

flatten_dumper

    SeedUtils::flatten_dumper( $perl_ref_or_object_1, ... );

Takes a list of perl references or objects, and "flattens" their Data::Dumper() output so that it can be printed on a single line.

genome_of

    my $genomeID = genome_of($fid);

Return the Genome ID embedded in the specified FIG feature ID.

fid: Feature ID of interest.
RETURN: Returns the genome ID in the middle portion of the FIG feature ID. If the feature ID is invalid, this method returns an undefined value.
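
For illustration, a FIG feature ID such as fig|100226.1.peg.3361 can be taken apart with a single pattern. The helper split_fig_id is hypothetical and assumes the fig|<genome>.<type>.<number> shape.

```perl
use strict;
use warnings;

# Hypothetical parser for IDs shaped like fig|<genome>.<type>.<number>.
# Returns an empty list when the ID does not match that shape.
sub split_fig_id {
    my ($fid) = @_;
    return $fid =~ /^fig\|(\d+\.\d+)\.([^.]+)\.(\d+)$/ ? ($1, $2, $3) : ();
}

my ($genome, $type, $num) = split_fig_id('fig|100226.1.peg.3361');
print "genome=$genome type=$type number=$num\n";
```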

hypo

    my $flag = hypo($func);

Return TRUE if the specified functional role is hypothetical, else FALSE. Hypothetical functional roles are identified by key words in the text, such as hypothesis, predicted, or glimmer (among others).

func: Text of the functional role whose nature is to be determined.
RETURN: Returns TRUE if the role is hypothetical, else FALSE.

id_url

    my $url = id_url($id);

Return the URL for a specified external gene ID.

id: ID of the gene whose URL is desired.
RETURN: Returns a URL for displaying information about the specified gene. The structure of the ID is used to determine the web site to which the gene belongs.

location_cmp

    my $cmp = location_cmp($loc1, $loc2);

Compare two location strings (see "Location Strings").

The ordering principle for locations is that they are sorted first by contig ID, then by leftmost position, in reverse order by length, and then by direction. The effect is that within a contig, the locations are ordered first and foremost in the way they would appear when displayed in a picture of the contig and second in such a way that embedded locations come after the locations in which they are embedded. In the case of two locations that represent the exact same base pairs, the forward (+) location is arbitrarily placed first.

loc1: First location string to compare.
loc2: Second location string to compare.
RETURN: Returns a negative number if the loc1 location sorts first, a positive number if the loc2 location sorts first, and zero if the two locations are the same.
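
The ordering rules above can be sketched as a comparator. This sketch assumes contig_start+length style strings and assumes that a minus-strand location starts at its rightmost base; the names location_key and compare_locations are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical key: (contig, leftmost position, length, strand rank).
# Assumes a minus-strand location starts at its rightmost base.
sub location_key {
    my ($loc) = @_;
    my ($contig, $start, $strand, $len) = $loc =~ /^(.+)_(\d+)([+-])(\d+)$/
        or die "Invalid location string: $loc";
    my $left = $strand eq '+' ? $start : $start - $len + 1;
    return ($contig, $left, $len, $strand eq '+' ? 0 : 1);
}

# Contig ID, then leftmost position, then longer first, then + before -.
sub compare_locations {
    my ($loc1, $loc2) = @_;
    my @p = location_key($loc1);
    my @q = location_key($loc2);
    return $p[0] cmp $q[0] || $p[1] <=> $q[1]
        || $q[2] <=> $p[2] || $p[3] <=> $q[3];
}

my @locs = ('c1_100+50', 'c1_299-200', 'c1_50+10', 'c1_100+200');
print join("\n", sort { compare_locations($a, $b) } @locs), "\n";
```

In this example the two 200-base locations cover the same base pairs, so the forward one sorts first, and the 50-base location embedded in them sorts last.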

location_string

    my $locString = location_string($contig, $beg, $end);

Form a location string for the specified contig that starts at the indicated begin location and stops at the indicated end location. A single-base location will automatically be put on the forward strand.

contig: ID of the contig to contain this location.
beg: Beginning offset of the location.
end: Ending offset of the location.
RETURN: Returns a location string (see "Location Strings") for the specified location.

max

    my $max = max(@nums);

Return the maximum number from all the values in the specified list.

nums: List of numbers to examine.
RETURN: Returns the maximum numeric value from the specified parameters, or an undefined value if an empty list is passed in.

min

    my $min = min(@nums);

Return the minimum number from all the values in the specified list.

nums: List of numbers to examine.
RETURN: Returns the minimum numeric value from the specified parameters, or an undefined value if an empty list is passed in.

parse_fasta_record

    my ($id, $comment, $seq) = parse_fasta_record($string);

Extract the ID, comment, and sequence from a single FASTA record. For backward compatibility, instead of a FASTA record the ID and sequence can be specified separated by a comma. In this case, the returned comment will be empty.

string: A single FASTA record, or an ID and sequence separated by a single comma, an unadorned sequence, a 2-element list consisting of an ID and a sequence, or a 3-element list consisting of an ID, a comment, and a sequence.
RETURN: Returns a three-element list consisting of the incoming ID, the associated comment, and the specified DNA or protein sequence. If the incoming string is invalid, all three list elements will come back undefined. If no ID is specified, an MD5 will be provided.

parse_location

    my ($contig, $begin, $end, $strand) = parse_location($locString);

Return the contig ID, start offset, end offset, and strand for a specified location string (see "Location Strings").

locString: Location string to parse.
RETURN: Returns a four-element list containing the contig ID from the location string, the starting offset of the location, the ending offset, and the strand. If the location string is not valid, the values returned will be undef.

rev_comp

    my $revcmp = rev_comp($dna);

    rev_comp(\$dna);

Return the reverse complement of a DNA string.

dna: Either a DNA string, or a reference to a DNA string.
RETURN: If the input is a DNA string, returns the reverse complement. If the input is a reference to a DNA string, the string itself is reverse complemented.
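
The two calling styles can be sketched as below. The function name reverse_complement is hypothetical, and the sketch handles only A, C, G, and T; how the module treats ambiguity codes is not covered here.

```perl
use strict;
use warnings;

# Hypothetical sketch: accept a DNA string or a reference to one.
sub reverse_complement {
    my ($dna) = @_;
    if (ref($dna) eq 'SCALAR') {
        $$dna = reverse $$dna;             # modify the caller's string in place
        $$dna =~ tr/ACGTacgt/TGCAtgca/;
        return;
    }
    my $rc = reverse $dna;                 # scalar context: reverse the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

print reverse_complement('ATGC'), "\n";    # GCAT
my $dna = 'AAACCC';
reverse_complement(\$dna);
print "$dna\n";                            # GGGTTT
```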

roles_for_loading

    my ($roles, $errors) = SeedUtils::roles_for_loading($function);

Split a functional assignment into roles. If the functional assignment seems suspicious, it will be flagged as invalid. A count will be returned of the number of roles that are rejected because they are too long.

This method should not be used for Shrub functions.

function: Functional assignment to parse.
RETURN: Returns a two-element list. The first is either a reference to a list of roles, or an undefined value (indicating a suspicious functional assignment). The second is the number of roles that are rejected for being too long.

roles_of_function

    my @roles = roles_of_function($assignment);

Return a list of the functional roles in the specified assignment string. A single assignment may contain multiple roles as well as comments; this method separates them out.

This method should not be used for Shrub functions.

assignment: Functional assignment to parse for roles.
RETURN: Returns a list of the individual roles in the assignment.

sims

    my @sims = sims($id, $maxN, $maxP, 'fig');

    my @sims = sims($id, $maxN, $maxP, 'all');

Retrieve similarities from the network similarity server. The similarity retrieval is performed using an HTTP user agent that returns similarity data in multiple chunks. An anonymous subroutine is passed to the user agent that parses and reformats the chunks as they come in. The similarities themselves are returned as Sim objects. Sim objects are actually list references with 15 elements. The Sim object methods allow access to the elements by name.

Similarities can be either raw or expanded. The raw similarities are basic hits between features with similar DNA. Expanding a raw similarity drags in any features considered substantially identical. So, for example, if features A1, A2, and A3 are all substantially identical to A, then a raw similarity [C,A] would be expanded to [C,A] [C,A1] [C,A2] [C,A3].

id: ID of the feature whose similarities are desired, or reference to a list of the IDs of the features whose similarities are desired.
maxN (optional): Maximum number of similarities to return for each incoming feature.
maxP (optional): The maximum allowable similarity score.
select (optional): Selection criterion: raw means only raw similarities are returned; fig means only similarities to FIG features are returned; all means all expanded similarities are returned; and figx means similarities are expanded until the number of FIG features equals the maximum.
max_expand (optional): The maximum number of features to expand.
filters (optional): Reference to a hash containing filter information, or a subroutine that can be used to filter the sims.
RETURN: Returns a list of Sim objects.

genetic_code

    my $code = genetic_code();

Return a hash containing the translation of nucleotide triples to proteins. Methods such as "translate" can take a translation scheme as a parameter. This method returns the translation scheme for genetic code 11 or 4, and an error for all other codes. The scheme is implemented as a reference to a hash that contains nucleotide triplets as keys and has protein letters as values.

standard_genetic_code

    my $code = standard_genetic_code();

Return a hash containing the standard translation of nucleotide triples to proteins. Methods such as "translate" can take a translation scheme as a parameter. This method returns the default translation scheme. The scheme is implemented as a reference to a hash that contains nucleotide triplets as keys and has protein letters as values.

strand_of

    my $plusOrMinus = strand_of($loc);

Return the strand (+ or -) from the specified location string.

loc: Location string to parse (see "Location Strings" in SAP).
RETURN: Returns + if the location is on the forward strand, else -.

strip_ec

    my $role = strip_ec($rawRole);

Strip the EC number (if any) from the specified role or functional assignment.

rawRole: Role or functional assignment from which the EC numbers are to be stripped.
RETURN: Returns the incoming string with any EC numbers removed. The EC numbers must be formatted in the standard format used by the SEED (with the EC prefix and surrounding parentheses).
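
A substitution along the following lines removes parenthesized EC numbers of the form (EC 1.1.1.1). The name remove_ec and the exact pattern are assumptions for illustration; in particular, this sketch also drops the whitespace in front of each EC number, which the module may or may not do.

```perl
use strict;
use warnings;

# Hypothetical sketch: drop "(EC n.n.n.n)" groups, allowing "-" placeholders.
sub remove_ec {
    my ($role) = @_;
    $role =~ s/\s*\(EC\s+[0-9-]+(?:\.[0-9-]+)*\)//g;
    return $role;
}

print remove_ec('Alcohol dehydrogenase (EC 1.1.1.1)'), "\n";
```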

translate

    my $aa_seq = translate($dna_seq, $code, $fix_start);

    my $aa_seq = translate(\$dna_seq, $offset, $code, $fix_start);

Translate a DNA sequence to a protein sequence using the specified genetic code. If $fix_start is TRUE, an initial TTG or GTG codon is translated to M. (In the standard genetic code, these two codons normally translate to L and V, respectively.)

dna_seq: DNA sequence to translate. Note that unknown nucleotides will result in a translation to X. May also be a reference to the sequence.
offset: Index of the first position at which to start translation (only if the dna sequence is a string reference).
code: Reference to a hash specifying the translation code. The hash is keyed by nucleotide triples, and the value for each key is the corresponding protein letter. If this parameter is omitted, the "standard_genetic_code" will be used.
fix_start: TRUE if the first triple is to get special treatment, else FALSE. If TRUE, then a value of TTG or GTG in the first position will be translated to M instead of the value specified in the translation code.
RETURN: Returns a string resulting from translating each nucleotide triple into a protein letter.
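
The translation scheme and the fix_start rule can be illustrated with the self-contained sketch below. The codon table is built locally for the standard genetic code, and translate_dna is a hypothetical stand-in rather than the module's function; it accepts only a plain string, not a reference with an offset.

```perl
use strict;
use warnings;

# Build a codon table for the standard genetic code (bases in TCAG order).
# This is an illustrative table, not the one returned by the module.
my @bases = qw(T C A G);
my @amino = split //,
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %code;
for my $i (0 .. 3) {
    for my $j (0 .. 3) {
        for my $k (0 .. 3) {
            $code{"$bases[$i]$bases[$j]$bases[$k]"} = $amino[16 * $i + 4 * $j + $k];
        }
    }
}

# Hypothetical translator: unknown triplets become X, and with $fix_start
# an initial TTG or GTG is reported as M.
sub translate_dna {
    my ($dna, $code, $fix_start) = @_;
    my @codons = uc($dna) =~ /(...)/g;    # a trailing partial codon is ignored
    my $protein = join '', map { $code->{$_} // 'X' } @codons;
    if ($fix_start && @codons && $codons[0] =~ /^[GT]TG$/) {
        substr($protein, 0, 1) = 'M';
    }
    return $protein;
}

print translate_dna('TTGGCTTAA', \%code, 0), "\n";    # LA*
print translate_dna('TTGGCTTAA', \%code, 1), "\n";    # MA*
```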

type_of

    my $type = SeedUtils::type_of($fid);

Return the type of a feature, given a FIG feature ID (e.g. fig|100226.1.peg.3361).

fid: ID of a feature whose type is desired.
RETURN: Returns the type of the feature (e.g. peg, rna, ...).

verify_dir

    verify_dir($dirName);

Ensure that the specified directory exists. If the directory does not exist, it will be created.

dirName: Name of the relevant directory.

validate_fasta_file

    my $sequence_type = validate_fasta_file($in_file, $out_file);

Ensure the given file is in valid fasta format. If $out_file is given, write the data to $out_file as a normalized fasta file (with cleaned up line endings, upper case data).

If successful, returns the string "dna" or "protein".

Will invoke die() on failure; call inside eval{} to ensure full error catching.

strip_func

    my $stripped = SeedUtils::strip_func($func);

Remove the comment and the FIGfam identifier from a function.

strip_func_comment

    my ($stripped, $comment) = strip_func_comment($func);

    my $stripped = strip_func_comment($func);

Split the comment from a function.

compute_metrics

    my $metricHash = SeedUtils::compute_metrics(\@lens, $totLen);

Compute metrics about a genome given a list of contig lengths. The metrics returned will include N50, N70, N90, total DNA length, and probable completeness.

lens: Reference to a list of contig lengths.
totLen (optional): The total length of all the contigs.
RETURN: Returns a reference to a hash with the following keys.

N50: The N50 of the contig lengths (see "n_metric").
N70: The N70 of the contig lengths.
N90: The N90 of the contig lengths.
totlen: The total DNA length.
complete: 1 if the genome is mostly complete, else 0.
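
As a sketch of the N-metrics, the code below uses the common definition of N50 (and likewise N70 and N90): sort the contigs from longest to shortest and report the length at which the running total first reaches that percentage of all bases. The name n_statistic is hypothetical, the module's "n_metric" may differ in detail, and the completeness flag is not modeled.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical N-metric under the common definition described above.
sub n_statistic {
    my ($percent, $lens) = @_;
    my $total = sum(@$lens) or return 0;    # empty list: nothing to report
    my $needed = $total * $percent / 100;
    my $running = 0;
    for my $len (sort { $b <=> $a } @$lens) {
        $running += $len;
        return $len if $running >= $needed;
    }
    return 0;
}

my @lens = (100, 80, 50, 30, 20);
printf "N50=%d N70=%d N90=%d\n", map { n_statistic($_, \@lens) } 50, 70, 90;
```

With these five contigs (280 bases in total) the sketch reports N50=80, N70=50, and N90=30.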

canonical_function

    my $clean_function = canonical_function($function);

Return a cleaned-up copy of a function string, normalizing irregularities such as leading spaces, trailing spaces, and tabs.

verify_db

    verify_db($db, $type);

Ensure we have a BLAST database for the specified FASTA.

compare_region_color

    my ($r,$g,$b) = compare_region_color($n);

Return the nth color for compare region displays.

write_encoded_object

  write_encoded_object( $json,  $filename   [, \%options] )
  write_encoded_object( $json, \*FILEHANDLE [, \%options] )
  write_encoded_object( $json, \$string     [, \%options] )
  write_encoded_object( $json               [, \%options] )    # default: \*STDOUT

Write a Perl object to an output file in JSON format.

Options:

     condensed => $bool   #  If true, do not invoke 'pretty'
     pretty    => $bool   #  If explicitly false, do not invoke 'pretty'
     canonical => $bool   #  If true, emit canonical form. Can be expensive on large objects.
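
The interplay of these options can be sketched with the core JSON::PP module. Whether the module itself uses JSON::PP is an assumption, and encode_object is a hypothetical name; the sketch only returns the encoded text and does not handle file names or file handles.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical encoder: pretty output unless "condensed" is true or
# "pretty" is explicitly false; sorted keys when "canonical" is true.
sub encode_object {
    my ($object, $options) = @_;
    $options ||= {};
    my $pretty = !$options->{condensed}
        && !(exists $options->{pretty} && !$options->{pretty});
    my $json = JSON::PP->new;
    $json->pretty    if $pretty;
    $json->canonical if $options->{canonical};
    return $json->encode($object);
}

my $text = encode_object({ b => 2, a => 1 }, { condensed => 1, canonical => 1 });
print "$text\n";                        # compact, keys in sorted order
my $copy = JSON::PP->new->decode($text);
print "a=$copy->{a} b=$copy->{b}\n";    # the object survives the round trip
```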

read_encoded_object

  $object = read_encoded_object(  $filename )
  $object = read_encoded_object( \*FILEHANDLE )
  $object = read_encoded_object( \$string )
  $object = read_encoded_object( )                # default: \*STDIN

Read a JSON object from a file.

read_ids

    my @ids = SeedUtils::read_ids($fileName);

Read a list of IDs from a tab-delimited file. The IDs are taken from the first column of each record.

fileName: Name of the file from which to read the IDs.
RETURN: Returns a list of the IDs read.