SeqFindr BLAST methods
Given a fasta_file, generate a nucleotide BLAST database
The database is written to DB/ in the working directory, or to OUTPUT/DB if an output directory is given in the arguments
Parameters: | fasta_file (string) – full path to a fasta file |
---|---|
Return type: | the strain id (must be delimited by ‘_’) |
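A minimal sketch of how such a database could be built, assuming the NCBI BLAST+ makeblastdb tool is available. The helper name build_makeblastdb_cmd and the rule "strain id is the file name up to the first '_'" are assumptions made for illustration, not SeqFindr's own code. The function only assembles the command so it can be inspected before anything is run.

```python
import os


def build_makeblastdb_cmd(fasta_file, output_dir=None):
    """Assemble a makeblastdb command for a nucleotide database.

    Hypothetical helper: returns (strain_id, command list).
    """
    # DB/ in the working directory, or OUTPUT/DB when an output dir is given
    db_dir = os.path.join(output_dir, "DB") if output_dir else "DB"
    # Assumption: the strain id is the file name up to the first '_'
    strain_id = os.path.basename(fasta_file).split("_")[0]
    db_path = os.path.join(db_dir, strain_id)
    cmd = ["makeblastdb", "-in", fasta_file, "-dbtype", "nucl", "-out", db_path]
    return strain_id, cmd
```

The returned list can be passed to subprocess.check_call once the DB directory exists.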
Using NCBIXML parse the BLAST results, storing & returning good hits
Parameters: | |
---|---|
Return type: | list of satisfying hit names |
Given a mfa of query sequences of interest & a database, search for them.
# TODO: Add evalue filtering?
# TODO: add task=’blastn’ to use blastn scoring?
Warning
default is megablast
Warning
tblastx functionality has not been checked
Parameters: | |
---|---|
Returns: | the path of the blast.xml file |
SeqFindr configuration class: 100% test coverage, > 9 PyLint score
Bases: object
A SeqFindr configuration class - subtle manipulation to plots
Prints all set configuration options to STDOUT
Read a SeqFindr configuration file
Currently only supports category colors in RGB format
category_colors = [(0,0,0),(255,255,255),....,(r,g,b)]
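As an illustration of the category_colors format above, here is a sketch of parsing one such line into RGB tuples. The function name parse_category_colors and the validation rules are assumptions; the real configuration class may read the file differently.

```python
import ast


def parse_category_colors(line):
    """Parse 'category_colors = [(r,g,b), ...]' into a list of RGB tuples.

    Hypothetical helper, not part of the SeqFindr API.
    """
    key, _, value = line.partition("=")
    if key.strip() != "category_colors":
        raise ValueError("unsupported option: %r" % key.strip())
    # literal_eval only accepts Python literals, so no code is executed
    colors = ast.literal_eval(value.strip())
    for rgb in colors:
        if len(rgb) != 3 or not all(0 <= c <= 255 for c in rgb):
            raise ValueError("not an RGB triple: %r" % (rgb,))
    return [tuple(rgb) for rgb in colors]
```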
Generate a list of length number of distinct “good” random colors
See: https://github.com/fmder/ghalton
Parameters: |
|
---|---|
Type: | int |
Type: | int |
Return type: | a list of lists in the form: [[243, 137, 121], [232, 121, 243], [216, 121, 243]] |
Convert HSV to RGB
Parameters: |
|
---|
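The color generation and HSV to RGB conversion described above can be sketched with the standard library alone. Note the substitution: this sketch steps the hue by the golden ratio conjugate rather than using the Halton sequence from the linked ghalton package, and the saturation and value defaults are assumptions chosen for illustration.

```python
import colorsys

GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def distinct_colors(number, saturation=0.5, value=0.95):
    """Return `number` colors as [r, g, b] lists (0-255) with spread-out hues.

    Stand-in for the Halton-based generator; defaults are illustrative.
    """
    colors = []
    hue = 0.0
    for _ in range(number):
        # colorsys works on floats in [0, 1]; scale to 0-255 integers
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colors.append([int(round(c * 255)) for c in (r, g, b)])
        # Stepping by the golden ratio conjugate keeps successive hues apart
        hue = (hue + GOLDEN_RATIO_CONJUGATE) % 1.0
    return colors
```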
SeqFindr v0.34.0 - A tool to easily create informative genomic feature plots (http://github.com/mscook/SeqFindr)
Populate row given all possible hits, accepted hits and an optional score
Parameters: | |
---|---|
Return type: | a list of floats |
Check if there are any informative sites in the matrix
From a matrix, generate a distance matrix & perform hierarchical clustering
Parameters: | |
---|---|
Returns: | a tuple of the updated (clustered) matrix & the updated labels |
The ‘core’ SeqFindr method
TODO: Exception handling if do_run fails or produces no results
Parameters: | args – the arguments given from argparse |
---|
Determine the value in the matrix assigned to nohit given SeqFindr options
Parameters: |
|
---|---|
Returns: | the value defined as no hit in the results matrix |
Perform a SeqFindr run
Reorder a second matrix based on the first row element of the 1st matrix
Parameters: | |
---|---|
Return type: | 2 matrices (2D lists) |
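A sketch of the reordering idea, assuming each row starts with a unique ID and that every ID in the first matrix also appears in the second. The function name is hypothetical.

```python
def reorder_by_first_element(mat1, mat2):
    """Reorder rows of mat2 so their IDs (first element) follow mat1's order.

    Assumes IDs are unique and present in both matrices.
    """
    by_id = {row[0]: row for row in mat2}
    reordered = [by_id[row[0]] for row in mat1]
    return mat1, reordered
```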
Plot the VF hit matrix
Parameters: |
|
---|
Given a set of sequences of interest, extract all query & query classes
A sequence of interest file is a mfa file in the format:
>ident, gene id, annotation, organism [class]
query = gene id query_class = class
Location of sequence of interest file is defined by args.seqs_of_interest
Parameters: | args (argparse args) – the argparse args containing args.seqs_of_interest (fullpath) to a sequence of interest DB (mfa file) |
---|---|
Return type: | 2 lists, 1) of all queries and, 2) corresponding query classes |
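The header format above can be parsed roughly as follows. This is a sketch working on an iterable of lines rather than on args.seqs_of_interest, and the regular expression is an assumption about how strictly headers follow the documented layout.

```python
import re

# >ident, gene id, annotation, organism [class]
HEADER_RE = re.compile(
    r"^>(?P<ident>[^,]*),\s*(?P<gene>[^,]*),.*\[(?P<cls>[^\]]*)\]\s*$"
)


def extract_queries(lines):
    """Return (queries, query_classes) from the header lines of an mfa file."""
    queries, classes = [], []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = HEADER_RE.match(line.strip())
        if match is None:
            raise ValueError("header not in SeqFindr format: %r" % line)
        queries.append(match.group("gene").strip())
        classes.append(match.group("cls").strip())
    return queries, classes
```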
Strip the 1st and last ‘N’ bases from mapping consensuses
To avoid the effects of lead in and lead out coverage resulting in uncalled bases
Parameters: | args (argparse args) – the argparse args containing args.strip value |
---|---|
Return type: | the updated args to reflect the args.cons & args.seqs_of_interest location |
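Reading 'N' as the number of bases given by args.strip, the trimming step itself can be sketched as below. The function name and the handling of sequences shorter than 2N are assumptions.

```python
def strip_ends(sequence, n):
    """Drop the first and last n bases of a consensus sequence."""
    if n <= 0:
        return sequence
    if len(sequence) <= 2 * n:
        return ""  # nothing would be left after trimming both ends
    return sequence[n:-n]
```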
Remove the ID (1st row element) from a matrix
Parameters: | mat – a 2D list |
---|---|
Return type: | a 2D list with the 1st row element (ID) removed |
Remove any columns in which every element is absent
Also handles the query classes and x_labels.
Attention
new feature added in version 0.4.0
Toggle using: args.remove_empty_cols
Parameters: |
|
---|---|
Returns: | a tuple with three elements which are the: updated SeqFindr matrix, the updated query_classes list and the updated query_list respectively. |
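A sketch of the column filtering, assuming the matrix has no ID column and that its columns line up with query_list and query_classes. The value that marks an absent hit is passed in explicitly because it depends on SeqFindr options; the signature is hypothetical.

```python
def remove_empty_cols(matrix, query_classes, query_list, absent):
    """Drop columns where every row holds the `absent` value.

    Returns (matrix, query_classes, query_list) with the same columns removed.
    """
    keep = [
        j for j in range(len(query_list))
        if any(row[j] != absent for row in matrix)
    ]
    new_matrix = [[row[j] for j in keep] for row in matrix]
    return (
        new_matrix,
        [query_classes[j] for j in keep],
        [query_list[j] for j in keep],
    )
```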
SeqFindr utility methods
Check the database conforms to the SeqFindr format
Note
this is not particularly extensive
Parameters: | database_file – full path to a database file as a string |
---|---|
Deletes the elements in a list given by a index_positions list
Parameters: | |
---|---|
Returns: | a list with the elements removed defined by the index_positions list |
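One way to sketch this without mutating the input list (the function name is hypothetical):

```python
def del_from_list(items, index_positions):
    """Return a copy of items without the elements at index_positions."""
    drop = set(index_positions)  # set lookup keeps this linear in len(items)
    return [x for i, x in enumerate(items) if i not in drop]
```

Building a new list avoids the index shifting that occurs when deleting in place from front to back.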
Ensure all path arguments are absolute and have shortcuts such as ~ expanded
Just apply os.path.abspath & os.path.expanduser
Parameters: | args – the arguments given from argparse |
---|---|
Returns: | an updated args |
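A sketch of that normalisation on an argparse-style namespace. Which attributes hold paths is passed in explicitly here; the attribute names in the usage lines are assumptions.

```python
import os
from argparse import Namespace


def ensure_paths_absolute(args, path_attrs):
    """Apply expanduser then abspath to the named attributes of args."""
    for name in path_attrs:
        value = getattr(args, name, None)
        if value is not None:
            setattr(args, name, os.path.abspath(os.path.expanduser(value)))
    return args


# Usage with a hypothetical 'output' option; 'strip' is left untouched
args = Namespace(output="~/results", strip=10)
args = ensure_paths_absolute(args, ["output"])
```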
Returns all files ending with .fas/.fa/.fna in a directory
Parameters: | data_path – the full path to the directory of interest |
---|---|
Returns: | a list of fasta files (valid extensions: .fas, .fna, .fa) |
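A sketch of the extension filter using only the standard library. The function names and the sorted output are assumptions; the original may rely on glob and return files in directory order.

```python
import os

FASTA_EXTENSIONS = (".fas", ".fna", ".fa")


def is_fasta(filename):
    """True if filename ends with one of the accepted FASTA extensions."""
    return filename.endswith(FASTA_EXTENSIONS)


def get_fasta_files(data_path):
    """Return full paths of FASTA files found directly in data_path."""
    return sorted(
        os.path.join(data_path, name)
        for name in os.listdir(data_path)
        if is_fasta(name)
    )
```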
Create the output base (if needed) and change dir to it
Parameters: | args – the arguments given from argparse |
---|
Checks if a FASTA file is protein or nucleotide.
Will return -1 if no protein detected
TODO: Ambiguity characters?
TODO: exception if mix of protein/nucleotide?
Parameters: | fasta_file (string) – path to input FASTA file |
---|---|
Returns: | number of protein sequences in fasta_file (int) |
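The check can be sketched as counting records that contain residues outside a small nucleotide alphabet. The alphabet, the function name and the line-based input are assumptions. Like the original's TODO notes, this sketch does not handle IUPAC ambiguity codes or mixed files specially.

```python
NUCLEOTIDES = set("ACGTUN")


def count_protein_records(lines):
    """Count FASTA records with non-nucleotide residues; -1 if there are none."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            records.append(set())  # start collecting residues of a new record
        elif line and records:
            records[-1].update(line.upper())
    count = sum(1 for residues in records if residues - NUCLEOTIDES)
    return count if count > 0 else -1
```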
Given an order index file, maintain this order in the matrix plot
This implies no clustering. Typically used when you already have a phylogenetic tree.
Parameters: | |
---|---|
Return type: | list of updated glob.glob dir listing to match order specified |
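A sketch of imposing a user-supplied order on a directory listing. It assumes the order file has already been read into a list of strain ids, and that a file belongs to a strain when its name starts with that id followed by '_'. Both are assumptions for illustration.

```python
import os


def order_files(file_paths, ordered_ids):
    """Return file_paths rearranged to follow ordered_ids.

    Files whose strain id is not listed are dropped.
    """
    by_id = {os.path.basename(p).split("_")[0]: p for p in file_paths}
    return [by_id[i] for i in ordered_ids if i in by_id]
```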
Convert VFDB formatted files (or like) to SeqFindr formatted database files
VFDB: Virulence Factors Database www.mgc.ac.cn/VFs/ a reference database for bacterial virulence factors.
This is based on a sample file (TOTAL_Strep_VFs.fas) provided by Nouri Ben Zakour.
Examples:
# Default (will set VFDB classification identifiers as the classification)
$ vfdb_to_seqfindr -i TOTAL_Strep_VFs.fas -o TOTAL_Strep_VFs.sqf
# Sets any classification to blank ([ ])
$ vfdb_to_seqfindr -i TOTAL_Strep_VFs.fas -o TOTAL_Strep_VFs.sqf -b
# Reads a user defined classification. 1 per line, in the same order as the
# input sequences
$ python convert_vfdb_to_SeqFindr.py -i TOTAL_Strep_VFs.fas \
    -o TOTAL_Strep_VFs.sqf -c blah.dat
Suppose you want to annotate a VF class with user defined values. Simply develop a file containing the scheme (1-1 matching). If you had 6 input sequences and the first 3 are Fe transporters and the next two are Toxins and the final sequence is Misc your class file would look like this:
Fe transporter
Fe transporter
Fe transporter
Toxins
Toxins
Misc
Ensure that all entries of a particular class are in the same block
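The block requirement on the class file can be checked before conversion. This is a sketch with a hypothetical function name; it takes the class labels as a list, one per input sequence.

```python
from itertools import groupby


def classes_in_blocks(classes):
    """True if every class label occupies a single contiguous block."""
    # groupby yields one key per run of equal neighbours
    block_labels = [label for label, _ in groupby(classes)]
    return len(block_labels) == len(set(block_labels))
```

For the six-sequence example above the check passes; a list such as Toxins, Misc, Toxins would fail because Toxins appears in two separate blocks.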