samsifter.util package

Modules for general tasks shared among SamSifter, PMDtools and other tools.

Includes modules for the tasks of

  • argument sanitation, e.g. filetype, value type and range checks
  • basic filter routines on file handles
  • standard functions operating on nucleotide sequences
  • annotation of SAM files with @PG records
  • reconstruction of reference sequences from SAM records
  • serialization of SamSifter workflows
  • validation of SamSifter workflows

Some of these may be generalized and refactored into proper packages in future releases.

arg_sanitation module

Methods for input sanitation during argument parsing.

check_csv(arg)[source]

Check if argument is a CSV file.

Parameters:arg (str) – filepath
Returns:filepath to CSV file
Return type:str
Raises:ArgumentTypeError – If file does not exist or has no .csv file extension.
check_percent(arg)[source]

Check if argument is a positive Float between 0 and 100.

Note

Use check_pos_float_max100() instead.

check_pos_float(arg)[source]

Check if argument is a positive Float.

Parameters:arg (str) – command line argument
Returns:typed value of command line argument
Return type:float
Raises:ArgumentTypeError – If value is negative.
check_pos_float_max1(arg)[source]

Check if argument is a positive Float between 0 and 1.

Parameters:arg (str) – command line argument
Returns:typed value of command line argument
Return type:float
Raises:ArgumentTypeError – If value is negative or larger than 1.0.
check_pos_float_max100(arg)[source]

Check if argument is a positive Float between 0 and 100.

Parameters:arg (str) – command line argument
Returns:typed value of command line argument
Return type:float
Raises:ArgumentTypeError – If value is negative or larger than 100.0.
check_pos_int(arg)[source]

Check if argument is a positive Integer.

Parameters:arg (str) – command line argument
Returns:typed value of command line argument
Return type:int
Raises:ArgumentTypeError – If value is negative.
check_sam(arg)[source]

Check if argument is a SAM file.

Parameters:arg (str) – filepath
Returns:filepath to SAM file
Return type:str
Raises:ArgumentTypeError – If file does not exist or has no .sam file extension.
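
The checkers above are shaped to be used as type callables during argument parsing. The following is a minimal sketch of that pattern, assuming a stand-in re-implementation of check_pos_float; the function body, the handling of non-numeric input and the --min-score option are illustrative assumptions, not SamSifter's source.

```python
import argparse


def check_pos_float(arg):
    """Illustrative stand-in: convert arg to float and reject negatives."""
    try:
        value = float(arg)
    except ValueError:
        # assumption: non-numeric input is reported the same way
        raise argparse.ArgumentTypeError("%r is not a number" % arg)
    if value < 0:
        raise argparse.ArgumentTypeError("%r is negative" % arg)
    return value


parser = argparse.ArgumentParser()
# hypothetical option; argparse calls the checker on the raw string
parser.add_argument("--min-score", type=check_pos_float, default=0.0)

args = parser.parse_args(["--min-score", "0.75"])
print(args.min_score)
```

Raising ArgumentTypeError lets argparse turn a bad value into a regular usage error instead of a traceback.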

genetics module

Library of standard methods for manipulation of genetic sequences.

aln_identity(read, ref, include_indels=False, include_deamination=False, include_unknown=False)[source]

Determines identity of two sequences in an alignment.

Calculates a modified Hamming distance for a gapped alignment, optionally excluding possibly deaminated bases (C>T and G>A mismatches) as well as indels.

Enabling all three optional parameters will calculate values identical to MALT while the default settings will calculate values identical to PMDtools.

Parameters:
  • read (str) – Full read sequence.
  • ref (str) – Corresponding reference sequence (only the aligned part, but including any gaps, skips and indels).
  • include_indels (bool, optional) – Consider insertions and deletions, defaults to False.
  • include_deamination (bool, optional) – Consider any potentially deaminated bases, defaults to False.
  • include_unknown (bool, optional) – Consider any unknown bases (N), defaults to False.
Returns:

  • float – Identity of read and reference as fraction of 1.
  • str – Mismatch string (match = |, mismatch = x, indel = -).

Example

Showing differences between PMDtools and MALT:

TCCAGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCTTCA
||-|||||||||||||||||||||||||||||||||||||-|x
TC-AGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCT-CG
]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]

43 bases
2 gaps
1 total mismatches
    1 G>A mismatches (A in read but G in reference)
    0 C>T mismatches (T in read but C in reference)
    0 other mismatches
40 matches

PMD:    40 / 40 = 100%
identity = 1.0 * matches / (matches + other mismatches)
         = 1.0 * 40 / (40 + 0)
         = 1.0

MALT:   40 / 43 = 93%
identity = 1.0 * matches / (matches + gaps + total mismatches)
         = 1.0 * 40 / (40 + 2 + 1)
         = 0.930232
         ~ 0.93

Thus it is required to include indels, potentially deaminated bases and unknown bases to produce identity values similar to MALT.
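
The two formulas from the example can be sketched as a single counting loop. This is a simplified illustration under stated assumptions, not the aln_identity implementation: the name identity_sketch is hypothetical, both inputs are assumed to be equal-length aligned strings with - for gaps, and unknown bases (N) are assumed to be marked as mismatches.

```python
def identity_sketch(read, ref, include_indels=False,
                    include_deamination=False, include_unknown=False):
    """Return (identity, mismatch string) for two aligned strings."""
    deaminated = {("C", "T"), ("G", "A")}  # (reference base, read base)
    matches = 0
    counted = 0  # positions that enter the denominator
    marks = []
    for r, f in zip(read.upper(), ref.upper()):
        if r == "-" or f == "-":
            marks.append("-")
            if include_indels:
                counted += 1
        elif r == "N" or f == "N":
            marks.append("x")  # assumption: unknown bases shown as mismatch
            if include_unknown:
                counted += 1
        elif r == f:
            marks.append("|")
            matches += 1
            counted += 1
        else:
            marks.append("x")
            if include_deamination or (f, r) not in deaminated:
                counted += 1
    identity = matches / counted if counted else 0.0
    return identity, "".join(marks)


read = "TCCAGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCTTCA"
ref = "TC-AGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCT-CG"
pmd_like, marks = identity_sketch(read, ref)
malt_like, _ = identity_sketch(read, ref, True, True, True)
print(marks)
print(pmd_like, round(malt_like, 2))
```

With the sequences from the example, the default settings leave only the 40 matches in the denominator, while enabling all options adds the two gaps and the G>A mismatch.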

complement(base)[source]

Returns a base’s complement according to IUPAC or None.

Parameters:base (str) – Single one-letter IUPAC nucleotide code.
Returns:
  • str – Complementary one-letter IUPAC ambiguity code.
  • None – If base is invalid.
gc(seq)[source]

Calculate GC content of a nucleotide sequence.

Considers all IUPAC codes using 6x counts to deal with 1/2 and 1/3 probabilities from ambiguity codes.

Parameters:seq (str) – A nucleotide sequence.
Returns:GC content as fraction of 1.
Return type:float
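
The 6x counting idea can be sketched as a lookup table of integer weights. This is an assumption about how such a calculation might look, not the module's code; gc_sketch and GC_WEIGHT_X6 are hypothetical names, and skipping characters outside the table (such as gaps) is a choice made for this sketch.

```python
# GC weight of each IUPAC code, multiplied by 6 so that the
# 1/2 and 1/3 probabilities of ambiguity codes stay integers.
GC_WEIGHT_X6 = {
    "G": 6, "C": 6, "S": 6,                  # always G or C
    "A": 0, "T": 0, "U": 0, "W": 0,          # never G or C
    "R": 3, "Y": 3, "K": 3, "M": 3, "N": 3,  # G or C with probability 1/2
    "B": 4, "V": 4,                          # two of three options are G or C
    "D": 2, "H": 2,                          # one of three options is G or C
}


def gc_sketch(seq):
    """GC content as fraction of 1; characters outside the table are skipped."""
    weights = [GC_WEIGHT_X6[b] for b in seq.upper() if b in GC_WEIGHT_X6]
    if not weights:
        return 0.0
    return sum(weights) / (6 * len(weights))


print(gc_sketch("ATGC"), gc_sketch("ATGN"))
```

Six is the least common multiple of 2 and 3, which is why every ambiguity code maps to a whole number.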
is_iupac(seq)[source]

Check nucleotide sequence for invalid (non-IUPAC) codes.

Parameters:seq (str) – Nucleotide sequence.
Returns:True if sequence contains only IUPAC ambiguity codes, otherwise False.
Return type:bool
opposite(base)[source]

Mirror base code into the exact opposite IUPAC ambiguity code.

‘A’ becomes ‘not A’, gap becomes any base, etc.

Parameters:base (str) – Single one-letter IUPAC nucleotide code.
Returns:
  • str – Opposite one-letter IUPAC ambiguity code.
  • None – If base is invalid.
reverse(seq)[source]

Reverses a sequence.

Parameters:seq (str) – Any sequence.
Returns:Reverse sequence.
Return type:str
reverse_complement(seq)[source]

Reverse complement a nucleotide sequence.

Parameters:seq (str) – A nucleotide sequence of IUPAC ambiguity codes.
Returns:Reverse complement of the nucleotide sequence.
Return type:str
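
A reverse complement over the full IUPAC alphabet can be sketched with a translation table. The name reverse_complement_sketch is hypothetical, and the upper-casing of the input is an assumption of this sketch rather than documented behaviour.

```python
# Each IUPAC code paired with its complement (S, W and N map to themselves).
_COMPLEMENT = str.maketrans("ACGTRYKMSWBVDHN", "TGCAYRMKSWVBHDN")


def reverse_complement_sketch(seq):
    """Complement every base, then reverse the result."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATGC"))
```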
reverse_transcribe(rna)[source]

Reverse transcription of RNA sequence to corresponding cDNA sequence.

Parameters:rna (str) – RNA nucleotide sequence.
Returns:DNA nucleotide sequence.
Return type:str
transcribe(dna)[source]

Transcription of DNA sequence to corresponding RNA sequence.

Parameters:dna (str) – DNA nucleotide sequence.
Returns:RNA nucleotide sequence.
Return type:str

papertrail module

Annotation of SAM file headers with metadata on applied programs.

create_pg_line(identifier, name=None, command_line=None, previous=None, description=None, version=None)[source]

Create a @PG header for SAM files.

Follows SAMv1 specification, see https://samtools.github.io/hts-specs/SAMv1.pdf

Parameters:
  • identifier (str) – Unique program identifier within the context of the current SAM file that is identical to the identifier used in the alignment section. May be modified by later processing steps.
  • name (str, optional) – Program name, defaults to None.
  • command_line (str, optional) – Program command line excluding program name/call, just the optional and positional arguments. Defaults to None.
  • previous (str, optional) – Program identifier of previous step in program chain.
  • description (str, optional) – Short description of program functionality.
  • version (str, optional) – Program version number/string.
Returns:

Complete standard-compliant @PG header line.

Return type:

str
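
As a rough illustration of what such a header line looks like, the sketch below joins the @PG tags defined in the SAM specification (ID, PN, CL, PP, DS, VN) with tabs and omits tags that were not supplied. The name pg_line_sketch and the tag order are assumptions of this sketch, not taken from the module.

```python
def pg_line_sketch(identifier, name=None, command_line=None,
                   previous=None, description=None, version=None):
    """Build a tab-separated @PG line from the supplied tags."""
    fields = [("ID", identifier), ("PN", name), ("CL", command_line),
              ("PP", previous), ("DS", description), ("VN", version)]
    parts = ["@PG"]
    for tag, value in fields:
        if value is not None:  # optional tags are left out entirely
            parts.append("%s:%s" % (tag, value))
    return "\t".join(parts)


# hypothetical program name and version
print(pg_line_sketch("filter1", name="filter", version="1.0"))
```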

find_last_in_chain(current_id, pg_entries)[source]

Recursively identify the last program entry in the chain.

@PG entries in SAM file headers can be chained by linking each entry to the previously applied program using the optional PP tag. The actual last entry of the chain can only be found under the assumption that all programs are listed within one chain.

Note

May lead to wrong assumptions about the actual order of programs applied to dataset if the chain of PG entries is interrupted by missing PP tags. Use at own risk!

Parameters:
  • current_id (str) – Unique ID of program entry that serves as arbitrary start point for the recursive search of the chain end.
  • pg_entries (dict of dicts) – Dictionary of @PG record tags (key:value) referenced by their unique program ID.
Returns:

Unique ID of the assumed last entry in the program chain (see note above!)

Return type:

str
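
The recursive search described above can be sketched as follows: look for an entry whose PP tag points at the current ID and continue from there until no successor exists. The name find_last_sketch and the example program IDs are hypothetical; the sketch assumes an unbranched chain without cycles.

```python
def find_last_sketch(current_id, pg_entries):
    """Follow PP links forward until no entry names current_id as previous."""
    for pg_id, tags in pg_entries.items():
        if tags.get("PP") == current_id:
            return find_last_sketch(pg_id, pg_entries)
    return current_id


# hypothetical @PG records keyed by program ID
entries = {
    "aligner": {"PN": "aligner"},
    "sorter": {"PN": "sorter", "PP": "aligner"},
    "filter": {"PN": "filter", "PP": "sorter"},
}
print(find_last_sketch("aligner", entries))
```

An entry that lacks its PP tag ends the search early, which is the failure mode the note above warns about.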

main()[source]

Simple test of PG header generation.

modify_header(lines, name=None, command_line=None, description=None, version=None)[source]

Modify SAM header by inserting a new @PG record into the chain.

Ensures unique program identifiers and integrity of program chain. New entries are always inserted after all current PG entries and their order is not changed in case the entries are not chained by optional PP tags.

Note

Chaining of entries with PP tags is disabled as the chain of programs is often incomplete and may be misleading! The recursion through existing entries may not always be sufficient to identify the current last entry in the chain.

Parameters:
  • lines (list of str) – Full set of header lines of SAM file, including the first header line with @HD record.
  • name (str, optional) – Program name, defaults to None.
  • command_line (str, optional) – Program command line excluding program name/call, just the optional and positional arguments. Defaults to None.
  • description (str, optional) – Short description of program functionality.
  • version (str, optional) – Program version number/string.

sam module

Reconstruction of reference sequences from CIGAR and MD tags in SAM files.

Includes tests for various scenarios of clipping, insertion and deletion.

decompress_cigar(cigar)[source]

Decompress CIGAR string.

Parameters:cigar (str) – full CIGAR string (numbers and letters)
Returns:cigar_ext – decompressed CIGAR string (letters only)
Return type:str
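
Decompressing a CIGAR string amounts to repeating each operation letter by its count. A minimal sketch using a regular expression, with decompress_cigar_sketch as a hypothetical name; input that does not follow the count/letter pattern is silently ignored here, which the real function may handle differently.

```python
import re


def decompress_cigar_sketch(cigar):
    """Expand e.g. '3M1I2D' into 'MMMIDD'."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return "".join(op * int(count) for count, op in pairs)


print(decompress_cigar_sketch("3M1I2D"))
```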
decompress_md(md)[source]

Decompress MD string.

Parameters:md (str) – MD string from SAM file without MD:Z: prefix
Returns:md_ext – extended MD string without MD:Z: prefix
Return type:str
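
The exact layout of the extended MD string is not described here, so the following sketch picks one possible expansion purely for illustration: a dot for every matching base, the upper-case reference base for a mismatch, and lower-case reference bases for a deletion. Both the name decompress_md_sketch and this output alphabet are assumptions. The sketch handles only the standard form in which a 0 separates a deletion from a directly following mismatch.

```python
import re

# A run of matches, a deletion (^ plus bases), or a single mismatched base.
_MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def decompress_md_sketch(md):
    """Expand an MD string into one character per reference position."""
    out = []
    for count, deleted, mismatch in _MD_TOKEN.findall(md):
        if count:
            out.append("." * int(count))   # matching bases
        elif deleted:
            out.append(deleted.lower())    # bases deleted from the reference
        else:
            out.append(mismatch.upper())   # reference base at a mismatch
    return "".join(out)


print(decompress_md_sketch("10A5^AC6"))
```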
main(argv)[source]

Runs simple tests of the reference reconstruction method.

reconstruct(read, cigar, md)[source]

Reconstruct both the full and the aligned reference sequence.

Uses information from the CIGAR string and the optional MD tag to reconstruct the reference sequence. The full reference sequence including indels, clips and skips and the aligned substring can be used for calculation of identity and post-mortem degradation.

Note

Padded references should work as well, but remain untested. Use at your own risk!

Parameters:
  • read (str) – Raw read sequence without gaps, can be clipped.
  • cigar (str) – CIGAR string from SAM file
  • md (str) – matching MD string from SAM file without MD:Z: prefix
Returns:

  • ref_full (str) – full reference sequence including indels, clipped and skipped parts
  • read_full (str) – full read sequence including indels, clipped and skipped parts
  • ref_aln (str) – aligned part of reference sequence (can include gaps)
  • read_aln (str) – aligned part of read sequence (clipped, can include gaps)

test_deletion_snp_0_delim(verbose=False)[source]

Tests reference reconstruction with 0 delimiter after deletion in MD.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool
test_deletion_snp_no_delim(verbose=False)[source]

Tests reconstruction despite missing 0 delimiter after deletion in MD.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool
test_hard_clipping_end(verbose=False)[source]

Tests reference reconstruction with hard clipped 3’ read.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool
test_hard_clipping_front(verbose=False)[source]

Tests reference reconstruction with hard clipped 5’ read.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool
test_insertion(verbose=False)[source]

Tests reference reconstruction with insertion to reference.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool
test_skipped_intron(verbose=False)[source]

Tests reference reconstruction with skipped intron.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool
test_soft_clipping_end(verbose=False)[source]

Tests reference reconstruction with soft clipped 3’ read.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool
test_soft_clipping_front(verbose=False)[source]

Tests reference reconstruction with soft clipped 5’ read.

Parameters:verbose (bool) – Print additional output to STDOUT, defaults to False.
Returns:True if test was successful, otherwise False.
Return type:bool

serialize module

(De-)Serialization of SamSifter workflows.

class WorkflowSerializer[source]

Bases: builtins.object

(De-)Serializing workflow data for permanent storage.

Uses ElementTree and minidom libraries to write to and read from XML files.

classmethod deserialize(workflow, filename)[source]

De-serializes workflow from XML file.

Parameters:
  • workflow (Workflow) – SamSifter workflow object to be deserialized.
  • filename (str) – Readable path to existing XML file.
Returns:

True on success.

Return type:

bool

classmethod determine_type(value)[source]

Determines the type of a typed value and returns it as a string.

Helper method for serialization of typed Python values to string representations in XML.

Parameters:value (bool, str, int, float) – Typed value.
Returns:String representation of value type.
Return type:str
classmethod prettify(elem, indent=' ')[source]

Create a pretty-printed XML string for an individual tree element.

Parameters:
  • elem (Element) – Tree element.
  • indent (str, optional) – Indentation character(s) to distinguish tree levels, defaults to doublespace.
Returns:

Pretty XML string representing the element.

Return type:

str
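
Pretty-printing with ElementTree and minidom is commonly done by serializing the element and re-parsing it with minidom. The sketch below shows that idiom; prettify_sketch and the element names are hypothetical, and the two-space default follows the parameter description above.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def prettify_sketch(elem, indent="  "):
    """Return an indented XML string for a single ElementTree element."""
    rough = ET.tostring(elem, encoding="unicode")
    return minidom.parseString(rough).toprettyxml(indent=indent)


root = ET.Element("workflow")                  # hypothetical element names
ET.SubElement(root, "filter", name="example")
pretty = prettify_sketch(root)
print(pretty)
```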

classmethod serialize(workflow, filename)[source]

Serializes workflow to XML file.

Parameters:
  • workflow (Workflow) – SamSifter workflow object to be serialized.
  • filename (str) – Writable path to new XML file.
Returns:

True on success.

Return type:

bool

classmethod singletons(value)[source]

Replaces ‘None’, ‘False’ and ‘True’ strings with actual singletons.

Helper method for deserialization from strings to actual Python singleton entities.

Parameters:value (str) – Value string from XML file.
Returns:
  • None – If string represents None.
  • bool – If string represents True or False.
  • str – If string does not represent a singleton.
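
The replacement described above can be sketched as a dictionary lookup that falls back to the original string. The name singletons_sketch is hypothetical.

```python
_SINGLETONS = {"None": None, "True": True, "False": False}


def singletons_sketch(value):
    """Map 'None', 'True' and 'False' to the singletons, pass others through."""
    return _SINGLETONS.get(value, value)


print(singletons_sketch("True"), singletons_sketch("42"))
```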
classmethod tree_from_file(filename)[source]

Initializes element tree from XML file.

Parameters:filename (str) – Readable path to existing XML file.
Returns:Tree representation of XML file contents.
Return type:ElementTree
classmethod tree_to_file(tree, filename)[source]

Write element tree to XML file.

Parameters:
  • tree (ElementTree) – Tree representation of workflow.
  • filename (str) – Writable path to new XML file.
Returns:

True on success.

Return type:

bool

classmethod tree_to_str(tree, pretty=False)[source]

Converts XML element tree to optionally indented text string.

Parameters:
  • tree (ElementTree) – Tree to be printed.
  • pretty (bool, optional) – Enable indentation and multiline string representation for better legibility of the XML, defaults to False.
Returns:

XML string representing the tree.

Return type:

str

classmethod workflow_from_xml(tree, workflow)[source]

Generate workflow from XML element tree.

Parameters:
  • tree (ElementTree) – Tree representation of workflow.
  • workflow (Workflow) – SamSifter workflow container object to be de-serialized into.
Returns:

  • workflow (Workflow) – SamSifter workflow container object.
  • None – If tree is uninitialized.

classmethod workflow_to_xml(workflow)[source]

Generate XML tree from workflow container.

Parameters:workflow (Workflow) – SamSifter workflow container object to be serialized.
Returns:
  • ElementTree – Tree representation of workflow
  • None – If workflow is uninitialized.

validation module

Validation of workflows.

class WorkflowValidator(workflow, parent=None)[source]

Bases: PyQt4.QtCore.QObject

Validator for SamSifter workflows.

Starts off with empty error lists for input, model and output that can be retrieved individually after running validate(). The individual steps of validation should always be run in the same order.

Lists of error messages can be retrieved individually for each of the three validation steps

  • input validation
  • filter model validation
  • output validation

by calling the corresponding getter methods. They will be reported as one concatenated list when running the full validation.

get_input_errors()[source]
get_model_errors()[source]
get_output_errors()[source]
validate()[source]

Validate input, workflow model and output in this order.

Returns:List of error messages occurring in input, model and output validation.
Return type:list of str
validate_input()[source]

Validate input file of workflow.

validate_model()[source]

Validate workflow model.

validate_output()[source]

Validate output file of workflow.

filters module

Basic filtering operations on files.

line_filter(lines, filehandle, discard=True, offset=0)[source]

Filters specific lines from a file.

Prints only lines not contained in the list to STDOUT, ignoring the number of leading lines defined by offset (default is 0 = no header). Set discard to False for the inverse operation, which prints only the lines contained in the list.

Parameters:
  • lines (list of int) – List of line numbers to remove from file. Line numbers are considered to be 0-based unless an offset is specified. List can be unsorted and duplicates will be removed prior to filtering.
  • filehandle (File) – Opened and readable file object.
  • discard (bool, optional) – Print only lines matching none of the entries to STDOUT. Defaults to True.
  • offset (int, optional) – Positive offset for line numbers to be used in case the cursor of the filehandle has been placed after the start of the line numbering (‘fast forward’). Useful to skip header sections of a file.
Returns:

True on success, False on empty list.

Return type:

bool

Raises:
  • Exception – If file ends before all specified lines are filtered (indicating wrong use of the offset parameter).
  • IndexError – If first line to be filtered is within the specified offset.
  • ValueError – If offset is negative.
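
The documented behaviour, including the three error cases, can be sketched as below. This is an illustration under assumptions rather than the module's code: line_filter_sketch is a hypothetical name, and the out parameter is added only so the example can write somewhere other than STDOUT.

```python
import io
import sys


def line_filter_sketch(lines, filehandle, discard=True, offset=0,
                       out=sys.stdout):
    """Write lines to out, dropping (or keeping) the listed line numbers."""
    if offset < 0:
        raise ValueError("offset must not be negative")
    remaining = set(lines)          # removes duplicates, order is irrelevant
    if not remaining:
        return False
    if min(remaining) < offset:
        raise IndexError("first line to filter lies within the offset")
    # the handle is assumed to be positioned at line number `offset`
    for number, line in enumerate(filehandle, start=offset):
        listed = number in remaining
        remaining.discard(number)
        if listed != discard:       # discard=True keeps unlisted lines
            out.write(line)
    if remaining:
        raise Exception("file ended before all listed lines were reached")
    return True


handle = io.StringIO("a\nb\nc\nd\n")
result = io.StringIO()
line_filter_sketch([1, 3, 1], handle, out=result)
print(result.getvalue(), end="")
```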
main()[source]

Simple test of pattern_filter and line_filter methods.

pattern_filter(patterns, filehandle, discard=True)[source]

Emulates grep-like inverse pattern search.

Emulates the behaviour of grep -v -f PATTERNFILE and prints only non-matching lines to STDOUT. Set discard to False for the inverse operation, which prints only lines matching at least one of the patterns.

Parameters:
  • patterns (list of str) – List of string patterns to search.
  • filehandle (File) – Opened and readable file object.
  • discard (bool, optional) – Print only lines matching none of the patterns to STDOUT. Defaults to True.
Returns:

True on successful filtering, False on empty pattern list or other error.

Return type:

bool
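
A minimal sketch of the inverse pattern search, assuming plain substring matching (the real function may interpret patterns differently). The name pattern_filter_sketch and the out parameter are hypothetical; out exists only so the example can capture the result instead of writing to STDOUT.

```python
import io
import sys


def pattern_filter_sketch(patterns, filehandle, discard=True, out=sys.stdout):
    """Write lines to out that match none (or any) of the patterns."""
    if not patterns:
        return False
    for line in filehandle:
        matched = any(pattern in line for pattern in patterns)
        if matched != discard:      # discard=True keeps non-matching lines
            out.write(line)
    return True


handle = io.StringIO("read1\tchr1\nread2\tchr2\nread3\tchr1\n")
kept = io.StringIO()
pattern_filter_sketch(["read2"], handle, out=kept)
print(kept.getvalue(), end="")
```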
