Modules for general tasks shared among SamSifter, PMDtools and other tools.
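Includes modules for the tasks of input sanitation during argument parsing, manipulation of genetic sequences, annotation of SAM file headers, reconstruction of reference sequences, (de-)serialization and validation of workflows, and basic filtering of files.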
Some of these may be generalized and refactored into proper packages in future releases.
Methods for input sanitation during argument parsing.
Check if argument is a CSV file.
Parameters: | arg (str) – filepath |
---|---|
Returns: | filepath to CSV file |
Return type: | str |
Raises: | ArgumentTypeError – If file does not exist or has no .csv file extension. |
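A check of this kind is typically passed to argparse as a `type` callable. The following is a minimal sketch rather than the library's own code; the function name `check_csv` and the error messages are assumptions made for illustration.

```python
import argparse
import os


def check_csv(arg):
    """Return arg unchanged if it names an existing file ending in .csv.

    Hypothetical name and messages; only the documented behaviour
    (existence and extension check) is taken from the description above.
    """
    if not os.path.isfile(arg):
        raise argparse.ArgumentTypeError("%s does not exist" % arg)
    if not arg.lower().endswith(".csv"):
        raise argparse.ArgumentTypeError("%s has no .csv file extension" % arg)
    return arg
```

With `parser.add_argument("--list", type=check_csv)`, argparse turns the raised `ArgumentTypeError` into a regular usage error message.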
Check if argument is a positive Float between 0 and 100.
Note
Use check_pos_float_max100() instead.
Check if argument is a positive Float.
Parameters: | arg (str) – command line argument |
---|---|
Returns: | typed value of command line argument |
Return type: | float |
Raises: | ArgumentTypeError – If value is negative. |
Check if argument is a positive Float between 0 and 1.
Parameters: | arg (str) – command line argument |
---|---|
Returns: | typed value of command line argument |
Return type: | float |
Raises: | ArgumentTypeError – If value is negative or larger than 1.0. |
Check if argument is a positive Float between 0 and 100.
Parameters: | arg (str) – command line argument |
---|---|
Returns: | typed value of command line argument |
Return type: | float |
Raises: | ArgumentTypeError – If value is negative or larger than 100.0. |
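The range checks above all follow the same pattern: convert the string, test the bounds, raise `ArgumentTypeError` on failure. A minimal sketch for the 0 to 100 case is shown here; the handling of non-numeric input and the `--percent` option are assumptions, not documented behaviour.

```python
import argparse


def check_pos_float_max100(arg):
    """Convert arg to float and require 0 <= value <= 100."""
    try:
        value = float(arg)
    except ValueError:
        # Assumption: non-numeric input is reported the same way as a range error.
        raise argparse.ArgumentTypeError("%r is not a number" % arg)
    if not 0.0 <= value <= 100.0:  # this form also rejects NaN
        raise argparse.ArgumentTypeError("%r is not between 0 and 100" % arg)
    return value


# Usage with a hypothetical option; an explicit list is parsed instead of sys.argv.
parser = argparse.ArgumentParser()
parser.add_argument("--percent", type=check_pos_float_max100, default=50.0)
args = parser.parse_args(["--percent", "12.5"])
```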
Library of standard methods for manipulation of genetic sequences.
Determines identity of two sequences in an alignment.
Calculates a modified Hamming distance for a gapped alignment, optionally excluding possibly deaminated T>C and A>G as well as indels.
Enabling all three optional parameters will calculate values identical to MALT while the default settings will calculate values identical to PMDtools.
Parameters: |
|
---|---|
Returns: |
|
Example
Showing differences between PMDtools and MALT:
TCCAGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCTTCA
||-|||||||||||||||||||||||||||||||||||||-|x
TC-AGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCT-CG
]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
43 bases
2 gaps
1 total mismatches
1 G>A mismatches (A in read but G in reference)
0 C>T mismatches (T in read but C in reference)
0 other mismatches
40 matches
PMD: 40 / 40 = 100%
identity = 1.0 * matches / (matches + other mismatches)
= 1.0 * 40 / (40 + 0)
= 1.0
MALT: 40 / 43 = 93%
identity = 1.0 * matches / (matches + gaps + total mismatches)
= 1.0 * 40 / (40 + 2 + 1)
= 0.930232
~ 0.93
Thus it is required to include indels, potentially deaminated bases and unknown bases to produce identity values similar to MALT.
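The arithmetic of the example can be reproduced with a short sketch. The function name and the two keyword parameters are hypothetical, handling of unknown bases is left out for brevity, and the deamination direction follows the breakdown above (A in read but G in reference, T in read but C in reference).

```python
def identity(read, ref, include_indels=False, include_deamination=False):
    """Identity of two equally long, gapped alignment strings (sketch)."""
    matches = gaps = deaminated = other = 0
    for read_base, ref_base in zip(read, ref):
        if read_base == "-" or ref_base == "-":
            gaps += 1
        elif read_base == ref_base:
            matches += 1
        elif (ref_base, read_base) in (("G", "A"), ("C", "T")):
            deaminated += 1
        else:
            other += 1
    compared = matches + other
    if include_indels:
        compared += gaps
    if include_deamination:
        compared += deaminated
    if compared == 0:
        raise ValueError("no comparable positions in alignment")
    return matches / compared


READ = "TCCAGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCTTCA"
REF = "TC-AGCAGGTCGATGACCTTGATGCCGGTCTCGAACATCT-CG"

pmd_like = identity(READ, REF)  # 40 / (40 + 0)
malt_like = identity(READ, REF, include_indels=True,
                     include_deamination=True)  # 40 / (40 + 2 + 1)
```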
Returns a base’s complement according to IUPAC or None.
Parameters: | base (str) – Single one-letter IUPAC nucleotide code. |
---|---|
Returns: |
|
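A lookup table is one straightforward way to implement such a complement function. The sketch below uses the standard IUPAC complement pairs; the names and the case-insensitive lookup are assumptions.

```python
# Standard IUPAC complement pairs; the gap symbol maps to itself.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W", "K": "M", "M": "K",
    "B": "V", "V": "B", "D": "H", "H": "D", "N": "N", "-": "-",
}


def complement(base):
    """Return the complement of a one-letter IUPAC code, or None if unknown."""
    return IUPAC_COMPLEMENT.get(base.upper())
```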
Calculate GC content of a nucleotide sequence.
Considers all IUPAC codes using 6x counts to deal with 1/2 and 1/3 probabilities from ambiguity codes.
Parameters: | seq (str) – A nucleotide sequence. |
---|---|
Returns: | GC content as fraction of 1. |
Return type: | float |
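The 6x counting can be sketched as follows: every base contributes six "units", of which 0, 2, 3, 4 or 6 are counted as G or C, so that the 1/2 and 1/3 probabilities of ambiguity codes stay integers. The function name, the treatment of N as 1/2 and the skipping of gaps are assumptions.

```python
# Sixths of each IUPAC code that are G or C (e.g. B = C/G/T gives 2/3 = 4 sixths).
GC_SIXTHS = {
    "G": 6, "C": 6, "S": 6,
    "A": 0, "T": 0, "U": 0, "W": 0,
    "R": 3, "Y": 3, "K": 3, "M": 3, "N": 3,
    "B": 4, "V": 4, "D": 2, "H": 2,
}


def gc_content(seq):
    """GC content as a fraction of 1; symbols outside the table are skipped."""
    counted = [GC_SIXTHS[base] for base in seq.upper() if base in GC_SIXTHS]
    if not counted:
        return 0.0
    return sum(counted) / (6 * len(counted))
```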
Check nucleotide sequence for invalid (non-IUPAC) codes.
Parameters: | seq (str) – Nucleotide sequence. |
---|---|
Returns: | True if sequence contains only IUPAC ambiguity codes, otherwise False. |
Return type: | bool |
Mirror base code into the exact opposite IUPAC ambiguity code.
‘A’ becomes ‘not A’, gap becomes any base, etc.
Parameters: | base (str) – Single one-letter IUPAC nucleotide code. |
---|---|
Returns: |
|
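Mirroring can be expressed as a set complement within {A, C, G, T}. The sketch below is one possible reading of the description; the names are hypothetical and returning None for unknown codes is an assumption.

```python
# Bases represented by each IUPAC code; the gap stands for "no base".
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "-": "",
}
_CODE_BY_SET = {frozenset(bases): code for code, bases in IUPAC_SETS.items()}


def mirror(base):
    """Return the code covering exactly the bases that `base` excludes."""
    bases = IUPAC_SETS.get(base.upper())
    if bases is None:
        return None  # assumption: unknown codes yield None
    return _CODE_BY_SET[frozenset("ACGT") - frozenset(bases)]
```

Under this reading, `mirror("A")` gives `"B"` (C, G or T, i.e. "not A") and `mirror("-")` gives `"N"`.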
Reverses a sequence.
Parameters: | seq (str) – Any sequence. |
---|---|
Returns: | Reverse sequence. |
Return type: | str |
Reverse complement a nucleotide sequence.
Parameters: | seq (str) – A nucleotide sequence of IUPAC ambiguity codes. |
---|---|
Returns: | Reverse complement of the nucleotide sequence. |
Return type: | str |
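For illustration, a reverse complement can be built from a translation table and a reversed slice. This is a sketch under the assumption that input is upper-cased first and that symbols outside the table (such as gaps) pass through unchanged.

```python
# Translation table built from the standard IUPAC complement pairs.
_COMPLEMENT_TABLE = str.maketrans("ACGTURYSWKMBDHVN", "TGCAAYRSWMKVHDBN")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(_COMPLEMENT_TABLE)[::-1]
```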
Annotation of SAM file headers with metadata on applied programs.
Create a @PG header for SAM files.
Follows SAMv1 specification, see https://samtools.github.io/hts-specs/SAMv1.pdf
Parameters: |
|
---|---|
Returns: | Complete standard-compliant @PG header line. |
Return type: | str |
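As a rough illustration of the output, a @PG line consists of tab-separated `TAG:value` fields, with ID required and PN, CL, PP, DS and VN optional in the SAMv1 specification. The function and parameter names below are hypothetical.

```python
def create_pg_header(prog_id, name=None, command_line=None, previous=None,
                     description=None, version=None):
    """Build a tab-separated @PG header line, skipping unset optional tags."""
    fields = [("ID", prog_id), ("PN", name), ("CL", command_line),
              ("PP", previous), ("DS", description), ("VN", version)]
    parts = ["@PG"]
    parts.extend("%s:%s" % (tag, value)
                 for tag, value in fields if value is not None)
    return "\t".join(parts)
```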
Recursively identify the last program entry in the chain.
@PG entries in SAM file headers can be chained by linking each entry to the previously applied program using the optional PP tag. The actual last entry of the chain can only be found under the assumption that all programs are listed within one chain.
Note
May lead to wrong assumptions about the actual order of programs applied to the dataset if the chain of PG entries is interrupted by missing PP tags. Use at your own risk!
Parameters: |
|
---|---|
Returns: | Unique ID of the assumed last entry in the program chain (see note above!) |
Return type: | str |
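The recursion and its weakness can be sketched with a plain dictionary that maps each program ID to the ID in its PP tag (or None). The data structure and the function name are assumptions made for this example.

```python
def last_program(pg_entries, current=None):
    """Follow PP links forward from an entry without predecessor (sketch).

    pg_entries maps program ID -> ID named in its PP tag, or None.
    """
    if not pg_entries:
        return None
    if current is None:
        roots = [pid for pid, pp in pg_entries.items() if pp is None]
        if not roots:
            return None  # no starting point, e.g. a purely cyclic chain
        current = roots[0]
    followers = [pid for pid, pp in pg_entries.items() if pp == current]
    if not followers:
        return current
    return last_program(pg_entries, followers[0])


complete = {"bwa": None, "samtools": "bwa", "sifter": "samtools"}
broken = {"bwa": None, "samtools": "bwa", "sifter": None}  # missing PP tag
```

For `complete` the walk ends at `"sifter"`; for `broken` it ends at `"samtools"` even though `"sifter"` may have run last, which is the situation the note warns about.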
Modify SAM header by inserting a new @PG record into the chain.
Ensures unique program identifiers and integrity of program chain. New entries are always inserted after all current PG entries and their order is not changed in case the entries are not chained by optional PP tags.
Note
Chaining of entries with PP tags is disabled as the chain of programs is often incomplete and may be misleading! The recursion through existing entries may not always be sufficient to identify the current last entry in the chain.
Parameters: |
|
---|---|
Reconstruction of reference sequences from CIGAR and MD tags in SAM files.
Includes tests for various scenarios of clipping, insertion and deletion.
Decompress CIGAR string.
Parameters: | cigar (str) – full CIGAR string (numbers and letters) |
---|---|
Returns: | cigar_ext – decompressed CIGAR string (letters only) |
Return type: | str |
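Decompression here means repeating each operation letter by its count, so that `3M1I2D` becomes `MMMIDD`. A minimal sketch with a regular expression, using a hypothetical function name:

```python
import re

# One count followed by one of the CIGAR operations defined in SAMv1.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decompress_cigar(cigar):
    """Expand a CIGAR string into one operation letter per position."""
    return "".join(op * int(count) for count, op in _CIGAR_OP.findall(cigar))
```

An unavailable CIGAR (`*`) yields an empty string in this sketch.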
Decompress MD string.
Parameters: | md (str) – MD string from SAM file without MD:Z: prefix |
---|---|
Returns: | md_ext – extended MD string without MD:Z: prefix |
Return type: | str |
Reconstruct both the full and the aligned reference sequence.
Uses information from the CIGAR string and the optional MD tag to reconstruct the reference sequence. The full reference sequence including indels, clips and skips and the aligned substring can be used for calculation of identity and post-mortem degradation.
Note
Padded references should work as well, but remain untested. Use at your own risk!
Parameters: | |
---|---|
Returns: |
|
Tests reference reconstruction with 0 delimiter after deletion in MD.
Parameters: | verbose (bool) – Print additional output to STDOUT, defaults to False. |
---|---|
Returns: | True if test was successful, otherwise False. |
Return type: | bool |
Tests reconstruction despite missing 0 delimiter after deletion in MD.
Parameters: | verbose (bool) – Print additional output to STDOUT, defaults to False. |
---|---|
Returns: | True if test was successful, otherwise False. |
Return type: | bool |
Tests reference reconstruction with hard clipped 3’ read.
Parameters: | verbose (bool) – Print additional output to STDOUT, defaults to False. |
---|---|
Returns: | True if test was successful, otherwise False. |
Return type: | bool |
Tests reference reconstruction with hard clipped 5’ read.
Parameters: | verbose (bool) – Print additional output to STDOUT, defaults to False. |
---|---|
Returns: | True if test was successful, otherwise False. |
Return type: | bool |
Tests reference reconstruction with insertion to reference.
Parameters: | verbose (bool) – Print additional output to STDOUT, defaults to False. |
---|---|
Returns: | True if test was successful, otherwise False. |
Return type: | bool |
Tests reference reconstruction with skipped intron.
Parameters: | verbose (bool) – Print additional output to STDOUT, defaults to False. |
---|---|
Returns: | True if test was successful, otherwise False. |
Return type: | bool |
(De-)Serialization of SamSifter workflows.
Bases: builtins.object
(De-)Serializing workflow data for permanent storage.
Uses ElementTree and minidom libraries to write to and read from XML files.
De-serializes workflow from XML file.
Parameters: |
|
---|---|
Returns: | True on success. |
Return type: | bool |
Determines the type of a typed value and returns it as a string.
Helper method for serialization of typed Python values to string representations in XML.
Parameters: | value (bool, str, int, float) – Typed value. |
---|---|
Returns: | String representation of value type. |
Return type: | str |
Create a pretty-printed XML string for an individual tree element.
Parameters: |
|
---|---|
Returns: | Pretty XML string representing the element. |
Return type: | str |
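With the two libraries named above, pretty-printing usually means serializing the element with ElementTree and re-parsing it with minidom. The following sketch assumes a function name and an indentation default that are not taken from the source.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def prettify(element, indent="  "):
    """Return an indented XML string for a single ElementTree element."""
    rough = ET.tostring(element, encoding="unicode")
    return minidom.parseString(rough).toprettyxml(indent=indent)


# Hypothetical element names, for illustration only.
root = ET.Element("workflow")
ET.SubElement(root, "filter", name="example")
pretty = prettify(root)
```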
Serializes workflow to XML file.
Parameters: |
|
---|---|
Returns: | True on success. |
Return type: | bool |
Replaces ‘None’, ‘False’ and ‘True’ strings with actual singletons.
Helper method for deserialization from strings to actual Python singleton entities.
Parameters: | value (str) – Value string from XML file. |
---|---|
Returns: |
|
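A dictionary lookup with the input as fallback is enough to sketch this helper. The function name is hypothetical, and passing other strings through unchanged is an assumption.

```python
_SINGLETONS = {"None": None, "True": True, "False": False}


def string_to_singleton(value):
    """Map 'None', 'True' and 'False' to the singletons; keep other strings."""
    return _SINGLETONS.get(value, value)
```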
Initializes element tree from XML file.
Parameters: | filename (str) – Readable path to existing XML file. |
---|---|
Returns: | Tree representation of XML file contents. |
Return type: | ElementTree |
Write element tree to XML file.
Parameters: |
|
---|---|
Returns: | True on success. |
Return type: | bool |
Converts XML element tree to optionally indented text string.
Parameters: |
|
---|---|
Returns: | XML string representing the tree. |
Return type: | str |
Generate workflow from XML element tree.
Parameters: |
|
---|---|
Returns: |
|
Validation of workflows.
Bases: PyQt4.QtCore.QObject
Validator for SamSifter workflows.
Starts off with empty error lists for input, model and output that can be retrieved individually after running validate(). The individual steps of validation should always be run in the same order.
Lists of error messages can be retrieved individually for each of the three validation steps by calling the corresponding getter methods. They will be reported as one concatenated list when running the full validation.
Basic filtering operations on files.
Filters specific lines from a file.
Prints only those lines that are not contained in the list to STDOUT, skipping the number of leading lines given by offset (default is 0 = no header). Set discard to False for the inverse operation, which prints only the lines contained in the list.
Parameters: |
|
---|---|
Returns: | True on success, False on empty list. |
Return type: | bool |
Raises: |
|
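The filtering logic can be sketched on any iterable of lines. Several details here are assumptions because the parameter list is not shown: the list is taken to hold 0-based line numbers counted after the header, header lines are dropped rather than printed, and the output stream is a parameter so the result can be captured.

```python
import io
import sys


def filter_lines(lines, line_numbers, discard=True, offset=0, out=sys.stdout):
    """Write lines whose number is (not) in line_numbers; False on empty list."""
    if not line_numbers:
        return False
    selected = set(line_numbers)
    for index, line in enumerate(lines):
        if index < offset:
            continue  # header line: neither numbered nor written (assumption)
        listed = (index - offset) in selected
        if listed != discard:
            out.write(line)
    return True


# Capture the output instead of printing to STDOUT.
buffer = io.StringIO()
ok = filter_lines(["header\n", "a\n", "b\n", "c\n"], [1], offset=1, out=buffer)
```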
Emulates grep-like inverse pattern search.
Emulates the behaviour of grep -v -f PATTERNFILE and prints only non-matching lines to STDOUT. Set discard to False for the inverse operation, which prints only lines matching at least one of the patterns.
Parameters: |
|
---|---|
Returns: | True on successful filtering, False on empty pattern list or other error. |
Return type: | bool |
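A rough equivalent of the inverse pattern search can be written with the `re` module. This is a sketch with hypothetical names; note that Python regular expressions differ in syntax from the basic regular expressions grep uses by default, so the two are not interchangeable for every pattern.

```python
import io
import re
import sys


def grep_lines(lines, patterns, discard=True, out=sys.stdout):
    """Write lines matching none (or, with discard=False, any) of the patterns."""
    if not patterns:
        return False
    compiled = [re.compile(pattern) for pattern in patterns]
    for line in lines:
        matched = any(regex.search(line) for regex in compiled)
        if matched != discard:
            out.write(line)
    return True


buffer = io.StringIO()
ok = grep_lines(["read1\tEcoli\n", "read2\tHuman\n"], ["Human"], out=buffer)
```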