samsifter.tools package

Filters, tools and wrappers for the manipulation of SAM files.

This package includes a collection of stand-alone filters and tools as well as wrappers for external tools that process SAM files in a POSIX pipeline context. They can be divided into the following groups:

  • Analyzers do not remove entries from the input file but may alter the contained information by adding additional tags. They can also create statistical summaries or plots based on features of the analyzed input.
  • Converters do not remove entries but may change characteristics of the input file such as the file format, the sort order of the contained reads or the compression. They can be used to adapt the input file to requirements of the following steps in analysis.
  • Filters can remove entries from the input file based on different criteria. They may operate on any of three levels: read, reference and taxon (or a combination of these).

All modules in this package should provide an item() function which returns an instance of samsifter.models.filter.FilterItem representing the tool and its parameters in the SamSifter GUI. This item should also point to either the external command or the entry point of the script/tool/main method that is supposed to be executed. The executable main() may be located within the same module, but this is not required.

New tools or wrappers need to be imported in samsifter.samsifter and also registered in its method samsifter.samsifter.MainWindow.populate_filters() to be selectable from the SamSifter main menu and tools dock. For Python scripts it may be convenient to also add an entry point in setup.py so that the Python installation routine automatically creates executables that work on any of the supported operating systems.
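
As a rough illustration of the entry point mentioned above, the following sketch shows the shape of a setuptools console_scripts declaration. The command name on the left-hand side is made up for this example; only the module path samsifter.tools.filter_read_list and its main() function are taken from this page. The actual setup.py of the project may be organized differently.

```python
# Fragment of a hypothetical setup.py (the command name is an assumption).
ENTRY_POINTS = {
    "console_scripts": [
        # Format: "<command name> = <dotted module path>:<callable>"
        "filter_read_list = samsifter.tools.filter_read_list:main",
    ],
}

# In setup.py this dictionary would be handed to setuptools, for example:
#     from setuptools import setup
#     setup(..., entry_points=ENTRY_POINTS)
```

On installation, setuptools generates a platform-specific launcher for each listed command that imports the module and calls the named function.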

bam_2_sam module

Wrapper for SAMtools view functionality converting BAM to SAM files.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

calculate_pmds module

Wrapper for PMDtools score calculation functionality.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

compress module

Wrapper for GNU Gzip compression functionality.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

count_taxon_reads module

Analyzing filter step to count reads per taxon.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
main()[source]

Analyzing filter step to count reads per taxon.

decompress module

Wrapper for GNU Gzip decompression functionality.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

filter_read_conservation module

Filters highly conserved reads in a SAM file.

Identifies reads assigned to multiple taxa with similar identity. Excludes reads mapping to different accessions/taxa with similar alignment scores.

item()[source]

Create FilterItem for this filter to be used in list and tree models.

Returns:Object representing this filter in item-based models.
Return type:FilterItem
main()[source]

Executable to filter SAM files for reads with high conservation.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.
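
The input/output convention described here is shared by the main() executables on this page. A minimal sketch of such a pipeline-friendly entry point might look as follows. The argument names (-i/--input and infile) and the pass-through keep criterion are assumptions made for illustration and are not taken from the SamSifter sources.

```python
import argparse
import logging
import sys


def build_parser():
    """Build an argument parser with hypothetical argument names."""
    parser = argparse.ArgumentParser(description="SAM pass-through sketch")
    # Hypothetical optional argument naming the input file.
    parser.add_argument("-i", "--input", default=None, help="input SAM file")
    # Hypothetical positional argument; if both are omitted, STDIN is used.
    parser.add_argument("infile", nargs="?", default=None, help="input SAM file")
    return parser


def filter_lines(lines, keep=lambda line: True):
    """Yield SAM header lines (starting with '@') and reads accepted by keep."""
    for line in lines:
        if line.startswith("@") or keep(line):
            yield line


def main(argv=None):
    # Log messages go to STDERR so that STDOUT carries only SAM data.
    logging.basicConfig(stream=sys.stderr, level=logging.INFO)
    args = build_parser().parse_args(argv)
    path = args.input or args.infile
    handle = open(path) if path else sys.stdin
    written = 0
    try:
        for line in filter_lines(handle):
            sys.stdout.write(line)
            written += 1
    finally:
        if handle is not sys.stdin:
            handle.close()
    logging.info("wrote %d lines", written)
```

Calling main() without arguments reads from STDIN, which lets the script sit between two other commands in a POSIX pipeline.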

filter_read_identity module

Wrapper for PMDtools identity filter functionality.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

filter_read_list module

Filters reads by a list of QNAMEs.

Reads are filtered by a list of QNAMEs (read identifiers) given in a tab-separated CSV file.
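
The idea can be sketched in a few lines of Python. This sketch assumes that the QNAMEs sit in the first column of the tab-separated file and exposes a discard switch, because this page does not state whether listed reads are kept or removed. The function names are hypothetical.

```python
import csv


def load_qnames(rows):
    """Collect QNAMEs from the first column of tab-separated rows (assumed layout)."""
    reader = csv.reader(rows, delimiter="\t")
    return {row[0] for row in reader if row}


def filter_by_qname(sam_lines, qnames, discard=True):
    """Yield header lines plus the reads that pass the list filter.

    discard=True drops listed reads; discard=False keeps only listed reads.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # SAM header lines are always passed through.
            yield line
            continue
        # QNAME is the first tab-separated field of a SAM alignment line.
        qname = line.split("\t", 1)[0]
        if (qname in qnames) != discard:
            yield line
```

Holding the QNAMEs in a set keeps each membership test constant-time, so the SAM input can be streamed line by line.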

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
main()[source]

Executable to filter SAM files for a list of reads.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.

filter_read_pmds module

Wrapper for PMDtools ancient read filter functionality.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

filter_ref_coverage module

Identify reference accessions with uneven coverage in MALT’ed SAM files.

Comes with several methods to create optional plots of coverage and read length distributions.

Warning

Activating the plotting of these distributions for a large input dataset can create I/O problems due to the large number of PNG files generated. It will also decrease the performance of this filter considerably and should only be used to troubleshoot filter parameters for small subsets of the data.

calc_avg_depth(depth_dist, ignore_uncovered=True)[source]

Calculate average depth from a coverage distribution.

Optionally ignores uncovered bases (first array element).

Parameters:
  • depth_dist (array_like) – Sorted coverage depth distribution.
  • ignore_uncovered (bool, optional) – Ignore bases with zero coverage (first array element), defaults to True.
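
One way to read this description is sketched below. It assumes that depth_dist is a histogram in which element i holds the number of reference bases covered at depth i, so that the first element counts uncovered bases. That interpretation and the function name are assumptions; the actual implementation may differ.

```python
def average_depth_sketch(depth_dist, ignore_uncovered=True):
    """Mean depth of a histogram where depth_dist[i] = bases covered i times."""
    start = 1 if ignore_uncovered else 0
    # Number of reference bases that enter the average.
    bases = sum(depth_dist[start:])
    if bases == 0:
        return 0.0
    # Total aligned bases; the depth-0 bin contributes nothing to this sum.
    aligned = sum(depth * count for depth, count in enumerate(depth_dist))
    return aligned / bases
```

With depth_dist = [5, 3, 2], seven aligned bases are spread over five covered positions, or over ten positions when uncovered bases are included.
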
covered_length_from_cigar(cigar)[source]

Calculates length of reference covered by read from CIGAR operations.

Note

  • Does not count padding, skipped regions or insertions into the reference.
  • Does not count hard- or soft-clipped bases of the read.
Parameters:cigar (str) – Unmodified CIGAR string from SAM file.
Returns:Length of the reference sequence covered by the read.
Return type:int
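
A minimal sketch of this calculation, following one reading of the note above: only the operations M, D, = and X are summed, while I, N, P, S and H are ignored. Whether the actual implementation treats every operation this way is an assumption, and the function name is hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM format.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations counted in this sketch (assumption based on the note above).
COUNTED_OPS = {"M", "D", "=", "X"}


def covered_length_sketch(cigar):
    """Sum the lengths of all counted CIGAR operations."""
    return sum(
        int(length)
        for length, op in CIGAR_ELEMENT.findall(cigar)
        if op in COUNTED_OPS
    )
```

For the CIGAR string 10M2I5M3D4S this counts 10 + 5 + 3 bases and skips the insertion and the soft clip.
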
get_gini_auc(x, y)[source]

Calculates Gini coefficient and area under Lorenz curve.

The Gini coefficient (also known as the Gini index) is a measure of statistical dispersion. When applied to the distribution of aligned read bases per reference base, an even distribution of reads across the reference should have a low Gini coefficient (towards 0), while an alignment with all reads covering the same reference region should have a high Gini coefficient (towards 1).

Parameters:
  • x (array_like) – X coordinates of a normalized Lorenz distribution.
  • y (array_like) – Y coordinates of a normalized Lorenz distribution.
Returns:

  • float – Gini coefficient. Ranges from 0 for a perfectly even distribution to 1 for a maximally uneven distribution.
  • float – Integral of the Lorenz curve (area under curve). Maximal value is 0.5 if the given distribution is uniform.
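
The relationship between the two return values can be sketched as follows, assuming the area is obtained with the trapezoidal rule and the Gini coefficient is derived as 1 - 2 * area. This is the textbook relation for a normalized Lorenz curve and not necessarily the exact method used in this module; the function name is hypothetical.

```python
def gini_and_auc_sketch(x, y):
    """Return (Gini coefficient, area under curve) for a normalized Lorenz curve."""
    auc = 0.0
    for i in range(1, len(x)):
        # Trapezoid between two neighbouring points of the curve.
        auc += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    # The diagonal of perfect equality encloses an area of 0.5.
    gini = 1.0 - 2.0 * auc
    return gini, auc
```

A curve on the diagonal yields an area of 0.5 and a Gini coefficient of 0. The further the curve sags below the diagonal, the smaller the area and the larger the coefficient.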

integral_discrete(dist, limit)[source]

Integrate discrete distribution with stepsize 1 by adding up values.

Parameters:
  • dist (array_like) – Discrete distribution with stepsize 1.
  • limit (int) – Upper limit (exclusive).
Returns:Integral of the distribution between 0 and upper limit.
Return type:float
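
With a step size of 1, the integral reduces to a plain sum over the values below the limit. A minimal sketch with a hypothetical function name, assuming dist supports slicing:

```python
def integrate_discrete_sketch(dist, limit):
    """Sum the values at indices 0 .. limit - 1 (the limit itself is excluded)."""
    return float(sum(dist[:limit]))
```

For dist = [1, 2, 3, 4] and limit = 3, the values at indices 0 to 2 are added up.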

integral_scaled(dist, limit)[source]

Integrates scaled discrete distribution with arbitrary stepsize.

Parameters:
  • dist (array_like) – Scaled discrete distribution with arbitrary stepsize.
  • limit (float) – Upper limit (exclusive).
Returns:Integral of the distribution between 0 and upper limit.
Return type:float

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
lorenzify(depth_dist)[source]

Calculates Lorenz curve from coverage distribution.

Parameters:depth_dist (array_like) – Coverage depth distribution.
Returns:
  • x (array_like) – X coordinates of Lorenz curve.
  • y (array_like) – Y coordinates of Lorenz curve.
lorenzify_b2b(depth_dist, ignore_uncovered=True)[source]

Calculate Lorenz curve from base2base coverage distribution.

Parameters:
  • depth_dist (array_like) – Coverage depth distribution.
  • ignore_uncovered (bool) – Ignore bases with zero coverage, defaults to True.
Returns:

  • x (array_like) – X coordinates of Lorenz curve.
  • y (array_like) – Y coordinates of Lorenz curve.
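
A Lorenz curve can be built generically by sorting the values, accumulating them and normalizing both axes to the range 0 to 1. The sketch below assumes that depth_dist holds one depth value per reference base, which is only one possible reading of "base2base". The function name and the handling of empty input are likewise assumptions.

```python
def lorenz_curve_sketch(depth_dist, ignore_uncovered=True):
    """Return normalized (x, y) Lorenz coordinates for per-base depth values."""
    values = sorted(d for d in depth_dist if d > 0 or not ignore_uncovered)
    total = sum(values)
    if not values or total == 0:
        # Arbitrary choice for this sketch: fall back to the diagonal.
        return [0.0, 1.0], [0.0, 1.0]
    x, y, running = [0.0], [0.0], 0
    for i, depth in enumerate(values, start=1):
        running += depth
        x.append(i / len(values))  # cumulative share of bases
        y.append(running / total)  # cumulative share of aligned bases
    return x, y
```

The resulting coordinates start at (0, 0) and end at (1, 1), which is the normalized form expected by get_gini_auc(x, y) above.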

main()[source]

Executable to filter SAM files for references with uneven coverage.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.

plot_ccd(ax, depth_dist_cumsum, avg_depth)[source]

Creates a bar plot of a cumulative coverage distribution.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • depth_dist_cumsum (array_like) – Cumulative coverage depth distribution.
  • avg_depth (float) – Mean coverage.
plot_cd(ax, depth_dist, avg_depth)[source]

Creates a bar plot of a coverage distribution.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • depth_dist (array_like) – Coverage depth distribution.
  • avg_depth (float) – Mean coverage.
plot_lorenz(ax, x, y)[source]

Creates a Lorenz curve plot of a coverage distribution.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • x (array_like) – Coverage bins.
  • y (array_like) – Aligned bases per coverage bin.
plot_lorenz_b2b(ax, x, y)[source]

Creates a Lorenz curve plot of a base2base coverage distribution.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • x (array_like) – Reference base positions.
  • y (array_like) – Aligned bases per reference base.
plot_nccd(ax, depth_dist_cumperc, avg_depth, max_depth, avg_depth_scaled, lci, color)[source]

Creates a bar plot of a normalized cumulative coverage distribution.

Includes legend stating average scaled and total depth.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • depth_dist_cumperc (array_like) – Scaled cumulative coverage depth distribution.
  • avg_depth (float) – Mean coverage.
  • max_depth (int) – Maximum coverage depth.
  • avg_depth_scaled (float) – Scaled mean coverage.
  • lci (float) – LCI parameter.
  • color (str) – Color name to be used for the bar plot.
plot_rld(ax, length_dist, min_length, max_length, read_count, length_dist_total)[source]

Creates a bar plot of a read length distribution.

Includes scaled expected distribution based on all reads in file.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • length_dist (array_like) – Read length distribution of subset of reads.
  • min_length (int) – Minimum read length.
  • max_length (int) – Maximum read length.
  • read_count (int) – Total number of reads.
  • length_dist_total (array_like) – Read length distribution of all reads.
plot_sccd(ax, depth_dist_cumperc, avg_depth)[source]

Creates a bar plot of a scaled cumulative coverage distribution.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • depth_dist_cumperc (array_like) – Scaled cumulative coverage depth distribution.
  • avg_depth (float) – Mean coverage.
plot_scd(ax, depth_dist, avg_depth, ref_length, max_depth)[source]

Creates a bar plot of a scaled coverage distribution.

Parameters:
  • ax (Axes) – Axes instance of current plot.
  • depth_dist (array_like) – Coverage depth distribution.
  • avg_depth (float) – Mean coverage.
  • ref_length (int) – Length of reference in nucleotides.
  • max_depth (int) – Maximum coverage depth.

filter_ref_identity module

Filters references by identity values of assigned reads.

This filter processes reference accessions with too few or too many reads of high or low percent identity in MALT’ed SAM files.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
main()[source]

Executable to filter SAM files for references with low identity reads.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.

filter_ref_list module

Filter references by a list of accessions.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
main()[source]

Executable to filter SAM files for a list of references.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.

filter_ref_pmds module

Filter references with high attribution of ancient reads in a MALT’ed and PMD’ed SAM file.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
main()[source]

Executable to filter SAM files for references with ancient reads.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.

filter_taxon_list module

Filter SAM files for a list of taxon IDs.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
main()[source]

Executable to filter SAM files for a list of taxon IDs.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.

filter_taxon_pmds module

Filter taxa with high attribution of ancient reads in a MALT’ed and PMD’ed SAM file.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem
main()[source]

Executable to filter SAM files for taxa with ancient reads.

See --help for details on expected arguments. Takes input from STDIN or from optional or positional arguments. Logs messages to STDERR and writes processed SAM files to STDOUT.

remove_duplicates module

Wrapper for SAMtools rmdup functionality to remove duplicate reads.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

sam_2_bam module

Wrapper for SAMtools view functionality to convert SAM to BAM files.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

sort_by_coordinates module

Wrapper for SAMtools sort functionality for sorting reads by coordinates.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem

sort_by_names module

Wrapper for SAMtools sort functionality for sorting reads by queryname.

item()[source]

Create item representing this tool in list and tree views.

Returns:Item for use in item-based list and tree views.
Return type:FilterItem