scikit-bio

A community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.

For Researchers

Robust, performant and scalable algorithms tailored for the vast landscape of biological data analysis spanning genomics, microbiomics, ecology, evolutionary biology and more. Built to unveil the insights hidden in complex, multi-omic data.

Example

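import pandas as pd
from skbio.tree import TreeNode
from skbio.diversity import beta_diversity
from skbio.stats.ordination import pcoa

# Load the feature table (rows = samples, columns = features),
# the sample metadata and the phylogenetic tree.
data = pd.read_table('data.tsv', index_col=0)
metadata = pd.read_table('metadata.tsv', index_col=0)
tree = TreeNode.read('tree.nwk')

# Compute pairwise weighted UniFrac distances between samples.
bdiv = beta_diversity(
    'weighted_unifrac', data, ids=data.index, otu_ids=data.columns, tree=tree
)

# Ordinate with PCoA, keeping three axes, and color samples by body site.
ordi = pcoa(bdiv, number_of_dimensions=3)
ordi.plot(metadata, column='bodysite')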
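To make the idea of a sample-by-sample distance matrix concrete, here is a minimal standard-library sketch that computes pairwise Bray-Curtis dissimilarity. Bray-Curtis is swapped in for weighted UniFrac because it needs no tree. The names `bray_curtis` and `pairwise_distances`, the dictionary layout and the toy counts are illustrative assumptions, not scikit-bio's implementation.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("count vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("at least one sample must contain counts")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def pairwise_distances(table):
    """Return a symmetric {(id1, id2): distance} mapping for all sample pairs.

    `table` maps a sample ID to its list of feature counts
    (a hypothetical layout chosen for this sketch).
    """
    dists = {(sid, sid): 0.0 for sid in table}
    for a, b in combinations(table, 2):
        d = bray_curtis(table[a], table[b])
        dists[(a, b)] = dists[(b, a)] = d
    return dists


# Toy counts for three samples over four features (made-up values).
toy_table = {
    "gut": [10, 0, 5, 5],
    "skin": [0, 10, 5, 5],
    "tongue": [10, 0, 5, 5],
}
distances = pairwise_distances(toy_table)
```

Identical samples get a distance of 0 and samples sharing no counts get 1, which is the kind of matrix an ordination method such as PCoA takes as input.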

For Educators

Fundamental bioinformatics algorithms enriched by comprehensive documentation, examples and references, offering a rich resource for classroom and laboratory education (with proven success). Designed to spark curiosity and foster innovation.

Example

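from skbio.alignment import global_pairwise_align_protein
from skbio.sequence.distance import hamming
from skbio.stats.distance import DistanceMatrix
from skbio.tree import nj

# `seqs` (protein sequences) and `ids` (their labels) are assumed to be
# defined beforehand; they are not created in this snippet.

def align_dist(seq1, seq2):
    # Globally align the pair, then compare the two aligned rows.
    aln = global_pairwise_align_protein(seq1, seq2)[0]
    return hamming(aln[0], aln[1])

# Build a distance matrix by applying align_dist to sequence pairs.
dm = DistanceMatrix.from_iterable(
    seqs, align_dist, keys=ids, validate=False
)

# Infer a neighbor-joining tree, root it at the midpoint and print it.
tree = nj(dm).root_at_midpoint()
print(tree.ascii_art())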

          /-chicken
         |
-root----|                    /-rat
         |          /--------|
         |         |          \-mouse
          \--------|
                   |          /-pig
                   |         |
                    \--------|                    /-chimp
                             |          /--------|
                              \--------|          \-human
                                       |
                                        \-monkey
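For classroom use, the distance step in `align_dist` can be unpacked by hand. The sketch below assumes the distance is the proportion of aligned positions that differ, which is how scikit-bio's `hamming` is documented to behave by default. The function name `hamming_proportion` and the two sequence fragments are made up for illustration.

```python
def hamming_proportion(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    mismatches = sum(a != b for a, b in zip(seq1, seq2))
    return mismatches / len(seq1)


# Two made-up aligned protein fragments; '-' marks an alignment gap.
# They differ at position 4 ('-' vs 'Q') and position 6 ('A' vs 'G').
d = hamming_proportion("MKV-LA", "MKVQLG")
```

In this sketch a gap is treated like any other character, so a gap opposite a residue counts as a mismatch.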

For Developers

Industry-standard, production-ready Python codebase featuring a stable, unit-tested API that streamlines development and integration. Released under the 3-Clause BSD license, it provides an expansive platform for both academic research and commercial ventures.

Example

def centralize(mat):
    r"""Center data around its geometric average.

    Parameters
    ----------
    mat : array_like, float
        a matrix of proportions where
        rows = compositions and
        columns = components

    Returns
    -------
    numpy.ndarray
        centered composition matrix

    Examples
    --------
    >>> import numpy as np
    >>> from skbio.stats.composition import centralize
    >>> X = np.array([[.1,.3,.4, .2],[.2,.2,.2,.4]])
    >>> centralize(X)
    array([[ 0.17445763,  0.30216948,  0.34891526,  0.17445763],
           [ 0.32495488,  0.18761279,  0.16247744,  0.32495488]])

    """
    mat = closure(mat)
    cen = scipy.stats.gmean(mat, axis=0)
    return perturb_inv(mat, cen)
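The three calls in the function body can be traced with a standard-library sketch. It assumes that `closure` rescales each row to sum to 1 and that `perturb_inv` divides element-wise by the center and then re-closes. The helper names `column_gmeans` and `centralize_sketch` are hypothetical, and this is not the library's code.

```python
import math


def closure(mat):
    """Rescale each row so that it sums to 1 (assumed behavior of closure)."""
    return [[x / sum(row) for x in row] for row in mat]


def column_gmeans(mat):
    """Geometric mean of every column, computed in log space."""
    n_rows = len(mat)
    return [
        math.exp(sum(math.log(x) for x in col) / n_rows)
        for col in zip(*mat)
    ]


def centralize_sketch(mat):
    """Divide each row by the column geometric means, then re-close."""
    closed = closure(mat)
    center = column_gmeans(closed)
    # Assumed behavior of perturb_inv: element-wise division plus closure.
    return closure([[x / c for x, c in zip(row, center)] for row in closed])


# Same input as the docstring example above.
X = [[0.1, 0.3, 0.4, 0.2], [0.2, 0.2, 0.2, 0.4]]
centered = centralize_sketch(X)
```

Run on the docstring's input, the sketch should agree with the printed array to the displayed precision. Every part must be strictly positive, since the geometric mean takes logarithms.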

Install

Conda

conda install -c conda-forge scikit-bio

PyPI

pip install scikit-bio

Dev

pip install git+https://github.com/scikit-bio/scikit-bio.git

See detailed instructions on installing scikit-bio on various platforms.

News

Latest release: scikit-bio 0.6.0

New DOE award for scikit-bio development in multi-omics and complex modeling.

Upcoming scikit-bio workshop at ISMB 2024 on July 11 in Montreal, Canada. Everyone is welcome to join.

The new website (scikit.bio) and the new organization (scikit-bio) are now online.

Feature Highlights

Biological sequences: Efficient data structure with a flexible grammar for easy manipulation, annotation, alignment, and conversion into motifs or k-mers for in-depth analysis.

Phylogenetic trees: Scalable tree structure tailored for evolutionary biology, supporting diverse operations in navigation, manipulation, comparison, and construction.

Community diversity analysis for ecological studies, with an extensive suite of metrics such as UniFrac and PD, optimized to handle large-scale community datasets.

Ordination methods, such as PCoA, CA, and RDA, to uncover patterns underlying high-dimensional data, facilitating insightful visualization.

Multivariate statistical tests, such as PERMANOVA, BIOENV, and Mantel, to decode complex relationships across data matrices and sample properties.

Compositional data processing and analysis, such as CLR transform and ANCOM, built for various omic data types from high-throughput experiments.
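As a small illustration of the compositional item above, the centered log-ratio (CLR) transform takes the logarithm of each part and subtracts the mean of those logarithms. The sketch below uses only the standard library. The function name `clr` mirrors the transform's usual abbreviation and the input values are made up; it is not scikit-bio's implementation.

```python
import math


def clr(composition):
    """Centered log-ratio: log of each part minus the mean log of all parts."""
    if any(x <= 0 for x in composition):
        raise ValueError("CLR is defined only for strictly positive parts")
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# A made-up three-part composition.
transformed = clr([0.2, 0.3, 0.5])
```

The transformed values sum to zero, and rescaling the input (for example, raw counts instead of proportions) leaves the result unchanged. Zeros must be replaced or removed first, because the logarithm of zero is undefined.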