#bioinformatics #machine-learning #embedding #dim-reduction

no-std cyanea-ml

ML primitives for bio (embeddings, etc.) for the Cyanea bioinformatics ecosystem

1 unstable release

0.1.0 Feb 21, 2026

#465 in Biology


Used in 2 crates

Apache-2.0

415KB
10K SLoC

cyanea-ml

Machine learning primitives for bioinformatics: clustering, classification, dimensionality reduction, and sequence encoding.

269 unit tests + 4 doc tests.

What's Inside

  • Clustering -- K-means, DBSCAN, hierarchical (single/complete/average/Ward linkage)
  • Distance metrics -- Euclidean, Manhattan, cosine, Hamming; pairwise distance matrices
  • KNN -- k-nearest neighbors classification and regression
  • Linear regression -- ordinary least squares via normal equation
  • Decision trees -- CART classifier with Gini impurity
  • Random forests -- bagged ensemble with bootstrap sampling and feature bagging
  • Gradient boosted trees -- regression, binary/multiclass classification, early stopping, feature importance
  • Feature selection -- variance threshold, mutual information, recursive feature elimination, Lasso L1
  • Hidden Markov models -- forward, backward, Viterbi, Baum-Welch EM
  • PCA -- principal component analysis (with optional BLAS acceleration)
  • t-SNE -- t-distributed stochastic neighbor embedding
  • UMAP -- uniform manifold approximation and projection
  • Sequence encoding -- one-hot and label encoding for DNA/RNA/protein
  • K-mer counting -- frequency vectors for ML features
  • Sequence embeddings -- k-mer frequency and composition vector embeddings
  • Classification metrics -- confusion matrix, precision/recall/F1, ROC/PR curves, AUC, MCC
  • Cross-validation -- k-fold, stratified k-fold, leave-one-out
  • Normalization -- min-max, z-score, L2 (sample-wise and column-wise)
  • Evaluation -- silhouette score for cluster quality

Quick Start

[dependencies]
cyanea-ml = "0.1"
use cyanea_ml::{
    cluster::{kmeans, KMeansConfig},
    distance::pairwise_distances,
    distance::DistanceMetric,
};

let data = vec![1.0, 2.0, 10.0, 11.0, 20.0, 21.0]; // 3 points, 2 features
let config = KMeansConfig { k: 2, max_iterations: 100, tolerance: 1e-4 };
let result = kmeans(&data, config).unwrap();
println!("Labels: {:?}", result.labels);

Feature Flags

Flag Default Description
std Yes Standard library support
wasm No WASM target marker
serde No Serialize/Deserialize derives
parallel No Rayon parallelism
blas No BLAS-accelerated PCA via ndarray

Modules

Module Description
cluster K-means, DBSCAN, hierarchical clustering
distance Distance metrics and pairwise distance matrices
encoding One-hot and label encoding for sequences
kmer K-mer counting and frequency vectors
embedding K-mer frequency and composition vector embeddings
normalize Min-max, z-score, L2 normalization
inference KNN classification/regression, linear regression
tree Decision tree classifier
forest Random forest classifier
gbdt Gradient boosted decision trees
feature_selection Variance threshold, MI, RFE, Lasso
hmm Hidden Markov Model (forward/backward/Viterbi/Baum-Welch)
reduction PCA, t-SNE
umap UMAP dimensionality reduction
metrics Confusion matrix, ROC/PR curves, F1, MCC
cross_validation K-fold, stratified k-fold, leave-one-out
evaluate Silhouette score

See Also

Dependencies

~0.7–2.1MB
~46K SLoC