A binary classifier that identifies whether a single input protein sequence is a CRISPR-associated (Cas) protein or not.
HELI-Cas accepts one protein sequence (50–2000 aa) in FASTA format and returns a Cas / Non-Cas prediction with a mean confidence score and a per-fold vote count. It is intended as a first-pass screen for candidate Cas proteins in hypothetical protein sets or newly assembled genomes. It does not assign a Cas class, type, or subtype (Class 1/2, Type I–VI); subtyping is out of scope.
A fixed pipeline extracts multi-modal features: ProFeatX / iFeature descriptors, Biopython physicochemical indices, PSI-BLAST PSSM profiles (UniRef50), SpineX secondary structure and solvent accessibility, flDPnn disorder, and pyhmmer scans against Pfam-A and PGAP/TIGRFAMs. It reduces them to a 72-feature Boruta-selected set and scores the sequence with a 5-fold stacked ensemble (CatBoost + LightGBM per fold). The final label is decided by majority vote (≥ 3/5 folds → Cas); the reported probability is the mean of the five fold probabilities. See the Documentation page for full pipeline details.
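The voting rule can be sketched in a few lines of Python. This is a minimal illustration, not the HELI-Cas implementation: the function name `aggregate_folds`, the returned keys, and the per-fold decision threshold of 0.5 are assumptions, since the description above gives only the 3-of-5 majority rule and the mean of the five fold probabilities.

```python
from statistics import mean


def aggregate_folds(fold_probs, threshold=0.5, min_votes=3):
    """Combine five per-fold Cas probabilities into label, mean probability, and votes.

    `threshold` (per-fold cut-off for a Cas vote) is an assumed value;
    `min_votes=3` follows the stated majority rule (>= 3/5 folds -> Cas).
    """
    if len(fold_probs) != 5:
        raise ValueError("expected exactly five fold probabilities")
    # Each fold casts one vote for Cas if its probability reaches the threshold.
    votes = sum(p >= threshold for p in fold_probs)
    label = "Cas" if votes >= min_votes else "Non-Cas"
    return {"label": label, "mean_probability": mean(fold_probs), "votes": votes}


if __name__ == "__main__":
    print(aggregate_folds([0.55, 0.55, 0.55, 0.05, 0.05]))
```

Because the label comes from the vote and the score comes from the mean, the two can point in different directions. Under the assumed 0.5 threshold, the example input gets three Cas votes and is labelled Cas, while its mean probability is about 0.35. Reading the vote count alongside the probability helps flag such borderline calls.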
Input: 1 FASTA record, 50–2000 amino acids. Max file size 10 KB.
Output: Cas / Non-Cas label, mean probability, fold-vote count (0–5). Exportable as CSV, JSON, or PDF.
Runtime: 3–6 minutes for a short sequence on CPU; longer for sequences near the 2000 aa ceiling. Worker timeout: 2 hours.
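Since a run takes minutes, it may be worth checking a file against the input limits before submitting it. The sketch below is a hypothetical pre-check, not part of HELI-Cas: the function name `check_fasta`, the reading of "10 KB" as 10 × 1024 bytes, and the restriction to the 20 standard residues are all assumptions. The service may apply different rules.

```python
MAX_BYTES = 10 * 1024          # assumption: "10 KB" interpreted as 10 KiB
MIN_LEN, MAX_LEN = 50, 2000    # sequence length limits stated for the input
# Assumption: only the 20 standard amino acids; the service may accept others.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def check_fasta(text):
    """Return the sequence if `text` is one FASTA record within the limits.

    Raises ValueError with a short reason otherwise.
    """
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("file exceeds 10 KB")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("missing FASTA header line")
    if sum(ln.startswith(">") for ln in lines) != 1:
        raise ValueError("expected exactly one FASTA record")
    # Join wrapped sequence lines and normalise case.
    seq = "".join(lines[1:]).upper()
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        raise ValueError(f"sequence length {len(seq)} outside {MIN_LEN}-{MAX_LEN} aa")
    unexpected = set(seq) - STANDARD_AA
    if unexpected:
        raise ValueError(f"unexpected residue codes: {sorted(unexpected)}")
    return seq


if __name__ == "__main__":
    example = ">candidate_1\n" + "MKT" * 30 + "\n"
    print(len(check_fasta(example)), "aa, passes the pre-check")
```

Multi-record files are rejected outright rather than truncated to the first record, so a batch has to be split into one file per sequence before submission.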
HELI-Cas: Advanced Machine Learning Framework for Cas Protein Classification
Author Name et al. (2024)
Journal of Bioinformatics, Volume XX, Issue X