How HELI-Cas works and how to use it — API, pipeline, and result interpretation.
HELI-Cas is a web service that predicts whether a protein sequence is a Cas (CRISPR-associated) protein or Non-Cas. It combines profile-based homology, structural descriptors, and physicochemical features, and scores them with a stacked gradient-boosting ensemble to produce a binary Cas / Non-Cas label.
Put one or more protein sequences in standard FASTA. Each entry must begin with a > header line that carries a unique identifier, followed by one or more lines of single-letter amino acids.
On the Analysis page, either paste the FASTA into the text box or upload a .fasta / .fa / .txt file and press Predict. The API returns a job_id immediately.
The pipeline runs BLAST-based PSSM generation, SpineX, ProFeatX, iFeature, flDPnn, and HMM scans under the hood. The front-end polls /api/status/{job_id} until the job reports completed.
Each sequence gets a label (Cas/Non-Cas), a probability averaged across folds, and a fold-vote count (0–5). Export as CSV, JSON, or a PDF report from the results panel.
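The polling step above can be sketched as a small client-side loop. This is a minimal illustration, not the front-end's actual code: only the `/api/status/{job_id}` path and the `completed` state come from the description above, while the base URL, the `status` JSON key, and the interval and timeout values are assumptions.

```python
import json
import time
import urllib.request


def fetch_status(base_url: str, job_id: str) -> dict:
    """GET {base_url}/api/status/{job_id} and decode the JSON body.

    base_url is a placeholder; the real host is not given in this document.
    """
    url = f"{base_url}/api/status/{job_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, get_status, interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Call get_status(job_id) repeatedly until the job reports completed.

    Assumes the payload carries a "status" key (hypothetical field name).
    Raises TimeoutError if the job has not completed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = get_status(job_id)
        if payload.get("status") == "completed":
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not complete in {timeout} s")
        sleep(interval)
```

To use it against a live deployment, bind the host first, for example `wait_for_job(job_id, functools.partial(fetch_status, "https://heli-cas.example"))`, where the URL is a stand-in. Passing the fetcher as an argument keeps the loop testable without network access.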
Standard FASTA with single-letter amino-acid sequences. Exactly one entry per submission. Sequence length must be between 50 and 2000 amino acids (inclusive).
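A submission can be checked locally against these limits before upload. The sketch below is an assumption about how such a check might look, not the service's own validator: the function names are hypothetical, the identifier is taken as the first token of the header, and the accepted alphabet is the 20 standard residues plus B, J, O, U, X, and Z.

```python
STANDARD = "ACDEFGHIKLMNPQRSTVWY"
ALLOWED = set(STANDARD + "BJOUXZ")  # non-standard codes are accepted too


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (identifier, sequence) pairs; the identifier is the first header token."""
    records, seq_id, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("header line has no identifier")
            seq_id, chunks = tokens[0], []
        elif seq_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def validate_submission(text, min_len=50, max_len=2000, max_entries=1):
    """Raise ValueError if the FASTA text breaks the stated limits."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA entries found")
    if len(records) > max_entries:
        raise ValueError(f"{len(records)} entries, at most {max_entries} allowed")
    ids = [seq_id for seq_id, _ in records]
    if len(set(ids)) != len(ids):
        raise ValueError("identifiers must be unique")
    for seq_id, seq in records:
        if not min_len <= len(seq) <= max_len:
            raise ValueError(f"{seq_id}: length {len(seq)} outside {min_len}-{max_len}")
        bad = set(seq) - ALLOWED
        if bad:
            raise ValueError(f"{seq_id}: unexpected characters {sorted(bad)}")
    return records
```

The limits are keyword arguments so the same check can be reused if the entry or length limits differ for a given deployment.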
Uploaded files must use a .fasta, .fa, or .txt extension, start with a > header, and satisfy the length and count limits above.
Non-standard or ambiguous residue codes (B, J, O, U, X, Z) are accepted, but feature extractors may handle them conservatively.
No tunable parameters are exposed in the public UI; the pipeline runs a fixed, curated feature set designed for accuracy. For reference, these are the stages that execute for every submitted job.
13 ProFeatX / iFeature descriptor modules run inside major_tools.sif. PSI-BLAST against UniRef50 produces PSSM profiles; SpineX computes secondary structure and solvent accessibility; flDPnn (via fldpnn.sif) adds disorder features.
BioPython's ProteinAnalysis contributes 14 physico_* features — molecular weight, isoelectric point, GRAVY, instability and aromaticity indices, secondary-structure fractions, and net charge at pH 7.
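To make two of these descriptors concrete, the sketch below recomputes GRAVY and aromaticity in plain Python. It is an illustration of what the quantities measure, not the pipeline's code: the pipeline is described as using BioPython's `ProteinAnalysis`, which offers `gravy()` and `aromaticity()` methods, whereas the function names here and the choice to skip non-standard residues in the GRAVY average are assumptions.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Residues outside the scale (e.g. X) are skipped, which is an assumption
    of this sketch rather than documented pipeline behavior.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not scores:
        raise ValueError("sequence has no standard residues")
    return sum(scores) / len(scores)


def aromaticity(seq: str) -> float:
    """Fraction of residues that are Phe, Trp, or Tyr."""
    if not seq:
        raise ValueError("empty sequence")
    upper = seq.upper()
    return sum(upper.count(aa) for aa in "FWY") / len(upper)
```

A positive GRAVY value indicates a more hydrophobic sequence on average, a negative one a more hydrophilic sequence.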
pyhmmer 0.12 scans each sequence against Pfam-A and PGAP/TIGRFAMs. Five global features are kept: total hits, unique families, coverage ratio, Cas-specificity score, and log₁₀ of the best Cas E-value.
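The reduction from a list of domain hits to five global numbers might look like the sketch below. The exact definitions used by HELI-Cas are not given here, so everything in it is an assumption made for illustration: the `Hit` record, coverage as the fraction of residues covered by at least one hit, Cas-specificity as the fraction of hits flagged as Cas families, the E-value floor, and the value returned when no Cas hit exists.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One HMM hit on a sequence (hypothetical record, not a pyhmmer type)."""
    family: str
    evalue: float
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    is_cas: bool    # whether the family is on a Cas family list


def summarize_hits(hits, seq_len, evalue_floor=1e-300, no_cas_hit_value=0.0):
    """Collapse per-hit results into five sequence-level features."""
    covered = set()
    for hit in hits:
        covered.update(range(hit.start, hit.end + 1))
    cas_hits = [hit for hit in hits if hit.is_cas]
    best = min((hit.evalue for hit in cas_hits), default=None)
    return {
        "total_hits": len(hits),
        "unique_families": len({hit.family for hit in hits}),
        "coverage_ratio": len(covered) / seq_len,
        "cas_specificity": len(cas_hits) / len(hits) if hits else 0.0,
        # Floor the E-value so that an E-value of 0 does not break log10.
        "log10_best_cas_evalue": (
            math.log10(max(best, evalue_floor)) if best is not None else no_cas_hit_value
        ),
    }
```

Taking the logarithm compresses E-values that span hundreds of orders of magnitude into a range a tree ensemble can split on easily.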
72 features from the union of all folds (selected via Boruta during training) are assembled per sequence, with robust handling of ProFeatX / iFeature ID mismatches.
Five stacked ensembles (fold_0.joblib … fold_4.joblib, CatBoost + LightGBM per fold) are loaded and scored. Each fold applies its own calibrated threshold; the final label is decided by majority vote (≥ 3/5 folds → Cas). The reported probability is the mean of the five fold probabilities.
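The voting rule can be written out in a few lines. This is a sketch of the decision logic as described above, with hypothetical names and made-up threshold values; treating a probability equal to the threshold as a Cas vote is an assumption, and loading the `fold_*.joblib` models is left out.

```python
from statistics import fmean


def ensemble_call(fold_probs, fold_thresholds, min_votes=3):
    """Combine per-fold probabilities into a label, mean probability, and vote count."""
    if len(fold_probs) != len(fold_thresholds):
        raise ValueError("need one threshold per fold")
    # Each fold votes Cas when its probability reaches its own threshold.
    votes = sum(p >= t for p, t in zip(fold_probs, fold_thresholds))
    return {
        "prediction": "Cas" if votes >= min_votes else "Non-Cas",
        "probability": fmean(fold_probs),
        "fold_votes": votes,
    }


# Two confident folds pull the mean above 0.5, but only 2 of 5 folds vote Cas.
example = ensemble_call([0.9, 0.8, 0.4, 0.6, 0.2], [0.5, 0.5, 0.5, 0.7, 0.5])
```

Because the label comes from the votes and the probability from the mean, the two can disagree: in the example the mean is above 0.5 while the majority vote yields Non-Cas.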
| Field | Type | Meaning |
|---|---|---|
| sequence_id | string | The first token of the FASTA header line for this sequence. |
| prediction | "Cas" \| "Non-Cas" | Final label from majority vote across the five folds. |
| probability | float ∈ [0, 1] | Mean of the five per-fold probabilities. Not a calibrated confidence per se, but monotonic with ensemble agreement. |
| fold_votes | integer 0–5 | Number of folds that individually called the sequence Cas at their per-fold threshold. ≥ 3 ⇒ prediction = "Cas". |
On job completion the API also returns n_sequences, n_cas, and n_noncas, mirroring the aggregate counts shown in the results panel.
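As a small illustration of how the per-sequence fields relate to the aggregate counts, the sketch below recomputes them from a list of result records and checks each record against the vote rule. The field names come from the table above; representing the results as a list of dictionaries, and both function names, are assumptions about the payload shape.

```python
from collections import Counter


def summarize_results(results):
    """Recompute n_sequences, n_cas, and n_noncas from per-sequence records."""
    counts = Counter(record["prediction"] for record in results)
    return {
        "n_sequences": len(results),
        "n_cas": counts["Cas"],
        "n_noncas": counts["Non-Cas"],
    }


def is_consistent(record, min_votes=3):
    """Check that the label agrees with the fold-vote count (>= 3 of 5 means Cas)."""
    expected = "Cas" if record["fold_votes"] >= min_votes else "Non-Cas"
    return record["prediction"] == expected


# Made-up records for demonstration only; these are not real predictions.
demo = [
    {"sequence_id": "seqA", "prediction": "Cas", "probability": 0.91, "fold_votes": 5},
    {"sequence_id": "seqB", "prediction": "Non-Cas", "probability": 0.12, "fold_votes": 0},
    {"sequence_id": "seqC", "prediction": "Cas", "probability": 0.55, "fold_votes": 3},
]
summary = summarize_results(demo)
```

Comparing the recomputed summary with the counts returned by the API is a cheap sanity check after exporting results to CSV or JSON.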