Documentation

How HELI-Cas works and how to use it — API, pipeline, and result interpretation.

Overview

HELI-Cas is a web service that predicts whether a protein sequence is a Cas (CRISPR-associated) protein or Non-Cas. It feeds profile-based homology, structural descriptors, and physicochemical features into a stacked gradient-boosting ensemble to produce a production-grade binary classification.

At a glance
  • Binary classifier: Cas vs. Non-Cas
  • 5-fold stacked ensemble (CatBoost + LightGBM per fold)
  • 72 Boruta-selected features spanning sequence, profile, and HMM signals
  • Validation macro-F1 0.961 ± 0.012 (5-fold CV)
  • Asynchronous job API with FASTA upload and JSON body support
  • Per-sequence predictions; each submission carries exactly one sequence (see Input Formats)
Scope: The model distinguishes Cas from Non-Cas proteins. It does not assign a Cas subtype (e.g., Cas9 vs. Cas12 vs. Cas13). For subtype-level analysis, use the per-sequence HMM evidence exposed in the pipeline and consult dedicated CRISPR subtype annotators.

Getting Started

Step 1 — Prepare a FASTA

Put your protein sequence in standard FASTA. The entry must begin with a > header line that carries a unique identifier, followed by one or more lines of single-letter amino acids. Each submission takes exactly one entry (see Input Formats).

Step 2 — Submit

On the Analysis page, either paste the FASTA into the text box or upload a .fasta / .fa / .txt file and press Predict. The API returns a job_id immediately.

Step 3 — Wait for processing

The pipeline runs BLAST-based PSSM generation, SpineX, ProFeatX, iFeature, flDPnn, and HMM scans under the hood. The front-end polls /api/status/{job_id} until the job reports completed.
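
The polling step can be sketched as a small client loop. This is a minimal illustration, not the front-end's actual code: only the /api/status/{job_id} path and the completed state come from this page. The host in BASE_URL, the "status" key in the JSON body, the failure state names, and the 15-second interval are all assumptions.

```python
import json
import time
import urllib.request

BASE_URL = "https://heli-cas.example.org"  # placeholder host (assumption)


def fetch_status(job_id: str) -> dict:
    """GET /api/status/{job_id} and decode the JSON body."""
    url = f"{BASE_URL}/api/status/{job_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_status, interval_s=15.0,
                 timeout_s=2 * 60 * 60, sleep=time.sleep):
    """Poll until the job reports completed, then return the last payload.

    timeout_s defaults to the 2-hour worker timeout. The "status" key and
    the failure state names below are assumed, not documented.
    """
    waited = 0.0
    while True:
        payload = fetch(job_id)
        state = payload.get("status")
        if state == "completed":
            return payload
        if state in ("failed", "error"):  # assumed failure states
            raise RuntimeError(f"job {job_id} ended with status {state!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} not completed after {timeout_s} s")
        sleep(interval_s)
        waited += interval_s
```

Passing fetch and sleep as parameters keeps the loop testable without network access; in normal use, wait_for_job(job_id) with the defaults is enough.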

Step 4 — Read the results

Each sequence gets a label (Cas/Non-Cas), a probability averaged across folds, and a fold-vote count (0–5). Export as CSV, JSON, or a PDF report from the results panel.

Each submission accepts a single protein sequence, 50–2000 amino acids long. A short sequence (< 300 aa) typically finishes in 3–6 minutes on CPU; sequences near the 2000 aa ceiling can take substantially longer because of PSI-BLAST. The job worker has a 2-hour timeout.

Input Formats

FASTA

Standard FASTA with single-letter amino-acid sequences. Exactly one entry per submission. Sequence length must be between 50 and 2000 amino acids (inclusive).

>sp|Q99ZW2|CAS9_STRP1 CRISPR-associated endonuclease Cas9/Csn1
MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEAT
RLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVD
...
>sp|D3E4V5|CAS2_METRM CRISPR-associated endoribonuclease Cas2
MRILVAYDISTDDKRRKVAKVLESYGQRVQYSVFECTLSTSQMNKLILELEQIIDPDEDSI
RIYKLCENCSKVVVTTIGEPTKGIEFLGRGKSKEKKMELV

Accepted uploads
  • Extensions: .fasta, .fa, .txt
  • Maximum file size: 10 KB per submission (enough for one ≤ 2000 aa sequence plus header and whitespace)
  • Sequences per submission: exactly 1
  • Sequence length: 50–2000 amino acids (inclusive)
  • Validation: the input must be non-empty, begin with a > header, and satisfy the length + count limits above
  • Encoding: UTF-8; bytes that cannot be decoded are replaced, so plain ASCII FASTA is safest
Sequence notes
  • Use the 20 canonical amino-acid letters. Ambiguity codes (B, J, X, Z) and the rare residues selenocysteine (U) and pyrrolysine (O) are accepted, but feature extractors may handle them conservatively.
  • PSI-BLAST dominates per-sequence runtime, so trimming very long sequences or removing obvious non-protein noise (e.g., leading DNA) can help throughput.
  • Sequence identifiers appearing in the results correspond to the first whitespace-delimited token of each FASTA header.
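
The upload rules above can be checked locally before submitting. The sketch below is an illustrative pre-flight check, not the service's own validator: the function name check_fasta is hypothetical, and reading "10 KB" as 10 × 1024 bytes is an assumption.

```python
MIN_LEN, MAX_LEN = 50, 2000   # inclusive sequence-length limits
MAX_BYTES = 10 * 1024         # "10 KB" read as 10 x 1024 bytes (assumption)


def check_fasta(raw: bytes) -> tuple[str, str]:
    """Return (sequence_id, sequence) or raise ValueError on a rule violation."""
    if len(raw) > MAX_BYTES:
        raise ValueError("file exceeds the 10 KB limit")
    # Undecodable bytes are replaced rather than rejected, as described above.
    text = raw.decode("utf-8", errors="replace").strip()
    if not text:
        raise ValueError("input is empty")
    if not text.startswith(">"):
        raise ValueError("input must begin with a '>' header line")
    lines = text.splitlines()
    n_entries = sum(line.startswith(">") for line in lines)
    if n_entries != 1:
        raise ValueError(f"expected exactly 1 entry, found {n_entries}")
    header = lines[0][1:].strip()
    if not header:
        raise ValueError("header line carries no identifier")
    # The reported identifier is the first whitespace-delimited header token.
    sequence_id = header.split()[0]
    # Join the sequence lines and drop any internal whitespace.
    sequence = "".join("".join(line.split()) for line in lines[1:]).upper()
    if not MIN_LEN <= len(sequence) <= MAX_LEN:
        raise ValueError(
            f"sequence length {len(sequence)} is outside {MIN_LEN}-{MAX_LEN} aa"
        )
    return sequence_id, sequence
```

The check does not restrict the residue alphabet, since non-canonical letters are accepted by the service.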

Pipeline Stages

No tunable parameters are exposed in the public UI — the pipeline runs a fixed, curated feature set designed for accuracy. For reference, these are the stages that execute for every submitted job.

1. Slim feature extraction

13 ProFeatX / iFeature descriptor modules run inside major_tools.sif. PSI-BLAST against UniRef50 produces PSSM profiles; SpineX computes secondary structure and solvent accessibility; flDPnn (via fldpnn.sif) adds disorder features.

2. Physicochemical features

Biopython's ProteinAnalysis contributes 14 physico_* features, including molecular weight, isoelectric point, GRAVY, instability and aromaticity indices, secondary-structure fractions, and net charge at pH 7.
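
To make two of these descriptors concrete, the sketch below computes GRAVY and aromaticity by hand instead of calling Biopython. It assumes the standard Kyte-Doolittle hydropathy scale for GRAVY and the F/W/Y fraction for aromaticity, and is intended to mirror those definitions; the feature names physico_gravy and physico_aromaticity are hypothetical, and the pipeline's own code may differ.

```python
# Kyte-Doolittle hydropathy values for the 20 canonical residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean hydropathy over the canonical residues in the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not scores:
        raise ValueError("sequence contains no canonical residues")
    return sum(scores) / len(scores)


def aromaticity(sequence: str) -> float:
    """Fraction of residues that are Phe, Trp, or Tyr."""
    if not sequence:
        raise ValueError("sequence is empty")
    seq = sequence.upper()
    return sum(seq.count(aa) for aa in "FWY") / len(seq)


def physico_subset(sequence: str) -> dict:
    """Two illustrative features; the key names are hypothetical."""
    return {
        "physico_gravy": gravy(sequence),
        "physico_aromaticity": aromaticity(sequence),
    }
```

Non-canonical letters are skipped in gravy here, which is one possible reading of "handled conservatively"; it is a choice made for this sketch only.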

3. HMM evidence

pyhmmer 0.12 scans each sequence against Pfam-A and PGAP/TIGRFAMs. Five global features are kept: total hits, unique families, coverage ratio, Cas-specificity score, and log₁₀ of the best Cas E-value.

4. Feature selection

The 72 features selected by Boruta during training (the union across all folds) are assembled for each sequence, and ID mismatches between ProFeatX and iFeature outputs are reconciled at this step.

5. Prediction

Five stacked ensembles (fold_0.joblib through fold_4.joblib, CatBoost + LightGBM per fold) are loaded and scored. Each fold applies its own calibrated threshold; the final label is decided by majority vote (≥ 3/5 folds → Cas). The reported probability is the mean of the five fold probabilities.
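
The voting rule can be written out in a few lines. This sketch takes the per-fold probabilities and thresholds as plain numbers rather than loading the joblib models; the function name combine_folds, the example threshold values, and the use of ≥ (rather than >) at the threshold are assumptions.

```python
from statistics import mean


def combine_folds(fold_probs, fold_thresholds, min_votes=3):
    """Majority vote over per-fold calls plus the mean fold probability."""
    if len(fold_probs) != len(fold_thresholds):
        raise ValueError("need one threshold per fold probability")
    # A fold votes Cas when its probability reaches its own threshold
    # (whether the real comparison is >= or > is an assumption).
    votes = sum(p >= t for p, t in zip(fold_probs, fold_thresholds))
    return {
        "prediction": "Cas" if votes >= min_votes else "Non-Cas",
        "probability": mean(fold_probs),
        "fold_votes": votes,
    }


# Hypothetical inputs: five fold probabilities and five per-fold thresholds.
result = combine_folds([0.9, 0.6, 0.4, 0.55, 0.2], [0.5, 0.5, 0.5, 0.6, 0.5])
```

Because the label follows the votes and not the mean, the two can disagree: in the example only two folds reach their thresholds, so the call is Non-Cas even though the mean probability is above 0.5.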

Interpreting Results

Per-sequence output fields
Field         Type                Meaning
sequence_id   string              The first token of the FASTA header line for this sequence.
prediction    "Cas" | "Non-Cas"   Final label from majority vote across the five folds.
probability   float ∈ [0, 1]      Mean of the five per-fold probabilities. Not a calibrated confidence per se, but monotonic with ensemble agreement.
fold_votes    integer 0–5         Number of folds that individually called the sequence Cas at their per-fold threshold. ≥ 3 ⇒ prediction = "Cas".

Reading the fold vote
  • Strong Cas (5/5): all folds agree; highest confidence.
  • Borderline (3–4 / 5): majority Cas but some disagreement; worth cross-checking with HMM evidence.
  • Non-Cas (0–2 / 5): minority or no folds called Cas; classified Non-Cas.

Job summary counters

On job completion the API also returns n_sequences, n_cas, and n_noncas, mirroring the aggregate counts shown in the results panel.
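
A client can reproduce the fold-vote bands and the summary counters from the per-sequence records. The sketch below assumes each record is a dict with the prediction and fold_votes fields described above; the function names vote_band and summarize are hypothetical.

```python
def vote_band(fold_votes: int) -> str:
    """Map a fold-vote count (0-5) to the reading bands described above."""
    if not 0 <= fold_votes <= 5:
        raise ValueError("fold_votes must be between 0 and 5")
    if fold_votes == 5:
        return "Strong Cas"
    if fold_votes >= 3:
        return "Borderline"
    return "Non-Cas"


def summarize(results: list) -> dict:
    """Recompute n_sequences, n_cas, and n_noncas from per-sequence records."""
    n_cas = sum(r["prediction"] == "Cas" for r in results)
    return {
        "n_sequences": len(results),
        "n_cas": n_cas,
        "n_noncas": len(results) - n_cas,
    }
```

Comparing the recomputed counters with those returned by the API is a cheap sanity check after export.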