
Overview

The ensemble layer is the “brain” of Guardian API, intelligently combining outputs from three independent models into a single, actionable moderation decision.

Ensemble Algorithm

Step 1: Weighted Score Fusion

The ensemble uses a weighted average of model scores:
ensemble_score = (
    0.35 × sexism_score +      # Model 1: Sexism classifier
    0.35 × toxicity_score +    # Model 2: Toxicity transformer
    0.30 × rule_score          # Model 3: Rule engine
)
Why 35-35-30?
  • Sexism (35%): Custom-trained, high accuracy on target domain
  • Toxicity (35%): State-of-the-art transformer, broad coverage
  • Rules (30%): High precision but limited recall; well suited to catching critical issues
These weights were chosen through empirical testing to balance precision and recall across different content types.
The rule score is computed based on detected flags:
Flag Detected    Rule Score
Self-harm        0.95 (highest)
Slur             0.90
Threat           0.85
Profanity        0.40
None             0.00
If multiple flags are detected, the maximum score is used. Additionally, if any critical rule (slur, self-harm, threat) is detected, the rule score is set to at least 0.70.
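As a sketch, the flag-to-score mapping and the critical floor might look like the following. The flag names, constants, and compute_rule_score helper are illustrative assumptions, not the actual identifiers in backend/app/core/ensemble.py:
# Hypothetical sketch of the rule engine's score derivation; names are illustrative.
FLAG_SCORES = {
    "self_harm": 0.95,
    "slur": 0.90,
    "threat": 0.85,
    "profanity": 0.40,
}
CRITICAL_FLAGS = {"self_harm", "slur", "threat"}

def compute_rule_score(detected_flags: set[str]) -> float:
    """Return the rule engine's contribution to the ensemble score."""
    if not detected_flags:
        return 0.0
    # Multiple flags: take the maximum score.
    score = max(FLAG_SCORES.get(flag, 0.0) for flag in detected_flags)
    # Any critical flag floors the rule score at 0.70.
    if detected_flags & CRITICAL_FLAGS:
        score = max(score, 0.70)
    return score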

Step 2: Conflict Resolution

Rule-based detections can override the weighted average for critical issues:
1. Slur or Self-Harm Detected

If slurs or self-harm phrases are detected:
final_score = max(ensemble_score, 0.8)
Ensures high-risk content is never underestimated.
2. Threat Detected

If threat patterns are detected:
final_score = max(ensemble_score, 0.7)
Elevates score even if ML models missed it.
3. No Critical Flags

Use the weighted average as-is:
final_score = ensemble_score
Why Override?

Rule-based detection has high precision (low false positives) for critical issues. When rules detect slurs, threats, or self-harm, we trust them even if ML models disagree.
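A minimal sketch of this conflict-resolution logic, assuming the rule flags arrive as a dictionary of booleans (the flag keys mirror those used in the tuning example later on this page):
# Hypothetical sketch of Step 2; the flag keys follow the tuning example below.
def resolve_conflicts(ensemble_score: float, rule_flags: dict) -> float:
    """Apply rule-based overrides to the weighted ensemble score."""
    if rule_flags.get("slur_detected") or rule_flags.get("self_harm_flag"):
        # Slurs and self-harm floor the score at 0.8, regardless of ML output.
        return max(ensemble_score, 0.8)
    if rule_flags.get("threat_detected"):
        # Threats floor the score at 0.7 even if the ML models missed them.
        return max(ensemble_score, 0.7)
    # No critical flags: use the weighted average as-is.
    return ensemble_score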

Step 3: Primary Issue Identification

The ensemble identifies the main concern based on individual model scores:
if final_score >= 0.7:
    if sexism_score >= 0.6:
        primary_issue = "sexism"
    elif toxicity_score >= 0.6:
        primary_issue = "toxicity"
    elif rule_penalties:
        primary_issue = rule_penalties[0]  # e.g., "slur", "threat", "self_harm"
    else:
        primary_issue = "harmful_content"
else:
    primary_issue = "none"
This helps users understand why content was flagged.

Step 4: Summary Classification

The final score is mapped to a human-readable summary:
Score Range    Summary                Meaning
0.6 - 1.0      highly_harmful         Strong evidence of harmful content
0.3 - 0.6      likely_harmful         Probable harmful content
0.1 - 0.3      potentially_harmful    Some harmful indicators
0.0 - 0.1      likely_safe            No significant harmful content
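A minimal sketch of this mapping (exactly which summary a score sitting on a boundary receives is an assumption):
# Hypothetical sketch of Step 4; exact boundary behaviour is assumed.
def summarize(score: float) -> str:
    if score >= 0.6:
        return "highly_harmful"
    if score >= 0.3:
        return "likely_harmful"
    if score >= 0.1:
        return "potentially_harmful"
    return "likely_safe"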

Step 5: Severity Calculation

Severity is computed from the final score:
if score >= 0.6:
    severity = "high"
elif score >= 0.3:
    severity = "moderate"
else:
    severity = "low"

Example Scenarios

Four scenarios illustrate how the ensemble behaves:
  • Agreement (Safe)
  • Agreement (Harmful)
  • Conflict (Rule Override)
  • Disagreement
The Agreement (Safe) case is worked through below.

Text: “I love this product!”

Model Outputs:
  • Sexism: 0.05
  • Toxicity: 0.02
  • Rules: 0.00 (no flags)
Ensemble Calculation:
base_score = 0.35 × 0.05 + 0.35 × 0.02 + 0.30 × 0.00
           = 0.0245

final_score = 0.0245  (no overrides)
Result:
{
  "summary": "likely_safe",
  "primary_issue": "none",
  "score": 0.025,
  "severity": "low"
}
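
For illustration, the five steps can be stitched together into a single sketch. The moderate helper and its flag keys are assumptions made for this page, not Guardian API's actual internals:
# Hypothetical end-to-end sketch combining Steps 1-5 with the weights and
# thresholds documented above; identifiers are illustrative.
WEIGHT_SEXISM, WEIGHT_TOXICITY, WEIGHT_RULES = 0.35, 0.35, 0.30

def moderate(sexism_score: float, toxicity_score: float,
             rule_score: float, rule_flags: dict) -> dict:
    # Step 1: weighted score fusion
    score = (WEIGHT_SEXISM * sexism_score
             + WEIGHT_TOXICITY * toxicity_score
             + WEIGHT_RULES * rule_score)

    # Step 2: conflict resolution (rule overrides)
    if rule_flags.get("slur_detected") or rule_flags.get("self_harm_flag"):
        score = max(score, 0.8)
    elif rule_flags.get("threat_detected"):
        score = max(score, 0.7)

    # Step 3: primary issue identification (simplified: rule flag mapping omitted)
    if score >= 0.7:
        if sexism_score >= 0.6:
            primary_issue = "sexism"
        elif toxicity_score >= 0.6:
            primary_issue = "toxicity"
        else:
            primary_issue = "harmful_content"
    else:
        primary_issue = "none"

    # Step 4: summary classification
    if score >= 0.6:
        summary = "highly_harmful"
    elif score >= 0.3:
        summary = "likely_harmful"
    elif score >= 0.1:
        summary = "potentially_harmful"
    else:
        summary = "likely_safe"

    # Step 5: severity calculation
    severity = "high" if score >= 0.6 else "moderate" if score >= 0.3 else "low"

    return {"summary": summary, "primary_issue": primary_issue,
            "score": round(score, 3), "severity": severity}

# The "I love this product!" example: 0.05 / 0.02 / 0.00 with no flags
print(moderate(0.05, 0.02, 0.00, {}))  # -> likely_safe, none, low severity
The fused score lands at roughly 0.025, below every override and classification threshold, so the content is reported as likely_safe with low severity, matching the result above.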

Design Principles

1. Conservative for Critical Issues

Fail-Safe Approach

When in doubt about threats, slurs, or self-harm, err on the side of caution. Better to flag for human review than miss dangerous content.

2. Balanced for Common Cases

Nuanced Scoring

For typical moderation cases (toxicity, rudeness), use weighted fusion to capture nuance. Not everything needs maximum severity.

3. Explainable Decisions

Primary Issue

Always identify the main reason for flagging. Users need to understand why content was moderated.

4. Configurable Weights

Tunable System

Weights can be adjusted based on use case. Social media might weight toxicity higher; professional forums might weight profanity lower.

Tuning the Ensemble

Adjusting Weights

Modify weights in backend/app/core/ensemble.py:
# Example: Prioritize sexism detection
WEIGHT_SEXISM = 0.45  # Increased from 0.35
WEIGHT_TOXICITY = 0.30  # Decreased from 0.35
WEIGHT_RULES = 0.25  # Decreased from 0.30

Adjusting Override Thresholds

Modify conflict resolution thresholds:
# Example: More aggressive overrides
if rule_flags.get("slur_detected") or rule_flags.get("self_harm_flag"):
    final_score = max(base_score, 0.9)  # Increased from 0.8

Adjusting Severity Boundaries

Modify thresholds in backend/app/config.py:
# Example: Stricter high severity
SEVERITY_HIGH = 0.7  # Increased from 0.6
SEVERITY_MODERATE = 0.4  # Increased from 0.3

Next Steps