Overview

Guardian API uses four models: three specialized detectors that run in parallel, plus an ensemble layer that fuses their outputs into a single result. Each detector targets a different aspect of harmful content.

Model 1: Sexism Classifier

LASSO Regression Model

Custom-trained binary classifier for sexism detection

Technical Details

Attribute      Value
Algorithm      LASSO (Least Absolute Shrinkage and Selection Operator)
Training Data  ~40,000 labeled tweets
Features       2,500 n-grams (1-2) + 3 additional features
Threshold      0.400 (optimized for F1 score)
Performance    ~82% F1 score on test set
Version        sexism_lasso_v1

Feature Engineering

The model combines text features with a small set of numerical features (a sketch of the vectorizer configuration follows this list):

Text Features (CountVectorizer):
  • max_features: 2,500
  • ngram_range: (1, 2)
  • min_df: 2 (minimum document frequency)
  • max_df: 0.8 (maximum document frequency)
  • stop_words: English (with gendered words preserved)

Additional Features:
  • Text length, exclamation-mark count, and sentiment score (the three extra features used in the prediction flow below)

Preserved Gendered Words:
  • Pronouns: he, him, she, her, etc.
  • Nouns: man, woman, men, women, boy, girl
  • These are kept because they are important for detecting sexist language patterns
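
As a hedged sketch, here is how such a vectorizer could be configured with scikit-learn. The exact stop-word handling in Guardian is not shown in this section, so treat the GENDERED set and the stop-list subtraction as illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Illustrative assumption: preserve gendered words by removing them
# from scikit-learn's built-in English stop list
GENDERED = {"he", "him", "his", "she", "her", "hers",
            "man", "woman", "men", "women", "boy", "girl"}
stop_words = list(ENGLISH_STOP_WORDS - GENDERED)

vectorizer = CountVectorizer(
    max_features=2500,    # 2,500 n-gram features
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # drop terms appearing in fewer than 2 documents
    max_df=0.8,           # drop terms appearing in more than 80% of documents
    stop_words=stop_words,
)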

Prediction Process

# Example prediction flow
from scipy.sparse import hstack

text = "Your input text here"

# 1. Preprocess (lowercase, remove URLs, etc.)
processed = preprocess_text(text)

# 2. Vectorize with the fitted CountVectorizer
X_text = vectorizer.transform([processed])  # sparse, 2,500 features

# 3. Extract the 3 additional numerical features (one row)
extra = [[length, exclaim_count, sentiment]]

# 4. Combine into a single feature matrix
X_combined = hstack([X_text, extra])  # 2,503 total features

# 5. Predict with the LASSO regressor (continuous output)
score = model.predict(X_combined)[0]  # roughly 0.0 to 1.0

# 6. Apply the decision threshold
is_sexist = score >= 0.400
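
preprocess_text, vectorizer, model, and the three feature values are defined elsewhere in the codebase and are not shown here. A minimal sketch of the preprocessing step, based only on the comment above (lowercase, remove URLs), might look like:

import re

def preprocess_text(text: str) -> str:
    # Sketch only: lowercase and strip URLs, per the comment above
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)
    return text.strip()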

Output Format

{
  "score": 0.724,
  "severity": "high",
  "model_version": "sexism_lasso_v1",
  "threshold_met": true
}
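
The documentation does not state how severity is derived from score. One plausible mapping, consistent with the example above (score 0.724 yielding "high" and threshold_met at 0.400), is sketched below; the bucket boundaries are assumptions:

def build_output(score: float, threshold: float = 0.400) -> dict:
    # Assumed severity buckets; only the 0.400 decision threshold is documented
    if score >= 0.7:
        severity = "high"
    elif score >= threshold:
        severity = "medium"
    else:
        severity = "low"
    return {
        "score": round(score, 3),
        "severity": severity,
        "model_version": "sexism_lasso_v1",
        "threshold_met": score >= threshold,
    }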

Model 2: Toxicity Transformer

HuggingFace Transformer

Multi-label toxicity detection using RoBERTa

Technical Details

Attribute     Value
Architecture  RoBERTa (Robustly Optimized BERT)
Model Name    unitary/unbiased-toxic-roberta
Type          Multi-label classification
Device        CUDA (GPU) if available, CPU fallback
Max Length    512 tokens
Version       toxic_roberta_v1
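
The snippet below is a minimal sketch of loading and querying this model with the Hugging Face transformers library; Guardian's actual wrapper code is not shown here. Because this is a multi-label model, each label gets an independent sigmoid score rather than a softmax across labels:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "unitary/unbiased-toxic-roberta"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

inputs = tokenizer("Your input text here", truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label: apply sigmoid per label
probs = torch.sigmoid(logits)[0]
scores = {model.config.id2label[i]: float(p) for i, p in enumerate(probs)}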

Toxicity Categories

The model detects 7 categories of toxicity:
  • Overall Toxicity: general toxic language score
  • Severe Toxicity: extremely harmful content
  • Obscene: vulgar or obscene language
  • Threat: threatening language
  • Insult: personal insults and attacks
  • Identity Attack: attacks on identity groups
  • Sexual Explicit: sexually explicit content

Device Management

The toxicity model automatically runs on the GPU when one is available:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
Performance Comparison:
  • GPU (RTX 4050): ~10-15ms per request
  • CPU: ~40-60ms per request
Memory Usage:
  • GPU: ~2GB VRAM
  • CPU: ~1GB RAM

Output Format

{
  "overall": 0.742,
  "insult": 0.631,
  "threat": 0.123,
  "identity_attack": 0.412,
  "profanity": 0.584,
  "model_version": "toxic_roberta_v1"
}
The overall score is floored at the maximum of the sub-category scores, so it is never lower than any individual category.
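
A minimal sketch of that post-processing step, with field names taken from the example response above:

# Floor the overall score at the highest sub-category score
sub_scores = (result["insult"], result["threat"],
              result["identity_attack"], result["profanity"])
result["overall"] = max(result["overall"], *sub_scores)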

Model 3: Rule Engine

Heuristic-Based System

Pattern matching and rule-based detection

Technical Details

Attribute         Value
Type              Rule-based heuristics
Rules             JSON configuration files
Pattern Matching  Regex + exact matching
Extensible        Easy to add new rules
Version           rules_v1

Rule Categories

  • Slurs
  • Threats
  • Self-Harm
  • Profanity
  • Style Checks

Example: Slurs

File: backend/app/models/rules/slurs.json
Detection: Exact word matching (case-insensitive)
Purpose: Identify hate speech and slurs
Format:
{
  "slurs": [
    "slur1",
    "slur2",
    "..."
  ]
}
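
A hedged sketch of how exact, case-insensitive word matching against this file could work; Guardian's actual rule engine is not shown in this section:

import json
import re

with open("backend/app/models/rules/slurs.json") as f:
    slurs = json.load(f)["slurs"]

# Whole-word, case-insensitive matching
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, slurs)) + r")\b",
    re.IGNORECASE,
)

def slur_detected(text: str) -> bool:
    return pattern.search(text) is not None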

Output Format

{
  "slur_detected": false,
  "threat_detected": true,
  "self_harm_flag": false,
  "profanity_flag": true,
  "caps_abuse": false,
  "character_repetition": false,
  "model_version": "rules_v1"
}

Customization

Adding new rules is straightforward (an illustrative example follows these steps):

1. Edit the JSON file
   Navigate to backend/app/models/rules/ and edit the appropriate JSON file.

2. Add your rules
   • For slurs/profanity: add words to the array
   • For threats: add regex patterns
   • For self-harm: add phrases

3. Restart the API
   The API automatically loads the new rules on startup.
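
For example, assuming backend/app/models/rules/profanity.json mirrors the slurs.json structure shown above (only slurs.json's format is documented here), adding a word is a one-line change:

{
  "profanity": [
    "existing_word",
    "newly_added_word"
  ]
}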

Model 4: Ensemble

Aggregation Layer

Combines the outputs of the other three models
The ensemble layer fuses the scores and flags from all three detection models into a single moderation result. See Ensemble for details.
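
The fusion logic itself is documented on the Ensemble page; the sketch below only illustrates the general shape (scores from the three models in, one aggregate out), with entirely assumed weights and rules:

def fuse(sexism_score: float, toxicity: dict, rules: dict) -> float:
    # Assumed behavior, for illustration only: take the strongest ML signal,
    # then let hard rule hits impose a floor on the final score
    ml_score = max(sexism_score, toxicity["overall"])
    rule_hit = (rules["slur_detected"] or rules["threat_detected"]
                or rules["self_harm_flag"])
    return max(ml_score, 0.9 if rule_hit else 0.0)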

Model Comparison

Feature          Sexism Classifier    Toxicity Model      Rule Engine
Type             ML (LASSO)           ML (Transformer)    Heuristic
Speed            Fast (~5ms)          Medium (~15ms GPU)  Fast (~2ms)
Accuracy         High (82% F1)        High                Rule-dependent
Extensibility    Requires retraining  Requires retraining Easy (JSON)
Resource Usage   Low                  Medium-High         Very Low
False Positives  Low                  Low-Medium          Medium

Next Steps