Overview
Guardian API uses four models working in parallel to provide comprehensive content moderation. Each model specializes in different aspects of harmful content detection.
Model 1: Sexism Classifier
LASSO Regression Model
Custom-trained binary classifier for sexism detection
Technical Details
| Attribute | Value |
|---|---|
| Algorithm | LASSO (Least Absolute Shrinkage and Selection Operator) |
| Training Data | ~40,000 labeled tweets |
| Features | 2,500 n-gram counts (unigrams and bigrams) + 3 additional features |
| Threshold | 0.400 (optimized for F1 score) |
| Performance | ~82% F1 score on test set |
| Version | sexism_lasso_v1 |
Feature Engineering
The model uses a combination of text and numerical features:
Text Features
CountVectorizer Configuration (a sketch follows below):
- max_features: 2,500
- ngram_range: (1, 2)
- min_df: 2 (minimum document frequency)
- max_df: 0.8 (maximum document frequency)
- stop_words: English, with gendered words preserved
  - Pronouns: he, him, she, her, etc.
  - Nouns: man, woman, men, women, boy, girl
  - Preserving these words is important for detecting sexist language patterns
Additional Features
Three additional numerical features are combined with the n-gram counts (see the Technical Details table above).
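As a rough sketch only (not the repository's actual code), this configuration could be expressed with scikit-learn as follows; the GENDERED_WORDS name and the stop-word handling are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Assumption: gendered words are removed from the English stop-word list so the
# classifier can still use them as features.
GENDERED_WORDS = {"he", "him", "she", "her", "man", "woman", "men", "women", "boy", "girl"}
custom_stop_words = list(ENGLISH_STOP_WORDS - GENDERED_WORDS)

vectorizer = CountVectorizer(
    max_features=2500,    # cap the vocabulary at 2,500 n-grams
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # ignore terms seen in fewer than 2 documents
    max_df=0.8,           # ignore terms seen in more than 80% of documents
    stop_words=custom_stop_words,
)
```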
Prediction Process
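The prediction code itself is not reproduced on this page. A minimal sketch of the flow, assuming an L1-regularized (LASSO-style) logistic model with a `predict_proba` interface and a fitted vectorizer, might look like:

```python
import numpy as np

SEXISM_THRESHOLD = 0.400  # decision threshold optimized for F1

def predict_sexism(text, vectorizer, model, extra_features):
    """Sketch: n-gram counts + 3 extra features -> probability -> thresholded label."""
    text_features = vectorizer.transform([text]).toarray()                      # 1 x 2,500 n-gram counts
    features = np.hstack([text_features, np.reshape(extra_features, (1, -1))])  # append the 3 additional features
    score = float(model.predict_proba(features)[0, 1])                          # probability of the positive class
    return score, score >= SEXISM_THRESHOLD
```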
Output Format
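The exact response schema is not shown here; a hypothetical output with placeholder field names could look like:

```python
# Hypothetical field names -- the real schema may differ.
example_sexism_output = {
    "score": 0.87,
    "is_sexist": True,
    "threshold": 0.400,
    "model_version": "sexism_lasso_v1",
}
```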
Model 2: Toxicity Transformer
HuggingFace Transformer
Multi-label toxicity detection using RoBERTa
Technical Details
| Attribute | Value |
|---|---|
| Architecture | RoBERTa (Robustly Optimized BERT) |
| Model Name | unitary/unbiased-toxic-roberta |
| Type | Multi-label classification |
| Device | CUDA (GPU) if available, CPU fallback |
| Max Length | 512 tokens |
| Version | toxic_roberta_v1 |
Toxicity Categories
The model detects 7 categories of toxicity:
- Overall Toxicity: general toxic language score
- Severe Toxicity: extremely harmful content
- Obscene: vulgar or obscene language
- Threat: threatening language
- Insult: personal insults and attacks
- Identity Attack: attacks on identity groups
- Sexual Explicit: sexually explicit content
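A minimal sketch of querying this checkpoint with the HuggingFace pipeline API (a generic usage pattern, not the repository's own wrapper; the sigmoid scoring and label handling are assumptions about the checkpoint):

```python
from transformers import pipeline

# Multi-label scoring: return every label and apply a per-label sigmoid.
toxicity = pipeline(
    "text-classification",
    model="unitary/unbiased-toxic-roberta",
    top_k=None,
    function_to_apply="sigmoid",
)

results = toxicity(["example input text"], truncation=True, max_length=512)
# results[0] is a list of {"label": ..., "score": ...} entries for the first input.
scores = {entry["label"]: entry["score"] for entry in results[0]}
```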
Device Management
GPU Acceleration
The toxicity model automatically uses the GPU if available (see the sketch below).
Performance Comparison:
- Latency: ~10-15ms per request on GPU (RTX 4050), ~40-60ms per request on CPU
- Memory: ~2GB VRAM on GPU, ~1GB RAM on CPU
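A sketch of the device selection described above, using PyTorch (the repository's actual loading code may differ):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Prefer CUDA when a GPU is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("unitary/unbiased-toxic-roberta")
model = AutoModelForSequenceClassification.from_pretrained("unitary/unbiased-toxic-roberta")
model.to(device)
model.eval()
```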
Output Format
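A hypothetical response shape with placeholder field names, illustrating the note below on how the overall score relates to the sub-category scores:

```python
# Hypothetical field names -- the real schema may differ.
example_toxicity_output = {
    "toxicity": 0.91,          # overall score: at least the maximum of the sub-categories
    "severe_toxicity": 0.12,
    "obscene": 0.64,
    "threat": 0.03,
    "insult": 0.91,
    "identity_attack": 0.08,
    "sexual_explicit": 0.02,
    "model_version": "toxic_roberta_v1",
}
```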
The overall score is automatically set to at least the maximum of all sub-category scores.
Model 3: Rule Engine
Heuristic-Based System
Pattern matching and rule-based detection
Technical Details
| Attribute | Value |
|---|---|
| Type | Rule-based heuristics |
| Rules | JSON configuration files |
| Pattern Matching | Regex + exact matching |
| Extensible | Easy to add new rules |
| Version | rules_v1 |
Rule Categories
- Slurs
- Threats
- Self-Harm
- Profanity
- Style Checks
File: backend/app/models/rules/slurs.json
Detection: Exact word matching (case-insensitive)
Purpose: Identify hate speech and slurs
Format: a JSON array of words (see the sketch below)
Output Format
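For illustration only, a hypothetical slurs.json (a plain JSON array of words) and a sketch of case-insensitive exact matching; the function name and output fields are assumptions, not the repository's code:

```python
import json
import re

# Assumption: slurs.json is a plain JSON array, e.g. ["word1", "word2", ...]
with open("backend/app/models/rules/slurs.json") as f:
    SLUR_WORDS = {word.lower() for word in json.load(f)}

def check_slurs(text):
    """Exact word matching, case-insensitive."""
    tokens = re.findall(r"[\w']+", text.lower())
    matches = sorted(SLUR_WORDS.intersection(tokens))
    # Hypothetical output fields -- the real schema may differ.
    return {"rule": "slurs", "flagged": bool(matches), "matches": matches, "model_version": "rules_v1"}
```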
Customization
Adding new rules is straightforward:
1. Edit JSON File: navigate to backend/app/models/rules/ and edit the appropriate JSON file.
2. Add Your Rules:
   - For slurs/profanity: add words to the array
   - For threats: add regex patterns
   - For self-harm: add phrases
3. Restart API: the API will automatically load the new rules on startup.
Model 4: Ensemble
Aggregation Layer
Combines outputs from all three models
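How the ensemble combines the three outputs is not detailed on this page; one plausible, purely illustrative aggregation is to take the worst signal across models:

```python
def aggregate(sexism_score, toxicity_scores, rule_flags):
    """Illustrative ensemble: flag content on the strongest signal from any model."""
    worst_toxicity = max(toxicity_scores.values()) if toxicity_scores else 0.0
    rule_hit = any(rule_flags.values())
    overall = max(sexism_score, worst_toxicity, 1.0 if rule_hit else 0.0)
    # Hypothetical field names -- the real schema may differ.
    return {
        "overall_score": overall,
        "flagged": rule_hit or overall >= 0.5,
        "sources": {
            "sexism_lasso_v1": sexism_score,
            "toxic_roberta_v1": worst_toxicity,
            "rules_v1": rule_hit,
        },
    }
```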
Model Comparison
| Feature | Sexism Classifier | Toxicity Model | Rule Engine |
|---|---|---|---|
| Type | ML (LASSO) | ML (Transformer) | Heuristic |
| Speed | Fast (~5ms) | Medium (~15ms GPU) | Fast (~2ms) |
| Accuracy | High (82% F1) | High | Rule-dependent |
| Extensibility | Requires retraining | Requires retraining | Easy (JSON) |
| Resource | Low | Medium-High | Very Low |
| False Positives | Low | Low-Medium | Medium |