RouteX operates as a classification-and-routing pipeline that sits between users and a fleet of LLM endpoints.
A serverless ML system that:
Classifies each prompt into 24 task types (code, QA, creative writing, etc.)
Estimates complexity across 6 dimensions (creativity, reasoning, domain knowledge, etc.)
Routes to the right model tier based on predicted complexity
Streams responses in real-time with cost tracking
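The routing step above can be pictured with a minimal sketch; the tier names, thresholds, and the way the six complexity scores are aggregated are illustrative assumptions, not the production configuration:

```python
from dataclasses import dataclass

@dataclass
class Classification:
    task_type: str                  # one of the 24 task labels
    complexity: dict[str, float]    # 6 dimensions, each in [0, 1]
    needs_web_search: bool

def route(result: Classification) -> str:
    """Map predicted complexity to a model tier (illustrative thresholds only)."""
    score = max(result.complexity.values())   # assumed aggregation of the 6 dimensions
    if score < 0.3:
        return "lightweight-tier"
    if score < 0.7:
        return "mid-tier"
    return "frontier-tier"
```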
Live Demo: https://d2y4j45z3b2c2n.cloudfront.net/chat
Tech Stack: PyTorch | Hugging Face Transformers | BERT | Optuna | ONNX Runtime | AWS Lambda | Docker | DynamoDB | AWS ECR | API Gateway | CloudFront | AWS S3 | CloudWatch
Base Model: MobileBERT (25M parameters) chosen over BERT-base (110M) for 4.4x size reduction while maintaining 95%+ accuracy.
Prediction Heads:
Task Head: 24-class classification for task type prediction
Complexity Heads: 6 regression heads for continuous [0, 1] dimension estimates
Boolean Head: Binary classification for web-search detection
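A minimal sketch of how these heads can sit on top of the shared MobileBERT encoder; the module names and the use of the pooled output are assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn
from transformers import MobileBertModel

class RouteXClassifier(nn.Module):
    """Shared MobileBERT encoder with task, complexity, and web-search heads (sketch)."""

    def __init__(self, num_tasks: int = 24, num_complexity_dims: int = 6):
        super().__init__()
        self.encoder = MobileBertModel.from_pretrained("google/mobilebert-uncased")
        hidden = self.encoder.config.hidden_size
        self.task_head = nn.Linear(hidden, num_tasks)             # 24-way task-type classifier
        self.complexity_heads = nn.ModuleList(                    # 6 regression heads
            nn.Linear(hidden, 1) for _ in range(num_complexity_dims)
        )
        self.web_search_head = nn.Linear(hidden, 1)               # binary web-search detector

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> dict:
        pooled = self.encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        complexity = torch.cat([head(pooled) for head in self.complexity_heads], dim=-1)
        return {
            "task_logits": self.task_head(pooled),
            "complexity": torch.sigmoid(complexity),   # squash each dimension into [0, 1]
            "web_search_logit": self.web_search_head(pooled),
        }
```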
Dataset:
WildChat-1M dataset (13.3k samples kept after filtering; each is the first user prompt of a conversation, capped at 512 tokens, and comes from a user in India)
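A hedged sketch of that filtering step with the Hugging Face datasets library; the `allenai/WildChat-1M` repo id and the `country` / `conversation` column names follow the public dataset card, and the tokenizer-based length check is an assumption about how the 512-token cap was enforced:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
raw = load_dataset("allenai/WildChat-1M", split="train")

def keep(example: dict) -> bool:
    """First user prompt only, <= 512 tokens, from users in India."""
    first_prompt = example["conversation"][0]["content"]
    return (
        example["country"] == "India"
        and len(tokenizer(first_prompt)["input_ids"]) <= 512
    )

filtered = raw.filter(keep)
prompts = [ex["conversation"][0]["content"] for ex in filtered]
```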
Training Strategy:
Weighted task loss
Weighted random sampling for class imbalance
Applied gradient clipping and a warmup scheduler (see the training-loop sketch after this list)
Real-time monitoring via Flask dashboard tracking loss curves, F1-scores, and accuracy across all prediction heads
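A condensed sketch of how those pieces fit together for one epoch; the hyperparameter values, batch field names, and the `model` interface (matching the head sketch above) are placeholders rather than the actual training script:

```python
import torch
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler
from transformers import get_linear_schedule_with_warmup

def train_one_epoch(model: torch.nn.Module, dataset: Dataset, task_labels: list[int],
                    class_counts: dict[int, int], class_weights: torch.Tensor) -> None:
    # Weighted random sampling: rare task types are drawn more often per epoch.
    sample_weights = [1.0 / class_counts[label] for label in task_labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(task_labels))
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Class-weighted task loss plus the auxiliary complexity and web-search losses.
    task_loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
    complexity_loss_fn = torch.nn.MSELoss()
    bool_loss_fn = torch.nn.BCEWithLogitsLoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(   # warmup then linear decay (simplified to one epoch)
        optimizer, num_warmup_steps=500, num_training_steps=len(loader)
    )

    for batch in loader:
        out = model(batch["input_ids"], batch["attention_mask"])
        loss = (
            task_loss_fn(out["task_logits"], batch["task_label"])
            + complexity_loss_fn(out["complexity"], batch["complexity_targets"])
            + bool_loss_fn(out["web_search_logit"].squeeze(-1), batch["web_search"].float())
        )
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```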
The training dashboard provides live visibility into convergence, showing smooth loss decay and F1-score saturation—critical for catching training issues early and validating multi-task learning effectiveness.
Training Monitor Dashboard
Implemented Optuna-based Bayesian optimization with 50 trials:
Outcome: 8.2% F1-score improvement over the baseline. Learning rate and batch size proved most important. A MedianPruner terminated unpromising trials early.
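A minimal sketch of that search setup; the search space and the `train_and_eval` helper are assumptions, and only the 50-trial Bayesian search with median pruning mirrors the description above:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space is illustrative; learning rate and batch size mattered most in practice.
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    warmup_steps = trial.suggest_int("warmup_steps", 0, 1000)
    # train_and_eval is a hypothetical helper that trains briefly, reports intermediate
    # scores via trial.report() so the pruner can stop weak trials, and returns validation F1.
    return train_and_eval(lr=lr, batch_size=batch_size, warmup_steps=warmup_steps, trial=trial)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(),   # Bayesian (TPE) search
    pruner=optuna.pruners.MedianPruner(),   # terminate unpromising trials early
)
study.optimize(objective, n_trials=50)
print(study.best_params)
```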
Two-Stage Optimization Pipeline
Serializes the FP32 model to ONNX format for cross-platform deployment (see the export sketch after this list)
Enables automatic operator fusion by ONNX Runtime
Eliminates PyTorch runtime dependency (saves 500MB)
Conservative thread settings for Lambda constraints
Disabled CPU memory arena (faster on serverless)
Graph optimization level: BASIC (faster init)
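A sketch of the export and session setup under those settings; the file name, input/output names, and opset are assumptions, and the traced model is assumed to expose the three outputs described earlier:

```python
import torch
import onnxruntime as ort

def export_to_onnx(model: torch.nn.Module, path: str = "routex.onnx") -> None:
    """Serialize the FP32 PyTorch model to ONNX (assumed input/output names and opset)."""
    dummy_ids = torch.zeros(1, 512, dtype=torch.long)
    dummy_mask = torch.ones(1, 512, dtype=torch.long)
    torch.onnx.export(
        model, (dummy_ids, dummy_mask), path,
        input_names=["input_ids", "attention_mask"],
        output_names=["task_logits", "complexity", "web_search_logit"],
        dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
        opset_version=17,
    )

def load_session(path: str = "routex.onnx") -> ort.InferenceSession:
    """ONNX Runtime session tuned for Lambda's constrained CPU environment."""
    so = ort.SessionOptions()
    so.intra_op_num_threads = 1                      # conservative threads for Lambda vCPUs
    so.inter_op_num_threads = 1
    so.enable_cpu_mem_arena = False                  # skip the CPU memory arena
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC  # faster init than EXTENDED
    return ort.InferenceSession(path, sess_options=so, providers=["CPUExecutionProvider"])
```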
The comparison was performed on a g6.2xlarge machine using CPU-only inference.
Overall accuracy: 76.41%
Weighted F1-score: 0.766
Macro F1-score: 0.612
Interpretation
The model performs reasonably well overall, driven mainly by high-support classes.
The gap between weighted F1 (0.766) and macro F1 (0.612) indicates strong class imbalance issues.
Performance is not uniform across intent types.
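As a toy illustration (made-up labels, not RouteX data), this is exactly the pattern that opens a gap between weighted and macro F1 when one class dominates:

```python
from sklearn.metrics import f1_score

# 90 samples of a dominant intent vs. 10 of a rare one; the rare intent is mostly missed.
y_true = ["qa"] * 90 + ["creative"] * 10
y_pred = ["qa"] * 88 + ["creative"] * 2 + ["qa"] * 8 + ["creative"] * 2

print(f1_score(y_true, y_pred, average="weighted"))  # high, dominated by the frequent class
print(f1_score(y_true, y_pred, average="macro"))     # much lower, dragged down by the rare class
```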
Packaged entire system (model + dependencies) as Docker container for AWS Lambda:
Single package includes all dependencies + model files
Avoids the 250MB Lambda layer/package limit (container images support up to 10GB)
Lambda cold start reduced from 50s+ → 1.5-2s through:
Lazy Initialization: Only load model if rate limit check passes (fast rejection path)
ONNX Runtime Tuning: Graph optimization level = BASIC (faster than EXTENDED)
Global State Reuse: Model, router, API client cached in Python globals across warm invocations
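A sketch of the handler pattern behind these three optimizations; the model path, event fields, and the `check_rate_limit`, `tokenize`, and `decide_route` helpers are hypothetical stand-ins for the real code:

```python
import json
from typing import Optional

import onnxruntime as ort

# Python globals persist across warm invocations of the same Lambda container,
# so the session is created at most once per container lifetime.
_session: Optional[ort.InferenceSession] = None

def _get_session() -> ort.InferenceSession:
    global _session
    if _session is None:                                    # lazy initialization
        _session = ort.InferenceSession("/var/task/routex.onnx")   # path assumed
    return _session

def handler(event: dict, context) -> dict:
    user_id = event["user_id"]                              # event fields assumed
    if not check_rate_limit(user_id):                       # hypothetical helper: fast rejection path
        return {"statusCode": 429, "body": json.dumps({"error": "daily limit reached"})}
    outputs = _get_session().run(None, tokenize(event["prompt"]))   # tokenize(): hypothetical helper
    return {"statusCode": 200, "body": json.dumps({"route": decide_route(outputs)})}  # hypothetical helper
```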
Latency Breakdown:
Lambda cold start: 10-15s
Lambda warm start: 2-3s
Rate limiter check: 50ms
Model load time: 1-2s
Model classification: <200ms
Router decision: 5-10ms
External API call: varies with prompt complexity
Implemented atomic per-user-per-day rate limiting using conditional DynamoDB updates:
Check date and increment counter atomically
Automatic daily reset (no TTL needed)
Tracks total tokens used and cost incurred
Success-only recording: metrics recorded only after external API succeeds
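One way to get the behaviour described above with a single conditional UpdateItem call; the table name, key layout (user id plus date baked into the key), and daily limit are assumptions, and a failed condition maps to the fast 429 rejection path:

```python
import datetime

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "routex-usage"   # table name, key schema, and limit are assumptions
DAILY_LIMIT = 50

def check_and_increment(user_id: str) -> bool:
    """Atomically count one request for (user, today); False once the daily limit is hit."""
    today = datetime.date.today().isoformat()
    try:
        dynamodb.update_item(
            TableName=TABLE_NAME,
            # Keying on user_id + date gives the automatic daily reset: a new day
            # simply starts a fresh item, so no TTL or cleanup job is needed.
            Key={"pk": {"S": f"{user_id}#{today}"}},
            UpdateExpression="SET request_count = if_not_exists(request_count, :zero) + :one",
            ConditionExpression="attribute_not_exists(request_count) OR request_count < :limit",
            ExpressionAttributeValues={
                ":zero": {"N": "0"},
                ":one": {"N": "1"},
                ":limit": {"N": str(DAILY_LIMIT)},
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # limit reached for today
        raise
```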
RouteX Model
Clean up the label space: Merge ultra-low-support and semantically overlapping intents to remove unlearnable classes and reduce noise.
Fix class imbalance: Collect more samples for minority intents and/or use class-weighted loss and oversampling to prevent dominant classes from skewing learning.
Use better metrics: Track macro F1, per-class recall, confidence-based fallback, and calibration instead of relying on accuracy alone.
Add confidence thresholds & fallback: Route low-confidence predictions to a safe fallback to avoid risky misclassifications in production.
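A minimal sketch of such a confidence gate; the threshold, the `TASK_TO_TIER` mapping, and the fallback label are placeholders:

```python
import numpy as np

TASK_TO_TIER: dict[int, str] = {}   # placeholder for the real task-index -> model-tier mapping

def pick_route(task_logits: np.ndarray, threshold: float = 0.6) -> str:
    """Send low-confidence predictions to a safe fallback instead of a risky specialised route."""
    probs = np.exp(task_logits - task_logits.max())
    probs /= probs.sum()
    if probs.max() < threshold:                     # model is unsure about the task type
        return "fallback-tier"
    return TASK_TO_TIER.get(int(probs.argmax()), "fallback-tier")
```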
ML Ops
Move inference off Lambda to always-warm compute: Deploy the model on ECS/Fargate or EKS (or a small always-on EC2 service) to eliminate cold starts and stabilise latency for frequent requests.
Use lightweight, optimised models in containers: Serve RouteX via a containerised inference service with the model preloaded and horizontal auto-scaling.
Adopt a hybrid architecture: Keep Lambda only as a thin request router, forwarding inference to the container/EC2 service, preserving serverless flexibility while avoiding model warm-up penalties.