RouteX operates as a classification-and-routing pipeline that sits between users and a fleet of LLM endpoints.
A serverless ML system that:
Classifies each prompt into 24 task types (code, QA, creative writing, etc.)
Estimates complexity across 6 dimensions (creativity, reasoning, domain knowledge, etc.)
Routes to the right model tier based on predicted complexity
Streams responses in real-time with cost tracking
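The routing step above can be pictured with a minimal sketch; the tier names, thresholds, and the way the six complexity scores are aggregated are illustrative assumptions, not the production configuration:

```python
from dataclasses import dataclass

@dataclass
class Classification:
    task_type: str                  # one of the 24 task labels
    complexity: dict[str, float]    # 6 dimensions, each in [0, 1]
    needs_web_search: bool

def route(result: Classification) -> str:
    """Map predicted complexity to a model tier (illustrative thresholds only)."""
    score = max(result.complexity.values())   # assumed aggregation of the 6 dimensions
    if score < 0.3:
        return "lightweight-tier"
    if score < 0.7:
        return "mid-tier"
    return "frontier-tier"
```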
Live Demo: https://d2y4j45z3b2c2n.cloudfront.net/chat
Tech Stack: PyTorch | Hugging Face Transformers | BERT | Optuna | ONNX Runtime | AWS Lambda | Docker | DynamoDB | AWS ECR | API Gateway | CloudFront | AWS S3 | CloudWatch
Base Model: MobileBERT (25M parameters) chosen over BERT-base (110M) for 4.4x size reduction while maintaining 95%+ accuracy.
Prediction Heads:
Task Head: 24-class classification for task type prediction
Complexity Heads: 6 regression heads for continuous [0, 1] dimension estimates
Boolean Head: Binary classification for web-search detection
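A minimal sketch of how these heads can sit on top of the shared MobileBERT encoder; the module names and the use of the pooled output are assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn
from transformers import MobileBertModel

class RouteXClassifier(nn.Module):
    """Shared MobileBERT encoder with task, complexity, and web-search heads (sketch)."""

    def __init__(self, num_tasks: int = 24, num_complexity_dims: int = 6):
        super().__init__()
        self.encoder = MobileBertModel.from_pretrained("google/mobilebert-uncased")
        hidden = self.encoder.config.hidden_size
        self.task_head = nn.Linear(hidden, num_tasks)             # 24-way task-type classifier
        self.complexity_heads = nn.ModuleList(                    # 6 regression heads
            nn.Linear(hidden, 1) for _ in range(num_complexity_dims)
        )
        self.web_search_head = nn.Linear(hidden, 1)               # binary web-search detector

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> dict:
        pooled = self.encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        complexity = torch.cat([head(pooled) for head in self.complexity_heads], dim=-1)
        return {
            "task_logits": self.task_head(pooled),
            "complexity": torch.sigmoid(complexity),   # squash each dimension into [0, 1]
            "web_search_logit": self.web_search_head(pooled),
        }
```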
Dataset:
WildChat-1M dataset (13.3k samples kept after filtering; each is the first user prompt of a conversation, capped at 512 tokens, and comes from a user in India)
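A hedged sketch of that filtering step with the Hugging Face datasets library; the `allenai/WildChat-1M` repo id and the `country` / `conversation` column names follow the public dataset card, and the tokenizer-based length check is an assumption about how the 512-token cap was enforced:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
raw = load_dataset("allenai/WildChat-1M", split="train")

def keep(example: dict) -> bool:
    """First user prompt only, <= 512 tokens, from users in India."""
    first_prompt = example["conversation"][0]["content"]
    return (
        example["country"] == "India"
        and len(tokenizer(first_prompt)["input_ids"]) <= 512
    )

filtered = raw.filter(keep)
prompts = [ex["conversation"][0]["content"] for ex in filtered]
```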
Training Strategy:
Weighted task loss
Weighted random sampling for class imbalance
Applied gradient clipping and a warmup scheduler (see the training-loop sketch after this list)
Real-time monitoring via Flask dashboard tracking loss curves, F1-scores, and accuracy across all prediction heads
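A condensed sketch of how those pieces fit together for one epoch; the hyperparameter values, batch field names, and the `model` interface (matching the head sketch above) are placeholders rather than the actual training script:

```python
import torch
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler
from transformers import get_linear_schedule_with_warmup

def train_one_epoch(model: torch.nn.Module, dataset: Dataset, task_labels: list[int],
                    class_counts: dict[int, int], class_weights: torch.Tensor) -> None:
    # Weighted random sampling: rare task types are drawn more often per epoch.
    sample_weights = [1.0 / class_counts[label] for label in task_labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(task_labels))
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Class-weighted task loss plus the auxiliary complexity and web-search losses.
    task_loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
    complexity_loss_fn = torch.nn.MSELoss()
    bool_loss_fn = torch.nn.BCEWithLogitsLoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(   # warmup then linear decay (simplified to one epoch)
        optimizer, num_warmup_steps=500, num_training_steps=len(loader)
    )

    for batch in loader:
        out = model(batch["input_ids"], batch["attention_mask"])
        loss = (
            task_loss_fn(out["task_logits"], batch["task_label"])
            + complexity_loss_fn(out["complexity"], batch["complexity_targets"])
            + bool_loss_fn(out["web_search_logit"].squeeze(-1), batch["web_search"].float())
        )
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```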
The training dashboard provides live visibility into convergence, showing smooth loss decay and F1-score saturation—critical for catching training issues early and validating multi-task learning effectiveness.
Training Monitor Dashboard
Implemented Optuna-based Bayesian optimization with 50 trials:
Outcome: 8.2% F1-score improvement over the baseline. Learning rate and batch size proved most important. A MedianPruner terminated unpromising trials early.
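A minimal sketch of that search setup; the search space and the `train_and_eval` helper are assumptions, and only the 50-trial Bayesian search with median pruning mirrors the description above:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space is illustrative; learning rate and batch size mattered most in practice.
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    warmup_steps = trial.suggest_int("warmup_steps", 0, 1000)
    # train_and_eval is a hypothetical helper that trains briefly, reports intermediate
    # scores via trial.report() so the pruner can stop weak trials, and returns validation F1.
    return train_and_eval(lr=lr, batch_size=batch_size, warmup_steps=warmup_steps, trial=trial)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(),   # Bayesian (TPE) search
    pruner=optuna.pruners.MedianPruner(),   # terminate unpromising trials early
)
study.optimize(objective, n_trials=50)
print(study.best_params)
```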
Two-Stage Optimization Pipeline
Serializes the FP32 model to ONNX format for cross-platform deployment (see the export sketch after this list)
Enables automatic operator fusion by ONNX Runtime
Eliminates PyTorch runtime dependency (saves 500MB)
Conservative thread settings for Lambda constraints
Disabled CPU memory arena (faster on serverless)
Graph optimization level: BASIC (faster init)
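A sketch of the export and session setup under those settings; the file name, input/output names, and opset are assumptions, and the traced model is assumed to expose the three outputs described earlier:

```python
import torch
import onnxruntime as ort

def export_to_onnx(model: torch.nn.Module, path: str = "routex.onnx") -> None:
    """Serialize the FP32 PyTorch model to ONNX (assumed input/output names and opset)."""
    dummy_ids = torch.zeros(1, 512, dtype=torch.long)
    dummy_mask = torch.ones(1, 512, dtype=torch.long)
    torch.onnx.export(
        model, (dummy_ids, dummy_mask), path,
        input_names=["input_ids", "attention_mask"],
        output_names=["task_logits", "complexity", "web_search_logit"],
        dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
        opset_version=17,
    )

def load_session(path: str = "routex.onnx") -> ort.InferenceSession:
    """ONNX Runtime session tuned for Lambda's constrained CPU environment."""
    so = ort.SessionOptions()
    so.intra_op_num_threads = 1                      # conservative threads for Lambda vCPUs
    so.inter_op_num_threads = 1
    so.enable_cpu_mem_arena = False                  # skip the CPU memory arena
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC  # faster init than EXTENDED
    return ort.InferenceSession(path, sess_options=so, providers=["CPUExecutionProvider"])
```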
The comparison was performed on a g6.2xlarge machine using CPU-only inference.
Overall accuracy: 76.41%
Weighted F1-score: 0.766
Macro F1-score: 0.612
Interpretation
The model performs reasonably well overall, driven mainly by high-support classes.
The gap between weighted F1 (0.766) and macro F1 (0.612) indicates strong class imbalance issues.
Performance is not uniform across intent types.
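As a toy illustration (made-up labels, not RouteX data), this is exactly the pattern that opens a gap between weighted and macro F1 when one class dominates:

```python
from sklearn.metrics import f1_score

# 90 samples of a dominant intent vs. 10 of a rare one; the rare intent is mostly missed.
y_true = ["qa"] * 90 + ["creative"] * 10
y_pred = ["qa"] * 88 + ["creative"] * 2 + ["qa"] * 8 + ["creative"] * 2

print(f1_score(y_true, y_pred, average="weighted"))  # high, dominated by the frequent class
print(f1_score(y_true, y_pred, average="macro"))     # much lower, dragged down by the rare class
```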
Packaged entire system (model + dependencies) as Docker container for AWS Lambda:
Single package includes all dependencies + model files
Avoids the 250MB Lambda layer/package limit (container images support up to 10GB)
Lambda cold start reduced from 50s+ → 1.5-2s through:
Lazy Initialization: Only load model if rate limit check passes (fast rejection path)
ONNX Runtime Tuning: Graph optimization level = BASIC (faster than EXTENDED)
Global State Reuse: Model, router, API client cached in Python globals across warm invocations
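A sketch of the handler pattern behind these three optimizations; the model path, event fields, and the `check_rate_limit`, `tokenize`, and `decide_route` helpers are hypothetical stand-ins for the real code:

```python
import json
from typing import Optional

import onnxruntime as ort

# Python globals persist across warm invocations of the same Lambda container,
# so the session is created at most once per container lifetime.
_session: Optional[ort.InferenceSession] = None

def _get_session() -> ort.InferenceSession:
    global _session
    if _session is None:                                    # lazy initialization
        _session = ort.InferenceSession("/var/task/routex.onnx")   # path assumed
    return _session

def handler(event: dict, context) -> dict:
    user_id = event["user_id"]                              # event fields assumed
    if not check_rate_limit(user_id):                       # hypothetical helper: fast rejection path
        return {"statusCode": 429, "body": json.dumps({"error": "daily limit reached"})}
    outputs = _get_session().run(None, tokenize(event["prompt"]))   # tokenize(): hypothetical helper
    return {"statusCode": 200, "body": json.dumps({"route": decide_route(outputs)})}  # hypothetical helper
```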
Latency Breakdown:
Lambda cold start: 10-15s
Lambda warm start: 2-3s
Rate limiter check: 50ms
Model load time: 1-2s
Model classification: <200ms
Router decision: 5-10ms
External API call: varies with prompt complexity
Implemented atomic per-user-per-day rate limiting using conditional DynamoDB updates:
Check date and increment counter atomically
Automatic daily reset (no TTL needed)
Tracks total tokens used and cost incurred
Success-only recording: metrics recorded only after external API succeeds
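One way to get the behaviour described above with a single conditional UpdateItem call; the table name, key layout (user id plus date baked into the key), and daily limit are assumptions, and a failed condition maps to the fast 429 rejection path:

```python
import datetime

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "routex-usage"   # table name, key schema, and limit are assumptions
DAILY_LIMIT = 50

def check_and_increment(user_id: str) -> bool:
    """Atomically count one request for (user, today); False once the daily limit is hit."""
    today = datetime.date.today().isoformat()
    try:
        dynamodb.update_item(
            TableName=TABLE_NAME,
            # Keying on user_id + date gives the automatic daily reset: a new day
            # simply starts a fresh item, so no TTL or cleanup job is needed.
            Key={"pk": {"S": f"{user_id}#{today}"}},
            UpdateExpression="SET request_count = if_not_exists(request_count, :zero) + :one",
            ConditionExpression="attribute_not_exists(request_count) OR request_count < :limit",
            ExpressionAttributeValues={
                ":zero": {"N": "0"},
                ":one": {"N": "1"},
                ":limit": {"N": str(DAILY_LIMIT)},
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # limit reached for today
        raise
```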
RouteX Model
Clean up the label space: Merge ultra-low-support and semantically overlapping intents to remove unlearnable classes and reduce noise.
Fix class imbalance: Collect more samples for minority intents and/or use class-weighted loss and oversampling to prevent dominant classes from skewing learning.
Use better metrics: Track macro F1, per-class recall, confidence-based fallback, and calibration instead of relying on accuracy alone.
Add confidence thresholds & fallback: Route low-confidence predictions to a safe fallback to avoid risky misclassifications in production.
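A minimal sketch of such a confidence gate; the threshold, the `TASK_TO_TIER` mapping, and the fallback label are placeholders:

```python
import numpy as np

TASK_TO_TIER: dict[int, str] = {}   # placeholder for the real task-index -> model-tier mapping

def pick_route(task_logits: np.ndarray, threshold: float = 0.6) -> str:
    """Send low-confidence predictions to a safe fallback instead of a risky specialised route."""
    probs = np.exp(task_logits - task_logits.max())
    probs /= probs.sum()
    if probs.max() < threshold:                     # model is unsure about the task type
        return "fallback-tier"
    return TASK_TO_TIER.get(int(probs.argmax()), "fallback-tier")
```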
ML Ops
Move inference off Lambda to always-warm compute: Deploy the model on ECS/Fargate or EKS (or a small always-on EC2 service) to eliminate cold starts and stabilise latency for frequent requests.
Use lightweight, optimised models in containers: Serve RouteX via a containerised inference service with the model preloaded and horizontal auto-scaling.
Adopt a hybrid architecture: Keep Lambda only as a thin request router, forwarding inference to the container/EC2 service, preserving serverless flexibility while avoiding model warm-up penalties.