Umang Kaushik
AI Engineer · ML Researcher
Pioneered SimplifAI, an agentic AI system for automated incident resolution using Google ADK, reducing MTTR and improving SLA adherence. Built the full stack with GraphQL, FastAPI, React, and ShadCN. Designed high-throughput alert pipelines with RabbitMQ, Redis, and Elasticsearch. Deployed on AWS EC2 with Docker, Nginx, and GitHub Actions CI/CD. Presented SimplifAI at India Mobile Congress 2025.
Engineered Nagios monitoring scripts reducing MTTD by 20%. Automated SSL certificate lifecycles with Certbot and Route53, eliminating certificate-related downtime. Tuned CloudWatch alarms to maintain 99.9% availability, achieving 15% P99 latency reduction and 10% cloud cost decrease.
Implemented PagedAttention from the vLLM paper using custom Triton GPU kernels for memory-efficient LLM inference. Built paged KV cache with block allocation, reference counting, and prefix caching. Developed continuous batching scheduler with chunked prefill.
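The block-allocation and reference-counting scheme can be sketched as a small allocator; this is a minimal illustration in the spirit of vLLM's paged KV cache, with illustrative names (`BlockAllocator`, `fork`) that are not the vLLM API:

```python
class BlockAllocator:
    """Toy paged KV-cache block allocator with reference counting."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of free physical block ids
        self.refcount = {}                   # block id -> number of owners

    def allocate(self) -> int:
        # Hand out a free block to a sequence and start its refcount at 1.
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block: int) -> int:
        # Prefix caching: a forked sequence shares the same physical block,
        # so we only bump the reference count instead of copying.
        self.refcount[block] += 1
        return block

    def free_block(self, block: int) -> None:
        # Release one owner; the block returns to the pool only when the
        # last owner frees it.
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)
```

Copy-on-write then only triggers when a shared block's refcount is above one and a sequence needs to append to it.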
Implemented Flash Attention 2 in CUDA using shared memory tiling and online softmax, reducing memory complexity from O(n^2) to O(n).
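The online-softmax trick behind that memory reduction can be shown in a few lines of NumPy: keep a running max and running denominator over chunks so only O(chunk) values are live at once, yet the result matches the full softmax exactly (a sketch of the idea, not the CUDA kernel itself):

```python
import numpy as np

def online_softmax(scores: np.ndarray, chunk: int = 4) -> np.ndarray:
    # Stream over the scores in chunks, maintaining a running max `m`
    # and running denominator `d`. Rescaling `d` by exp(m - m_new)
    # keeps everything numerically stable without materializing the
    # full exp(scores) array.
    m, d = -np.inf, 0.0
    for i in range(0, len(scores), chunk):
        x = scores[i:i + chunk]
        m_new = max(m, x.max())
        d = d * np.exp(m - m_new) + np.exp(x - m_new).sum()
        m = m_new
    return np.exp(scores - m) / d
```

Flash Attention fuses this running rescale into the attention-times-value accumulation so the n-by-n score matrix never hits global memory.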
Fine-tuned Qwen2.5 3B on math reasoning using the GRPO loss. Investigated reward hacking during RL post-training with HuggingFace TRL. Used Unsloth and vLLM for quantized multi-GPU training. Explored the lower bound of reasoning capability in small models.
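The core of the GRPO loss is the group-relative advantage: each sampled completion's reward is normalized against the mean and standard deviation of its own sampling group, removing the need for a learned value function. A minimal sketch (the epsilon is an illustrative stability constant):

```python
import numpy as np

def grpo_advantages(rewards) -> np.ndarray:
    # Group-relative advantage: standardize rewards within one group of
    # completions sampled for the same prompt. Completions better than
    # the group average get positive advantage, worse ones negative.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

These advantages then weight the token log-probabilities in a clipped PPO-style objective.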
Hybrid Anki CLI for humans and AI agents. Supports both AnkiConnect and direct SQLite backends. Features a full search query language compiled to SQL, interactive TUI review mode, and ships a SKILL.md for autonomous agent integration. Published on PyPI.
Mobile agent that understands natural language and executes actions on Android. Uses an LLM as the reasoning engine with phone capabilities exposed as callable tools. Built with Expo SDK 54, React Native, Tamagui, and Zustand.
PyTorch implementation of DeepSeek's Engram paper, augmenting transformer attention with n-gram memory retrieval via hash-based lookup and learned gating. Trained on WikiText-103 with Modal cloud deployment.
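The hash-based lookup with gating can be sketched in NumPy; this is an illustration of the mechanism, not DeepSeek's exact architecture, and the class and parameter names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

class NGramMemory:
    """Toy n-gram memory: hash each n-gram to a row of a fixed table,
    then gate the retrieved vector into the hidden state."""

    def __init__(self, n: int = 2, buckets: int = 1024, dim: int = 16):
        self.n, self.buckets = n, buckets
        self.table = rng.standard_normal((buckets, dim)) * 0.02
        self.gate = np.zeros(dim)  # per-channel gate; learned in training

    def lookup(self, token_ids) -> np.ndarray:
        rows = []
        for i in range(len(token_ids)):
            gram = tuple(token_ids[max(0, i - self.n + 1):i + 1])
            idx = hash(gram) % self.buckets   # hash-based bucket lookup
            rows.append(self.table[idx])
        return np.stack(rows)

    def forward(self, hidden: np.ndarray, token_ids) -> np.ndarray:
        g = 1.0 / (1.0 + np.exp(-self.gate))  # sigmoid gate in [0, 1]
        return hidden + g * self.lookup(token_ids)
```

The gate lets the model learn per-channel how much to trust the retrieved n-gram memory versus the attention pathway.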
PyTorch implementation of DeepSeek's Manifold-constrained Hyper-Connections, integrating multi-stream transformer routing into NanoGPT. Includes a Modal cloud training pipeline on A100 GPUs.
GPU-accelerated ray tracer ported from a custom CPU framework to CUDA. Implements materials, spheres, camera systems, and hittable lists entirely on the GPU with clean header-only architecture.
A tiny C compiler written in Python. Handles lexing, parsing, and x86 assembly code generation. Supports functions and basic C constructs. Based on Nora Sandler's incremental compiler approach.
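The lexing stage of such a compiler fits in a regex-driven loop; a minimal sketch in that style (token names and the exact token set here are illustrative, not the project's actual grammar):

```python
import re

# Keyword patterns must precede IDENT so "int" lexes as a keyword,
# not an identifier.
TOKEN_SPEC = [
    ("INT",    r"\bint\b"),
    ("RETURN", r"\breturn\b"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("LBRACE", r"\{"), ("RBRACE", r"\}"),
    ("LPAREN", r"\("), ("RPAREN", r"\)"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),
]

def lex(source: str):
    # Combine the specs into one alternation of named groups; the
    # matched group name tells us which token kind fired.
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

The parser then consumes this token stream recursively, and code generation walks the resulting AST emitting x86 assembly.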
Tiny autograd engine built from scratch. Implements reverse-mode automatic differentiation, a neural network module system, and trains on scikit-learn's make_moons dataset as a demonstration.
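Reverse-mode automatic differentiation of this kind reduces to a scalar `Value` node that records its parents and a local backward rule; a minimal sketch with only add and mul (illustrative of the technique, not the project's exact code):

```python
class Value:
    """Scalar node in a dynamically built computation graph."""

    def __init__(self, data: float, parents=()):
        self.data, self.grad = data, 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            # d(a + b)/da = d(a + b)/db = 1, so the gradient flows through.
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            # Product rule: d(ab)/da = b, d(ab)/db = a.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply each node's local
        # rule in reverse order (the chain rule, mechanized).
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
```

For z = x*y + x with x = 2 and y = 3, calling `z.backward()` yields dz/dx = y + 1 = 4 and dz/dy = x = 2.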
Compiler Durden · ubermenchh · last updated Feb 2026