The wind helping my hair mimic the shape of the Sydney Opera House

Kaushik Koirala

Software Engineer · ML & Systems

Hello! I'm Kaushik Koirala (कौशिक कोइराला · Google Translate's slightly robotic but surprisingly good Nepali pronunciation of my name, courtesy of soundoftext.com)

Welcome to my home on the internet :) I'm currently a Master's student in Electrical and Computer Engineering at Carnegie Mellon University, graduating December 2026. I've long been drawn to understanding systems end to end, from high-level architectures all the way down to specific implementations.

My software journey started when I graduated from the University of Texas at Austin (Hook 'em 🤘) with a degree in Computer Science. Then, at Citibank, I built microservices for KYC and login flows and helped advance a nascent cloud migration onto OpenShift. From there, at Cox Automotive, I built web applications and data pipelines for the car dealer ecosystem, underpinning them with scalable cloud infrastructure on AWS.

At CMU, I've kept exploring how systems work, this time in the emerging and increasingly ubiquitous AI landscape. Alongside ML coursework, I've studied HPC and GPU programming and conducted research profiling ring all-reduce collective communication on real Google TPU v4 hardware. Looking ahead, I want to contribute to the software infrastructure emerging around ML and help build robust, reliable, responsible, and safe solutions for the challenges ahead.

Outside of work: I enjoy traveling, finding new (preferably plant-based) places to eat, reading, and suffering through my favorite sports teams forgetting how to play football (Hala Madrid, Dale ATX).

Carnegie Mellon University Jan 2025 - Dec 2026

M.S. Electrical and Computer Engineering (Advanced Study)

Fast Code I & II (HPC & GPU Programming) · Intro to Deep Learning · ML in Production · Estimation, Detection & Learning · Networks in the Real World

University of Texas at Austin Aug 2016 - May 2020

B.S. Computer Science; Certificate in Applied Statistical Modeling

Carnegie Mellon University Aug 2025 – May 2026

Graduate Teaching Assistant — ML in Production (Fall 2025 & Spring 2026)

  • Mentoring graduate students on semester-long projects deploying production recommender systems with full MLOps stacks
  • Topics: model monitoring, CI/CD pipelines for ML, A/B testing infrastructure, data drift detection
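One of the drift checks we cover is easy to sketch: the population stability index (PSI) compares a feature's reference distribution to its live distribution. A minimal NumPy version, with bin count and thresholds being common conventions rather than anything prescribed by the course:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a live sample of one feature.

    Bins come from reference quantiles; by the usual rule of thumb,
    PSI < 0.1 means no significant drift and PSI > 0.25 means drift.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e = np.histogram(expected, edges)[0] / len(expected)
    # Clip live values into the reference range so out-of-range mass
    # lands in the extreme bins instead of being dropped.
    a = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)
    # Guard against empty bins before taking logs.
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

In practice this runs per feature on a schedule, with alerts keyed off the conventional 0.1/0.25 thresholds.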
Enverus Summer 2025

Software Engineering Intern

  • Built Streamlit visualization tooling for diffing electrical grid simulation outputs
  • Automated deployment pipelines for GPU workloads on AWS EKS
  • Productionized GPU-accelerated graph analytics using Modal, cuGraph, and ForceAtlas2
Python Streamlit AWS EKS Modal cuGraph GPU
Cox Automotive Jan 2022 – Nov 2024

Software Engineer II · promoted from SWE I, Mar 2023

  • Developed .NET/C# ETL pipelines processing automobile market data across multiple aggregation sources
  • Designed a custom auto-scaling system for AWS ECS services, responsive to both compute load and downstream Oracle DB traffic constraints
  • Built high-traffic REST APIs on containerized ECS tasks, replacing monolithic EC2-deployed applications
  • Applied DBSCAN clustering for deduplication of automobile records across automation pipelines
C#/.NET AWS ECS Oracle/PL-SQL Terraform Docker REST APIs
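The deduplication idea is simple to sketch: embed each record as a feature vector and let DBSCAN group near-identical listings, with everything labeled noise treated as unique. A toy scikit-learn version (the production pipeline's features, scaling, and eps were tuned to real market data, so treat these values as placeholders):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy listings: (year, mileage in thousands, price in thousands).
records = np.array([
    [2018, 42.0, 21.5],   # listing A
    [2018, 42.1, 21.4],   # near-duplicate of A from another source
    [2015, 90.0, 11.0],   # listing B
    [2021, 10.0, 33.0],   # listing C
])

# Standardize features so a single eps is meaningful across dimensions.
scaled = (records - records.mean(axis=0)) / records.std(axis=0)

# Records sharing a non-negative label are duplicate candidates;
# label -1 marks noise, i.e. unique listings.
labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(scaled)
```

DBSCAN fits this problem because the number of duplicate groups isn't known up front, and most records should be singletons (noise) rather than forced into a cluster.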
Citibank Aug 2020 – Jan 2022

Software Engineer, Microservices

  • Developed Java Spring Boot microservices for Citi's digital KYC automation flows
  • Migrated services from Cloud Foundry (PCF) to containerized OpenShift deployments as part of early cloud adoption
  • Drove TDD adoption and substantially improved test coverage on existing microservice codebases
Java Spring Boot OpenShift JUnit TDD
Valassis Digital May 2019 – Aug 2019

DevOps Engineering Intern

  • Designed and deployed an automated retention algorithm to identify and purge stale deployment artifacts, logs, and metadata from Rundeck and Perforce Helix Core
  • Deployed as a long-running Python app to a Kubernetes cluster via Jenkins CD pipeline
  • Instrumented the system using Prometheus and Grafana
Python Kubernetes Jenkins Prometheus Grafana

TPU Topology-Aware All-Reduce

  • Wrote custom Pallas ring-all-reduce kernels and microbenchmarked them across a 32-core TPU v4 topology (2×2×4 3D Torus, 16 chips)
  • Isolated pure hardware execution latency from host-side PCIe synchronization overhead
  • Found logical ring distance is a poor predictor of hardware latency; intra-host electrical routing outperforms longer optical hops
  • Diagonal routing achieves near intra-host latency by pipelining traffic across the X, Y, and Z axes concurrently
  • Overlapping single-axis multi-hop traffic produces significant latency penalties with high jitter
JAX/Pallas TPU v4 custom kernels distributed systems

Softmax Kernel Optimization — CPU (AVX2)

  • Hand-tuned softmax kernel for Intel Xeon Broadwell using AVX2 SIMD intrinsics
  • Three-pass design: row-wise max, 6th-order Taylor exponentiation via interleaved FMA chains, and normalization
  • Derived algorithmic performance ceiling from register pressure analysis (16 YMM registers, 7 interleaved chains)
  • Measured each pass independently to diagnose the gap; achieved near-peak throughput on the exponentiation pass
  • Substantial speedup over PyTorch for L1-resident inputs
C/C++ AVX2/SIMD performance engineering HPC
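The three-pass structure translates almost directly to NumPy. A sketch of the same passes, with the Taylor exponentiation in Horner form so each step is one multiply-add (the real kernel interleaves seven such chains across AVX2 registers; note the 6th-order polynomial is only accurate when post-max values stay near zero):

```python
import numpy as np

# 6th-order Taylor coefficients of exp(x), highest order first (Horner form).
TAYLOR = [1 / 720, 1 / 120, 1 / 24, 1 / 6, 1 / 2, 1.0, 1.0]

def softmax_three_pass(row):
    # Pass 1: row-wise max for numerical stability.
    x = row - row.max()
    # Pass 2: polynomial exponentiation. Each Horner step is one fused
    # multiply-add (acc = acc * x + c) — the operation the AVX2 kernel
    # interleaves across independent chains to hide FMA latency.
    acc = np.full_like(x, TAYLOR[0])
    for c in TAYLOR[1:]:
        acc = acc * x + c
    # Pass 3: normalization.
    return acc / acc.sum()
```

The register-pressure analysis falls out of this structure: seven interleaved Horner chains plus the coefficient broadcasts is what exhausts the 16 YMM registers.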

CUDA Non-Maximum Suppression — Canny Edge Detection

  • Implemented non-maximum suppression and double thresholding stages of the Canny edge detection pipeline on NVIDIA T4
  • 66×66 shared memory tiles with 1-pixel padding for boundary conditions; 256 threads per block processing 4×4 pixel regions
  • Branchless arithmetic throughout to avoid warp divergence
  • Full pipeline achieved substantial end-to-end speedup across image sizes up to 7680×4288, compared to the OpenCV implementation
CUDA GPU programming shared memory warp optimization
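The suppression logic itself is simple to express with arrays. A NumPy sketch of the NMS stage that quantizes gradient direction into four bins and keeps local maxima using arithmetic masks instead of branches (the shared-memory tiling and thread mapping above are CUDA-specific and don't appear here):

```python
import numpy as np

def nms(mag, gx, gy):
    """Non-maximum suppression on gradient magnitude (interior pixels).

    The gradient angle is quantized to 0°, 45°, 90°, or 135°; a pixel
    survives only if its magnitude is at least that of both neighbors
    along its quantized direction. Direction selection and suppression
    use arithmetic masks, mirroring the branchless CUDA kernel.
    """
    H, W = mag.shape
    out = np.zeros_like(mag)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    d = ((angle + 22.5) // 45).astype(int) % 4        # direction bin, no branches
    offs = np.array([(0, 1), (-1, 1), (-1, 0), (-1, -1)])  # neighbor offset per bin
    dy, dx = offs[d, 0], offs[d, 1]

    ys, xs = np.mgrid[1:H - 1, 1:W - 1]
    dyc, dxc = dy[1:H - 1, 1:W - 1], dx[1:H - 1, 1:W - 1]
    m = mag[1:H - 1, 1:W - 1]
    n1 = mag[ys + dyc, xs + dxc]
    n2 = mag[ys - dyc, xs - dxc]
    keep = (m >= n1) & (m >= n2)
    out[1:H - 1, 1:W - 1] = m * keep   # multiply-by-mask instead of branching
    return out
```

On the GPU the same multiply-by-mask trick keeps all 32 lanes of a warp on the same instruction path regardless of each pixel's direction bin.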

Hybrid Curriculum Learning — CvT-13

  • Investigated combining data-level curriculum (blurred-to-sharp images) with model-level curriculum (LeRaC layer-wise learning rates) across ResNet-18, CvT-13, and ConvNeXt-Tiny on CIFAR-10 and ImageNet-100
  • Independently implemented and evaluated all baselines and the hybrid approach for the CvT-13 model across both datasets
  • Identified a fundamental conflict: LeRaC suppresses deep layers specializing in high-frequency features, while blur curriculum removes exactly those features from training data
  • Decoupling the strategies into sequential phases improved outcomes
PyTorch deep learning curriculum learning ViT/CvT
First Post