The wind helping my hair mimic the shape of the Sydney Opera House

Kaushik Koirala

Software Engineer · ML & Systems

Hello! I'm Kaushik Koirala (कौशिक कोइराला · Google Translate's slightly robotic but surprisingly good Nepali pronunciation of my name, courtesy of soundoftext.com)

Welcome to my home on the internet :) I'm currently a Master's student in Electrical and Computer Engineering at Carnegie Mellon University, graduating December 2026. I've long been drawn to understanding systems end to end, from high-level architectures all the way down to specific implementations.

My software journey started when I graduated from the University of Texas at Austin (Hook 'em 🤘) with a degree in Computer Science. Then, at Citibank, I built microservices for KYC and login flows and helped advance a nascent cloud migration onto OpenShift. From there, at Cox Automotive, I built web applications and data pipelines for the car dealer ecosystem, underpinning them with scalable cloud infrastructure on AWS.

At CMU, I've kept exploring how systems work, this time in the emerging and increasingly ubiquitous AI landscape. Alongside ML coursework, I've studied HPC and GPU programming and conducted research profiling ring all-reduce collective communication on real Google TPU v4 hardware. Looking ahead, I want to contribute to the software infrastructure emerging around ML and help build robust, reliable, responsible, and safe solutions for the challenges ahead.

Outside of work: I enjoy traveling, finding new (preferably plant-based) places to eat, reading, and suffering through my favorite sports teams forgetting how to play football (Hala Madrid, Dale ATX).

Carnegie Mellon University Jan 2025 - Dec 2026

M.S. Electrical and Computer Engineering (Advanced Study)

Fast Code I & II (HPC & GPU Programming) · Intro to Deep Learning · ML in Production · Estimation, Detection & Learning · Networks in the Real World

University of Texas at Austin Aug 2016 - May 2020

B.S. Computer Science; Certificate in Applied Statistical Modeling

Carnegie Mellon University Aug 2025 – May 2026

Graduate Teaching Assistant — ML in Production (Fall 2025 & Spring 2026)

  • Mentoring graduate students on semester-long projects deploying production recommender systems with full MLOps stacks
  • Topics: model monitoring, CI/CD pipelines for ML, A/B testing infrastructure, data drift detection
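One of the drift checks we cover is easy to sketch: the population stability index (PSI) compares a feature's reference distribution to its live distribution. A minimal NumPy version, with bin count and thresholds being common conventions rather than anything prescribed by the course:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a live sample of one feature.

    Bins come from reference quantiles; by the usual rule of thumb,
    PSI < 0.1 means no significant drift and PSI > 0.25 means drift.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e = np.histogram(expected, edges)[0] / len(expected)
    # Clip live values into the reference range so out-of-range mass
    # lands in the extreme bins instead of being dropped.
    a = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)
    # Guard against empty bins before taking logs.
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

In practice this runs per feature on a schedule, with alerts keyed off the conventional 0.1/0.25 thresholds.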
Enverus Summer 2025

Software Engineering Intern

  • Built Streamlit visualization tooling for diffing electrical grid simulation outputs
  • Automated deployment pipelines for GPU workloads on AWS EKS
  • Productionized GPU-accelerated graph analytics using Modal, cuGraph, and ForceAtlas2
Python Streamlit AWS EKS Modal cuGraph GPU
Cox Automotive Jan 2022 – Nov 2024

Software Engineer II · promoted from SWE I, Mar 2023

  • Developed .NET/C# ETL pipelines processing automobile market data across multiple aggregation sources
  • Designed a custom auto-scaling system for AWS ECS services, responsive to both compute load and downstream Oracle DB traffic constraints
  • Built high-traffic REST APIs on containerized ECS tasks, replacing monolithic EC2-deployed applications
  • Applied DBSCAN clustering for deduplication of automobile records across automation pipelines
C#/.NET AWS ECS Oracle/PL-SQL Terraform Docker REST APIs
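The deduplication idea is simple to sketch: embed each record as a feature vector and let DBSCAN group near-identical listings, with everything labeled noise treated as unique. A toy scikit-learn version (the production pipeline's features, scaling, and eps were tuned to real market data, so treat these values as placeholders):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy listings: (year, mileage in thousands, price in thousands).
records = np.array([
    [2018, 42.0, 21.5],   # listing A
    [2018, 42.1, 21.4],   # near-duplicate of A from another source
    [2015, 90.0, 11.0],   # listing B
    [2021, 10.0, 33.0],   # listing C
])

# Standardize features so a single eps is meaningful across dimensions.
scaled = (records - records.mean(axis=0)) / records.std(axis=0)

# Records sharing a non-negative label are duplicate candidates;
# label -1 marks noise, i.e. unique listings.
labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(scaled)
```

DBSCAN fits this problem because the number of duplicate groups isn't known up front, and most records should be singletons (noise) rather than forced into a cluster.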
Citibank Aug 2020 – Jan 2022

Software Engineer, Microservices

  • Developed Java Spring Boot microservices for Citi's digital KYC automation flows
  • Migrated services from Cloud Foundry (PCF) to containerized OpenShift deployments as part of early cloud adoption
  • Drove TDD adoption and substantially improved test coverage on existing microservice codebases
Java Spring Boot OpenShift JUnit TDD
Valassis Digital May 2019 – Aug 2019

DevOps Engineering Intern

  • Designed and deployed an automated retention algorithm to identify and purge stale deployment artifacts, logs, and metadata from Rundeck and Perforce Helix Core
  • Deployed as a long-running Python app to a Kubernetes cluster via Jenkins CD pipeline
  • Instrumented the system using Prometheus and Grafana
Python Kubernetes Jenkins Prometheus Grafana

TPU Topology-Aware All-Reduce

  • Wrote custom Pallas ring-all-reduce kernels and microbenchmarked them across a 32-core TPU v4 topology (2×2×4 3D Torus, 16 chips)
  • Isolated pure hardware execution latency from host-side PCIe synchronization overhead
  • Found logical ring distance is a poor predictor of hardware latency; intra-host electrical routing outperforms longer optical hops
  • Diagonal routing achieves near intra-host latency by pipelining traffic across the X, Y, and Z axes concurrently
  • Overlapping single-axis multi-hop traffic produces significant latency penalties with high jitter
JAX/Pallas TPU v4 custom kernels distributed systems

Softmax Kernel Optimization — CPU (AVX2)

  • Hand-tuned softmax kernel for Intel Xeon Broadwell using AVX2 SIMD intrinsics
  • Three-pass design: row-wise max, 6th-order Taylor exponentiation via interleaved FMA chains, and normalization
  • Derived algorithmic performance ceiling from register pressure analysis (16 YMM registers, 7 interleaved chains)
  • Measured each pass independently to diagnose the gap; achieved near-peak throughput on the exponentiation pass
  • Substantial speedup over PyTorch for L1-resident inputs
C/C++ AVX2/SIMD performance engineering HPC
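The three-pass structure translates almost directly to NumPy. A sketch of the same passes, with the Taylor exponentiation in Horner form so each step is one multiply-add (the real kernel interleaves seven such chains across AVX2 registers; note the 6th-order polynomial is only accurate when post-max values stay near zero):

```python
import numpy as np

# 6th-order Taylor coefficients of exp(x), highest order first (Horner form).
TAYLOR = [1 / 720, 1 / 120, 1 / 24, 1 / 6, 1 / 2, 1.0, 1.0]

def softmax_three_pass(row):
    # Pass 1: row-wise max for numerical stability.
    x = row - row.max()
    # Pass 2: polynomial exponentiation. Each Horner step is one fused
    # multiply-add (acc = acc * x + c) — the operation the AVX2 kernel
    # interleaves across independent chains to hide FMA latency.
    acc = np.full_like(x, TAYLOR[0])
    for c in TAYLOR[1:]:
        acc = acc * x + c
    # Pass 3: normalization.
    return acc / acc.sum()
```

The register-pressure analysis falls out of this structure: seven interleaved Horner chains plus the coefficient broadcasts is what exhausts the 16 YMM registers.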

CUDA Non-Maximum Suppression — Canny Edge Detection

  • Implemented non-maximum suppression and double thresholding stages of the Canny edge detection pipeline on NVIDIA T4
  • 66×66 shared memory tiles with 1-pixel padding for boundary conditions; 256 threads per block processing 4×4 pixel regions
  • Branchless arithmetic throughout to avoid warp divergence
  • Full pipeline achieved substantial end-to-end speedup across image sizes up to 7680×4288, compared to the OpenCV implementation
CUDA GPU programming shared memory warp optimization
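The suppression logic itself is simple to express with arrays. A NumPy sketch of the NMS stage that quantizes gradient direction into four bins and keeps local maxima using arithmetic masks instead of branches (the shared-memory tiling and thread mapping above are CUDA-specific and don't appear here):

```python
import numpy as np

def nms(mag, gx, gy):
    """Non-maximum suppression on gradient magnitude (interior pixels).

    The gradient angle is quantized to 0°, 45°, 90°, or 135°; a pixel
    survives only if its magnitude is at least that of both neighbors
    along its quantized direction. Direction selection and suppression
    use arithmetic masks, mirroring the branchless CUDA kernel.
    """
    H, W = mag.shape
    out = np.zeros_like(mag)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    d = ((angle + 22.5) // 45).astype(int) % 4        # direction bin, no branches
    offs = np.array([(0, 1), (-1, 1), (-1, 0), (-1, -1)])  # neighbor offset per bin
    dy, dx = offs[d, 0], offs[d, 1]

    ys, xs = np.mgrid[1:H - 1, 1:W - 1]
    dyc, dxc = dy[1:H - 1, 1:W - 1], dx[1:H - 1, 1:W - 1]
    m = mag[1:H - 1, 1:W - 1]
    n1 = mag[ys + dyc, xs + dxc]
    n2 = mag[ys - dyc, xs - dxc]
    keep = (m >= n1) & (m >= n2)
    out[1:H - 1, 1:W - 1] = m * keep   # multiply-by-mask instead of branching
    return out
```

On the GPU the same multiply-by-mask trick keeps all 32 lanes of a warp on the same instruction path regardless of each pixel's direction bin.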

Hybrid Curriculum Learning — CvT-13

  • Investigated combining data-level curriculum (blurred-to-sharp images) with model-level curriculum (LeRaC layer-wise learning rates) across ResNet-18, CvT-13, and ConvNeXt-Tiny on CIFAR-10 and ImageNet-100
  • Independently implemented and evaluated all baselines and the hybrid approach for the CvT-13 model across both datasets
  • Identified a fundamental conflict: LeRaC suppresses deep layers specializing in high-frequency features, while blur curriculum removes exactly those features from training data
  • Decoupling the strategies into sequential phases improved outcomes
PyTorch deep learning curriculum learning ViT/CvT
First Post