Architecture-aware Performance & Model Compression

GPU Performance and Memory Modeling

Student: Lime Yao

Research in this area aims to improve the predictive accuracy and practical utility of system-technology co-optimization (STCO) for large language model (LLM) workloads through abstract performance modeling of GPU-based and heterogeneous compute systems. STCO co-optimizes architectural features and manufacturing technologies, letting designers explore how decisions made at the architecture and technology levels can have non-obvious effects on system-level performance. This requires abstractions that balance modeling accuracy against computational tractability, so that STCO-driven design exploration remains practical. Key topics include the GPU memory hierarchy and its performance bottlenecks, AI accelerator microarchitecture, KV cache behavior, and NUMA effects in chiplet-based or multi-GPU systems.
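As a rough illustration of the kind of abstraction involved, the sketch below estimates KV cache size and a roofline-style bound on the time for one LLM decode step. It is not the group's model: the names `Device`, `kv_cache_bytes`, `roofline_time`, and `decode_step_time` are hypothetical, and the sketch assumes a dense transformer, about 2 FLOPs per parameter per generated token, negligible attention FLOPs, and weights plus KV cache read once per step.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device description: two numbers, no cache hierarchy."""
    peak_flops: float      # peak compute throughput, FLOP/s
    mem_bandwidth: float   # peak off-chip memory bandwidth, bytes/s


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """KV cache footprint: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def roofline_time(flops, bytes_moved, dev):
    """Roofline bound: the step takes at least as long as the slower of
    compute and memory traffic, assuming the two overlap perfectly."""
    return max(flops / dev.peak_flops, bytes_moved / dev.mem_bandwidth)


def decode_step_time(n_params, kv_bytes, batch, dev, bytes_per_param=2):
    """Lower bound on one decode step (assumption: attention FLOPs ignored)."""
    flops = 2 * n_params * batch                      # ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param + kv_bytes  # weights + KV cache read once
    return roofline_time(flops, bytes_moved, dev)
```

With illustrative inputs (32 layers, 32 KV heads, head dimension 128, a 4096-token context, batch size 1, 16-bit values), `kv_cache_bytes` gives 2 GiB. For a hypothetical 7-billion-parameter model on a hypothetical device with 1e15 FLOP/s and 2e12 bytes/s, the memory term dominates the bound, which is the kind of bottleneck a more detailed memory-hierarchy model would then refine.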
