HGPU group
@hgpu.bsky.social
High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
#CUDA #CodeGeneration #Performance #Package
hgpu.org?p=30343
Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automati…
hgpu.org
November 9, 2025 at 4:29 PM
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs
#CUDA #HIP #Compression #Package
hgpu.org?p=30342
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP …
hgpu.org
November 9, 2025 at 4:28 PM
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
#FP8 #Precision
hgpu.org?p=30341
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computatio…
hgpu.org
November 9, 2025 at 4:28 PM
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations
#CUDA #DeepLearning #DL #Package
hgpu.org?p=30330
Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can…
hgpu.org
November 2, 2025 at 4:05 PM
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks
#CUDA #LLM #AutoTuning #PerformancePortability #Package
hgpu.org?p=30329
Transformer-based models such as BERT and GPT2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To addre…
hgpu.org
November 2, 2025 at 4:04 PM
Scalable GPU-Based Integrity Verification for Large Machine Learning Models
#SYCL #oneAPI #Rust #Security #Package
hgpu.org?p=30327
We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. …
hgpu.org
November 2, 2025 at 4:02 PM
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
#CUDA #MachineLearning #ML #Package
hgpu.org?p=30326
Modern AI hardware, such as Nvidia’s Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language …
hgpu.org
November 2, 2025 at 4:02 PM
Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels
#CUDA #Chemistry #MolecularDocking #Package
hgpu.org?p=30318
Tensor Cores (TCs) are specialized hardware units designed for efficient matrix multiplication and are widely utilized in deep learning workloads. However, their adoption in more irregular high-per…
hgpu.org
October 26, 2025 at 8:04 PM
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines
#AMD #FPGA #CodeGeneration #AI
hgpu.org?p=30316
We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototy…
hgpu.org
October 26, 2025 at 8:03 PM
Thesis: Compiler and Runtime Systems for Generative AI Models
#CUDA #LLM #DeepLearning #DL #Package
hgpu.org?p=30305
Generative AI (GenAI) workloads have rapidly become the predominant data center GPU workload. However, designing efficient GPU kernels for GenAI presents significant challenges due to two central f…
hgpu.org
October 19, 2025 at 8:41 PM
Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation
#SYCL #HIP #CUDA #Performance #Package
hgpu.org?p=30304
Specializing kernels by including runtime information during just-in-time (JIT) compilation can improve performance at the expense of potentially generating more kernels. In this work, we contribu…
hgpu.org
October 19, 2025 at 8:40 PM
A Performance Portable Matrix Free Dense MTTKRP in GenTen
#Kokkos #CUDA #OpenMP #Package
hgpu.org?p=30302
We extend the GenTen tensor decomposition package by introducing an accelerated dense matricized tensor times Khatri-Rao product (MTTKRP), the workhorse kernel for canonical polyadic (CP) tensor de…
hgpu.org
October 19, 2025 at 8:40 PM
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
#CUDA #ROCm #Performance #DeepLearning #DL #Package
hgpu.org?p=30301
Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor c…
hgpu.org
October 19, 2025 at 8:35 PM
Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR
#MLIR #OpenCL #Testing #Package
hgpu.org?p=30291
MLIR (Multi-Level Intermediate Representation) has rapidly become a foundational technology for modern compiler frameworks, enabling extensibility across diverse domains. However, ensuring the corr…
hgpu.org
October 12, 2025 at 2:48 PM
ConCuR: Conciseness Makes State-of-the-Art Kernel Generation
#CUDA #CodeGeneration #LLM #DeepLearning #DL #Package
hgpu.org?p=30290
GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the s…
hgpu.org
October 12, 2025 at 2:48 PM
Accelerating cosmological simulations on GPUs: a portable approach using OpenMP
#OpenMP #HPC #Astrophysics #Package
hgpu.org?p=30289
In this work we present the porting to Graphics Processing Units (GPUs, using OpenMP target directives) and optimization of a key module within the cosmological pinocchio code, a Lagrangian Pertu…
hgpu.org
October 12, 2025 at 2:47 PM
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models
#CUDA #LLM #AI #DeepLearning #DL #PyTorch
hgpu.org?p=30288
CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promis…
hgpu.org
October 12, 2025 at 2:47 PM