Explores Diffusion Transformers for text-to-image tasks, introducing DiT-Air and DiT-Air-Lite. Achieves state-of-the-art results with fewer parameters, highlighting efficient models' potential.
Explores Diffusion Transformers for text-to-image tasks, introducing DiT-Air and DiT-Air-Lite. Achieves state-of-the-art results with fewer parameters, highlighting efficient models' potential.
ETCH fits 3D body models to clothed human point clouds using SE(3) equivariance, mapping cloth-to-body surfaces. It generalizes well across poses/garments, excelling in accuracy and robustness.
ETCH fits 3D body models to clothed human point clouds using SE(3) equivariance, mapping cloth-to-body surfaces. It generalizes well across poses/garments, excelling in accuracy and robustness.
UniGoal is a novel framework for zero-shot goal-oriented navigation using a uniform graph to unify goals. Leveraging large language models, it achieves top performance on multiple benchmarks, surpassing task-specific and supervised methods.
UniGoal is a novel framework for zero-shot goal-oriented navigation using a uniform graph to unify goals. Leveraging large language models, it achieves top performance on multiple benchmarks, surpassing task-specific and supervised methods.
ConsisLoRA is a novel LoRA-based style transfer method that boosts content and style consistency by tuning LoRA weights to predict the original image, solving key style transfer challenges.
ConsisLoRA is a novel LoRA-based style transfer method that boosts content and style consistency by tuning LoRA weights to predict the original image, solving key style transfer challenges.
HSAT enhances vision model robustness in histopathology, surpassing current methods in adversarial settings and setting new healthcare model reliability benchmarks.
HSAT enhances vision model robustness in histopathology, surpassing current methods in adversarial settings and setting new healthcare model reliability benchmarks.
Introduces Dynamic Tanh (DyT), replacing normalization layers with an element-wise operation. DyT adjusts activation range, matching/exceeding traditional methods, challenging the need for normalization.
Introduces Dynamic Tanh (DyT), replacing normalization layers with an element-wise operation. DyT adjusts activation range, matching/exceeding traditional methods, challenging the need for normalization.
Maps the landscape of publicly available neural nets on Hugging Face, unearthing trends, model attributes, and aiding in prediction and documentation through real-world training practices.
Maps the landscape of publicly available neural nets on Hugging Face, unearthing trends, model attributes, and aiding in prediction and documentation through real-world training practices.
HybridVLA unifies vision-language-action with autoregressive and diffusion policies for better robotic manipulation. Collaborative training boosts performance across tasks, outperforming previous methods in simulations and real-world tests.
HybridVLA unifies vision-language-action with autoregressive and diffusion policies for better robotic manipulation. Collaborative training boosts performance across tasks, outperforming previous methods in simulations and real-world tests.
V2Edit offers training-free, instruction-led video and 3D scene editing, balancing content preservation and edits. It excels in complex scenarios with fast-moving cameras and major geometric shifts.
V2Edit offers training-free, instruction-led video and 3D scene editing, balancing content preservation and edits. It excels in complex scenarios with fast-moving cameras and major geometric shifts.
Introducing M-Attack: achieving over 90% success on black-box LVLMs like GPT-4.5 by encoding semantic details in local image regions, surpassing previous state-of-the-art attack methods.
Introducing M-Attack: achieving over 90% success on black-box LVLMs like GPT-4.5 by encoding semantic details in local image regions, surpassing previous state-of-the-art attack methods.
OVTR is an end-to-end open-vocabulary multiple object tracking system using transformers. It introduces Category Information Propagation and dual-branch structures for better classification and faster inference.
OVTR is an end-to-end open-vocabulary multiple object tracking system using transformers. It introduces Category Information Propagation and dual-branch structures for better classification and faster inference.
CoSTA* is an innovative cost-sensitive toolpath agent for multi-turn image editing. It merges LLMs and A* search to optimize AI tool use, achieving a superior cost-quality balance on new benchmarks.
CoSTA* is an innovative cost-sensitive toolpath agent for multi-turn image editing. It merges LLMs and A* search to optimize AI tool use, achieving a superior cost-quality balance on new benchmarks.
Tackles reduced diversity in distilled diffusion models by introducing diversity distillation, enhancing diversity beyond base models while maintaining efficiency.
Tackles reduced diversity in distilled diffusion models by introducing diversity distillation, enhancing diversity beyond base models while maintaining efficiency.
Shows how learning rate annealing, like cosine schedules, boosts robustness and cuts computational load, enabling sublinear convergence in large-scale model training.
Shows how learning rate annealing, like cosine schedules, boosts robustness and cuts computational load, enabling sublinear convergence in large-scale model training.
Enhances ISAC beamforming, improving spectrum efficiency and estimation accuracy using SCA for convex subproblems and lower complexity.
Enhances ISAC beamforming, improving spectrum efficiency and estimation accuracy using SCA for convex subproblems and lower complexity.
Tackles privacy risks in PFL from Membership Inference Attacks in IFCA. Introduces IFCA-MIR to enhance clustering with MIA risk assessment, boosting accuracy, fairness, and privacy.
Tackles privacy risks in PFL from Membership Inference Attacks in IFCA. Introduces IFCA-MIR to enhance clustering with MIA risk assessment, boosting accuracy, fairness, and privacy.
GraFEI uses graph neural networks to reconstruct events at the Belle II experiment, boosting B meson decay detection with invisible particles. It enhances background reduction while keeping signal efficiency high.
GraFEI uses graph neural networks to reconstruct events at the Belle II experiment, boosting B meson decay detection with invisible particles. It enhances background reduction while keeping signal efficiency high.
Explores post-training LLMs for instruction-following in English & Finnish, tackling data scarcity in small languages. Shows competitive Finnish results with few samples & best outcomes using bilingual data. Open-sourced.
Explores post-training LLMs for instruction-following in English & Finnish, tackling data scarcity in small languages. Shows competitive Finnish results with few samples & best outcomes using bilingual data. Open-sourced.
CASTLE evaluates static analyzers, formal verification tools, and LLMs on C program vulnerability detection. It highlights pros and cons of each, with a novel CASTLE Score for fair tool comparison.
CASTLE evaluates static analyzers, formal verification tools, and LLMs on C program vulnerability detection. It highlights pros and cons of each, with a novel CASTLE Score for fair tool comparison.
Examines ResNet50’s multi-instance performance on Nvidia Jetson accelerators, focusing on batch size effects on throughput, latency, and scheduling amid resource contention.
Examines ResNet50’s multi-instance performance on Nvidia Jetson accelerators, focusing on batch size effects on throughput, latency, and scheduling amid resource contention.
Tackles finding equilibria in polymatrix games with privacy. Introduces a distributed algorithm offering both accuracy and privacy as player numbers grow, advancing multi-agent systems.
Tackles finding equilibria in polymatrix games with privacy. Introduces a distributed algorithm offering both accuracy and privacy as player numbers grow, advancing multi-agent systems.
TPDiff is a temporal pyramid video diffusion model boosting video generation efficiency by adjusting frame rates during diffusion. It cuts training/inference costs and outperforms traditional methods.
TPDiff is a temporal pyramid video diffusion model boosting video generation efficiency by adjusting frame rates during diffusion. It cuts training/inference costs and outperforms traditional methods.
Manify, a Python library, enables learning non-Euclidean representations via manifold learning. It aids data embedding into non-Euclidean spaces, boosting classification, regression, and curvature estimation.
Manify, a Python library, enables learning non-Euclidean representations via manifold learning. It aids data embedding into non-Euclidean spaces, boosting classification, regression, and curvature estimation.
Explores post-training of large video models to better simulate physics, especially object freefall. Introduces PISA framework for fine-tuning and reward modeling to boost physical accuracy in video generation.
Explores post-training of large video models to better simulate physics, especially object freefall. Introduces PISA framework for fine-tuning and reward modeling to boost physical accuracy in video generation.
AF2 is an advanced Audio-Language Model improving reasoning over non-speech sounds and music. It uses a custom CLAP model, synthetic audio QA data, and a curriculum learning strategy for top results.
AF2 is an advanced Audio-Language Model improving reasoning over non-speech sounds and music. It uses a custom CLAP model, synthetic audio QA data, and a curriculum learning strategy for top results.