Hritik
@hbxnov.bsky.social
CS PhD @UCLA | Prev: Bachelors @IITDelhi, Student Researcher @GoogleDeepMind, Intern @AmazonScience | Multimodal ML, Language models | Cricket🏏

http://sites.google.com/view/hbansal
Paper: arxiv.org/abs/2412.12661
Website: mint-medmax.github.io
Code: github.com/Hritikbansal...
Demo: huggingface.co/spaces/mint-...

Thanks to the great effort by our entire group at UCLA w/
Daniel Israel, Siyan Zhao, Shufan Li, Tung Nguyen, and Aditya Grover!
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
December 19, 2024 at 6:19 PM
Finally, we instruction-tune Chameleon to create the MedMax-7B model. We show that our model achieves SOTA performance on multiple downstream VQA tasks and beats GPT-4o and LLaVA-Med-1.5 by a large margin.
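For readers who want to try the model, here is a hedged sketch of VQA inference with a Chameleon-style checkpoint through Hugging Face transformers. The checkpoint id below is a placeholder; the actual MedMax-7B weights and loading code are linked from the project page above.

```python
# Hedged sketch of VQA inference with a Chameleon-based checkpoint via
# Hugging Face transformers. The checkpoint id is a placeholder; swap in the
# MedMax-7B checkpoint listed on the project page.

import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

CHECKPOINT = "facebook/chameleon-7b"  # placeholder: replace with the MedMax-7B id

processor = ChameleonProcessor.from_pretrained(CHECKPOINT)
model = ChameleonForConditionalGeneration.from_pretrained(
    CHECKPOINT, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("chest_xray.png")
prompt = "What abnormality is visible in this image?<image>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```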
December 19, 2024 at 6:19 PM
We also found that there is a general lack of support for multimodal biomedical evaluation. To address this, we create a robust evaluation suite consisting of visual question answering, captioning, generation, and visual chat. We make this suite publicly available.
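As one illustrative piece of such a suite, here is a minimal sketch of a normalized exact-match accuracy metric for closed-ended VQA. This is an assumption about the kind of scoring involved, not the released evaluation code.

```python
# Minimal sketch of a VQA metric: normalized exact-match accuracy.
# Illustrative only; not the paper's exact scoring code.

import re

def normalize(ans: str) -> str:
    """Lowercase, strip punctuation and surrounding whitespace before comparison."""
    return re.sub(r"[^a-z0-9 ]", "", ans.lower()).strip()

def vqa_exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(vqa_exact_match(["Yes", "pleural effusion"], ["yes", "Pneumonia"]))  # 0.5
```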
December 19, 2024 at 6:18 PM
Overall, MedMax covers a breadth of skills and knowledge bases that will be useful for a capable biomedical assistant. We illustrate the diversity of our dataset here:
December 19, 2024 at 6:18 PM
We curate high-quality multimodal biomedical data from medical papers and YouTube to support tasks like image captioning, generation, visual chat, multimodal content creation, and report understanding across biomedical domains.
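To make the data format concrete, here is a hypothetical illustration of what a single interleaved instruction instance could look like. The field names are assumptions for exposition, not the released dataset schema.

```python
# Hypothetical illustration of one MedMax-style instruction instance; the field
# names are assumptions, not the released dataset schema.

example_instance = {
    "task": "visual_chat",        # e.g. captioning, vqa, image_generation, report_understanding
    "domain": "radiology",        # biomedical domain of the instance
    "source": "medical_paper",    # e.g. medical papers, YouTube
    "conversation": [
        {"role": "user",
         "content": [{"type": "image", "path": "images/cxr_0001.png"},
                     {"type": "text", "text": "What abnormality is visible in this chest X-ray?"}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": "There is a right-sided pleural effusion."}]},
    ],
}
```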
December 19, 2024 at 6:18 PM
Firstly, we create MedMax-Instruct, a synthetic dataset that enables interleaved generation across diverse domains (e.g., radiology, histopathology). In particular, we leverage the knowledge in existing image-caption datasets, followed by LLM-based data filtering and generation.
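A minimal sketch of the filter-then-generate idea described above, assuming an OpenAI-style chat client as the LLM; the prompts and model name are illustrative rather than the exact pipeline from the paper.

```python
# Minimal sketch of LLM-based caption filtering followed by instruction
# generation. Prompts and model name are illustrative, not the paper's pipeline.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def keep_caption(caption: str) -> bool:
    """LLM-based filter: keep captions informative enough to seed an instruction."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Answer yes or no: is this biomedical image caption "
                              f"specific and informative?\n\n{caption}"}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")

def caption_to_instruction(caption: str) -> str:
    """LLM-based generation: turn a retained caption into a question-answer pair."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Write a question a clinician might ask about the image "
                              f"described below, then answer it using only the caption.\n\n{caption}"}],
    )
    return reply.choices[0].message.content

captions = ["CT scan showing a 2 cm lesion in the left hepatic lobe."]
pairs = [caption_to_instruction(c) for c in captions if keep_caption(c)]
```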
December 19, 2024 at 6:17 PM
Despite web-scale training, these models underperform in biomedicine due to limited domain knowledge and a weak grasp of user intent.
To solve this, we curate MedMax 🏅, a large-scale biomedical vision-language dataset (1.5M instances, 1.7B tokens) for instruction-tuning a mixed-modal model.
December 19, 2024 at 6:17 PM
Mixed-modal (natively multimodal) models are a new class of generative models that can flexibly integrate information across diverse modalities (e.g., Chameleon, Transfusion, Gemini 2.0). Such models can process and generate interleaved sequences of image 📸 and text ✍️ content.
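A conceptual sketch of that idea: both images and text are mapped into one discrete token stream that a single transformer reads and writes. The toy tokenizers below are stand-ins, not the actual Chameleon tokenizer.

```python
# Conceptual sketch (not the actual Chameleon tokenizer): images and text share
# one discrete token vocabulary, so interleaved content becomes a single stream.

from typing import List, Union

TEXT_VOCAB = 65_536        # assumed text vocabulary size
IMG_CODEBOOK = 8_192       # assumed VQ codebook size for image tokens
IMG_OFFSET = TEXT_VOCAB    # image token ids live after text ids in the shared vocab

def toy_text_tokens(text: str) -> List[int]:
    # Stand-in for a BPE tokenizer: one id per character.
    return [ord(c) % TEXT_VOCAB for c in text]

def toy_image_tokens(image: List[List[int]]) -> List[int]:
    # Stand-in for a VQ image tokenizer: quantize each "pixel" to a codebook id.
    return [IMG_OFFSET + (px % IMG_CODEBOOK) for row in image for px in row]

def build_interleaved_sequence(segments: List[Union[str, List[List[int]]]]) -> List[int]:
    """Flatten a mixed list of text strings and images into one token stream."""
    tokens: List[int] = []
    for seg in segments:
        tokens += toy_text_tokens(seg) if isinstance(seg, str) else toy_image_tokens(seg)
    return tokens

# A tiny fake scan (4 "pixels") interleaved with a caption and a question.
seq = build_interleaved_sequence([
    "Findings: ", [[12, 240], [7, 99]], " Is there a pleural effusion?",
])
print(len(seq), seq[:8])
```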
December 19, 2024 at 6:16 PM