Peking University and the Hong Kong University of Science and Technology have unveiled Align-DS-V, a multimodal version of DeepSeek-R1. Built on the team's self-developed Align-Anything framework, the model marks a significant step in applying deep reasoning to multimodal scenarios, outperforming GPT-4o on select visual understanding benchmarks.
Align-DS-V: A New Standard in Multimodal AI
Align-DS-V analyzes and reasons across multiple modalities, including text, images, audio, and video. A striking example: asked to pick the most suitable drink for weight loss from a photo of several drinks, the model accurately described each drink in the image, pinpointed "low-sugar original soy milk" as the optimal choice, and explained how that choice aligns with dietary goals.
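To make the workflow concrete, here is a minimal sketch of such a text-plus-image query. It assumes the released checkpoint follows the standard Hugging Face vision-to-text interface; the model ID, image filename, and prompt wording are assumptions, not confirmed details from the release.

```python
# Minimal sketch of a text+image query. Assumes the checkpoint exposes the
# standard Hugging Face vision-to-text interface; the model ID below is an
# assumption, not a confirmed repository name.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "PKU-Alignment/Align-DS-V"  # assumed model ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("drinks.jpg")  # photo of several bottled drinks
question = "Which of these drinks is best suited for weight loss, and why?"

# Encode the mixed-modality input, generate, and decode the reasoning chain.
inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```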
Enhancing Text Reasoning Through Modal Penetration
One of the most compelling outcomes of this research is the discovery of “modal penetration.” During the process of transforming DeepSeek-R1 into a multimodal model, researchers observed a substantial improvement in the model’s traditional text-based reasoning capabilities. For instance, on the ARC-Challenge (5-shot), Align-DS-V improved its score from 21.4 (single modality) to 40.5 (multimodal). This leap demonstrates how integrating multimodal training can expand an AI’s reasoning boundaries and enhance its performance in scientific, mathematical, and complex reasoning tasks.
The Align-Anything Framework: A Modular Approach to Multimodal AI
The Align-Anything framework, underlying Align-DS-V, is a highly modular, scalable, and user-friendly tool for training multimodal AI. It supports fine-tuning across various modalities, including text-to-image, text-to-video, and beyond. Key features include:
- Modularity: Flexible APIs for customization and advanced extensions.
- Cross-modal Fine-tuning: Fine-tune large models across multiple modalities.
- Alignment with Human Intentions: Tailored to align AI with human values and preferences across diverse scenarios.
This framework is designed to address the challenges of aligning multimodal AI with human intentions, including capturing complex hierarchical preferences and managing hallucination phenomena in expanded input/output spaces.
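As an illustration of what "aligning with human preferences" means in training terms, the sketch below implements Direct Preference Optimization (DPO), one common alignment objective that frameworks of this kind support. This is a generic, self-contained illustration, not the Align-Anything API; the function name and toy values are assumptions.

```python
# Generic sketch of the DPO preference-alignment objective, one common
# technique in multimodal alignment frameworks. Not the Align-Anything API.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over log-probabilities of chosen vs. rejected responses
    under the trained policy and a frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Push the policy to widen the chosen/rejected margin relative to the
    # reference model; beta controls how strongly preferences are enforced.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with dummy log-probabilities (one value per preference pair):
lp_c = torch.tensor([-12.3, -9.8]); lp_r = torch.tensor([-13.1, -9.5])
rp_c = torch.tensor([-12.9, -10.0]); rp_r = torch.tensor([-13.0, -9.9])
print(dpo_loss(lp_c, lp_r, rp_c, rp_r))
```

The same objective applies whether the chosen/rejected responses are text answers, image captions, or other modality outputs, which is what makes preference alignment portable across the modalities the framework supports.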
Real-world Applications and Localization
Align-DS-V has been localized for Hong Kong: it processes mixed-language input (Cantonese, English, and Mandarin) and applies its reasoning to local scenarios such as transportation updates, weather warnings, and payment systems. In educational contexts, it produces clear, step-by-step solutions to complex problems, showcasing its potential for teaching and learning applications.
Open Sourcing and Future Development
Align-DS-V and the Align-Anything framework have been open-sourced to foster collaboration and innovation in the AI community. Spearheaded by the Peking University Alignment Team and the Hong Kong Generative AI R&D Center, these advancements aim to drive the development of Hong Kong’s AI ecosystem.
Looking ahead, the Peking University-Lingchu Joint Laboratory is exploring the integration of Align-DS-V into Vision Language Action (VLA) models, which map visual observations and language instructions to robot actions, pushing the boundaries of cross-modal reasoning and control systems.
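The following sketch shows the general shape of a VLA control loop: a multimodal model emits discrete action tokens that are decoded into robot commands. Every name here (the `Action` type, the binning scheme, the `model` callable) is a hypothetical illustration of the common VLA pattern, not a description of the laboratory's actual design.

```python
# Hedged sketch of a Vision-Language-Action (VLA) control loop. All names
# and the binning scheme are hypothetical illustrations of the common VLA
# pattern, not the actual Align-DS-V integration.
from dataclasses import dataclass

@dataclass
class Action:
    """A simple end-effector command: position delta plus gripper state."""
    dx: float
    dy: float
    dz: float
    gripper_open: bool

def decode_action(tokens: list[int]) -> Action:
    # Hypothetical decoding: each token indexes a bin in a discretized
    # action space, mapping token values 0..255 to deltas of -0.1..0.1 m.
    bins = [(t / 255.0) * 0.2 - 0.1 for t in tokens[:3]]
    return Action(dx=bins[0], dy=bins[1], dz=bins[2],
                  gripper_open=tokens[3] > 127)

def control_step(model, camera_image, instruction: str) -> Action:
    # model(...) stands in for a multimodal forward pass that returns
    # action tokens conditioned on the image and the instruction.
    tokens = model(image=camera_image, text=instruction)
    return decode_action(tokens)
```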
A New Era for Multimodal AI
The release of Align-DS-V signals a transformative moment in artificial intelligence, where reasoning is no longer confined to text but extends deeply across all modalities. By integrating world knowledge and advancing cross-modal fusion, Align-DS-V sets a new standard for multimodal AI, paving the way for innovative applications and reshaping the future of AI reasoning.
For more information and access to the open-source framework, visit the official release pages provided by Peking University and HKUST.
