Alibaba Cloud has announced the release of two open-source large vision-language models designed to comprehend both images and text. The models, named Qwen-VL and Qwen-VL-Chat, are now available for download on ModelScope, Alibaba Cloud's AI model community, as well as on the collaborative AI platform Hugging Face.
Both Qwen-VL and Qwen-VL-Chat accept input in English and Chinese. They can perform a range of visual tasks, including answering open-ended questions about multiple images and generating descriptive captions for images. Qwen-VL-Chat goes a step further, handling more complex tasks such as mathematical calculations from visual input and crafting narratives around multiple images.
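As a rough illustration of how such a model might be queried, here is a minimal sketch of visual question answering with Qwen-VL-Chat through Hugging Face transformers. It follows the usage pattern published on the model's card, where the chat helpers ship with the model repository and are loaded via trust_remote_code; the image path used here is a placeholder, and exact helper names should be verified against the current model card.

```python
# Minimal sketch: asking Qwen-VL-Chat a question about an image.
# The from_list_format and chat helpers come from the model repo itself,
# hence trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a mixed image-and-text query; "demo.jpeg" is a placeholder path.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},
    {"text": "Describe this image in one sentence."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```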
Both models are built on Qwen-7B, the 7-billion-parameter large language model that Alibaba Cloud recently open-sourced. According to Alibaba Cloud, a key advantage of Qwen-VL is that it comprehends images at higher resolutions than other open-source large vision-language models, which translates into better image recognition and understanding.
This release underscores Alibaba Cloud's commitment to advancing the multimodal capabilities of its large language models, which allow them to process diverse types of data, including text, images, and audio. Integrating multiple sensory inputs opens up new avenues for research and commercial applications.
The potential applications of these models are promising. They could change how users interact with visual content: researchers and businesses could use them to automatically generate captions for images in news articles, or to help people who cannot read Chinese understand street signs.
One particularly notable use case is improving accessibility for visually impaired users. Alibaba's Taobao online marketplace previously introduced Optical Character Recognition (OCR) technology to help visually impaired shoppers read text embedded in images, such as product descriptions. The new vision-language models simplify this further by letting visually impaired users hold multi-round conversations with the model to extract information from images.
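Continuing the sketch above, the chat interface threads conversation state through a history object returned by each call, which is what enables the multi-round question-and-answer flow described here. Again, this assumes the chat helper documented on the model card; the example questions are illustrative.

```python
# Follow-up turns reuse the history from the previous call, so a user can
# keep probing the same image, e.g. a product photo with embedded text.
response, history = model.chat(
    tokenizer,
    query="What text appears in the image?",
    history=history,
)
print(response)

response, history = model.chat(
    tokenizer,
    query="Summarize the product details it describes.",
    history=history,
)
print(response)
```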
In the past month alone, Alibaba Cloud's pre-trained 7-billion-parameter large language model Qwen-7B and its conversationally fine-tuned variant Qwen-7B-Chat have together garnered more than 400,000 downloads. The models were open-sourced to help developers, researchers, and businesses build their own cost-effective generative AI models.
(Source: Alizila)