Parsing online image content with qwen2.5-vl

dgiij

742 views · 2025-07-27 08:54:44

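I downloaded the qwen2.5-vl vision model with ollama and tried it out. Its image understanding turned out to be quite good; thanks to the open-source contributors for their hard work.
As we know, calling this vision model through the API places some restrictions on images. First, images can only be passed as base64-encoded data. Second, image dimensions are limited: qwen2.5-vl is deployed with a max_pixels setting, and the width and height of an image must stay within that limit. Larger images are also slower to parse and consume more tokens.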
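The size constraint can be turned into a small calculation. The sketch below scales a width and height down so that their product fits a pixel budget while keeping the aspect ratio. The budget of 800 × 600 is a hypothetical value, since the actual max_pixels depends on the deployment, and `fitWithinPixels` and `MAX_PIXELS` are names made up for illustration.

```javascript
// Hypothetical pixel budget; the real max_pixels depends on how the model was deployed.
const MAX_PIXELS = 800 * 600;

// Return dimensions that keep the aspect ratio and stay within maxPixels.
// Images already inside the budget are returned unchanged.
function fitWithinPixels(width, height, maxPixels = MAX_PIXELS) {
  const pixels = width * height;
  if (pixels <= maxPixels) return { width, height };
  // Both sides shrink by the same factor, so the area shrinks by scale squared.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}

console.log(fitWithinPixels(1600, 1200)); // { width: 800, height: 600 }
console.log(fitWithinPixels(640, 480));   // { width: 640, height: 480 }
```

Rounding down with `Math.floor` keeps the result at or under the budget, at the cost of losing at most one pixel per side.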
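So how do we parse the content of an online image? First the image must be shrunk to fit within the limit, then base64-encoded. For efficiency, nothing is written to a local file; all conversions happen in memory buffers.
I prefer to write code in Node.js; here is the implementation: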

// Requires Node.js 18+ (built-in fetch) and the sharp package: npm install sharp
const sharp = require('sharp');

// Maximum size of the resized image; adjust to the max_pixels limit of your deployment
const width = 800;
const height = 600;

async function getimgcontent(imgurl,question) {
	try {
		const response = await fetch(imgurl);
		if (!response.ok) throw new Error(`HTTP ${response.status}`);
		const buffer = Buffer.from(await response.arrayBuffer());
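		// fit: 'inside' keeps the aspect ratio (the default 'cover' would crop to exactly 800x600);
		// withoutEnlargement leaves images that are already small enough untouched
		const resizedImageBuffer = await sharp(buffer).resize(width, height, { fit: 'inside', withoutEnlargement: true }).toBuffer();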
		const base64Image = resizedImageBuffer.toString('base64');

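		// Send the base64-encoded image and the question to the local Ollama server
		const ollamaResponse = await fetch("http://127.0.0.1:11434/api/generate", {
			method: "POST",
			headers: { "Content-Type": "application/json" },
			body: JSON.stringify({
				"model": "qwen2.5vl:7b",
				"prompt": question,
				"stream": false,
				"images": [base64Image]
				})
			});
		if (!ollamaResponse.ok) throw new Error(`Ollama HTTP ${ollamaResponse.status}`);
		const result = (await ollamaResponse.json()).response;
		return result;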
		} catch (error) {
			console.error('Failed to parse image content:', error);
			throw error;
			}
	}

(async () => {
	try {
		// yourimgurl is a placeholder: set it to the URL of the image to analyze
		const result = await getimgcontent(yourimgurl, "What is in this image?");
		console.log(result);
		} catch(error) { console.error(error); }
	})();
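
The in-memory encoding step can also be checked on its own, without a network call or a running model. The following is a minimal sketch, assuming the same request fields used in the function above; `buildGenerateBody` is a hypothetical helper written for illustration, not part of Ollama or sharp.

```javascript
// Build the JSON body for the /api/generate request from an in-memory image buffer.
// Nothing is written to disk: the Buffer is converted straight to a base64 string.
function buildGenerateBody(imageBuffer, question, model = 'qwen2.5vl:7b') {
  if (!Buffer.isBuffer(imageBuffer) || imageBuffer.length === 0) {
    throw new TypeError('imageBuffer must be a non-empty Buffer');
  }
  return JSON.stringify({
    model,
    prompt: question,
    stream: false,
    images: [imageBuffer.toString('base64')],
  });
}

// A short text buffer stands in for real image bytes here.
const sampleBody = buildGenerateBody(Buffer.from('hello'), 'What is in this image?');
console.log(JSON.parse(sampleBody).images[0]); // aGVsbG8=
```

Keeping the body construction in a separate function makes it easy to verify the payload before pointing it at a real server.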