Spring AI (9): Multimodality

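First, what is multimodality? Think about how humans learn: through sight, hearing, and touch. Sight is the most important of these. Can a machine see? Of course it can. Put simply, multimodality means letting AI see, hear, and touch.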

Humans process knowledge simultaneously across multiple modes of data input. The way we learn and our experiences are all multimodal. We don’t have just vision, just audio, or just text.

These foundational principles of learning were articulated by the father of modern education, John Amos Comenius, in his work "Orbis Sensualium Pictus", dating back to 1658.

"All things that are naturally connected ought to be taught in combination"

Multimodality API

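Taking OpenAI as an example, few models support multimodality so far, essentially only the latest gpt-4-vision-preview and gpt-4o. The official documentation explains this in detail.

Here are the two official examples: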

byte[] imageData = new ClassPathResource("/multimodal.test.png").getContentAsByteArray();

var userMessage = new UserMessage("Explain what do you see on this picture?",
        List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageData)));

ChatResponse response = chatModel.call(new Prompt(List.of(userMessage),
        OpenAiChatOptions.builder().withModel(OpenAiApi.ChatModel.GPT_4_VISION_PREVIEW.getValue()).build()));

var userMessage = new UserMessage("Explain what do you see on this picture?",
        List.of(new Media(MimeTypeUtils.IMAGE_PNG,
                "https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/_images/multimodal.test.png")));

ChatResponse response = chatModel.call(new Prompt(List.of(userMessage),
        OpenAiChatOptions.builder().withModel(OpenAiApi.ChatModel.GPT_4_O.getValue()).build()));

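It is easy to see that the media is attached to the UserMessage, which represents the user's input. So far we have only used text input; now we know it also accepts files.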
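To make this concrete, here is a minimal plain-Java sketch of what a text-plus-image user turn looks like in OpenAI's Chat Completions JSON, where the content field becomes a list of typed parts. The class UserTurnSketch and its helpers are hypothetical and exist only for illustration. This is not Spring AI's code, and how Spring AI serializes a UserMessage internally is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

public class UserTurnSketch {

    // Escape backslashes and double quotes so the value is a valid JSON string
    // for this small demo (control characters are not handled).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Build one user message whose content is a text part followed by image parts.
    static String userTurn(String text, List<String> imageUrls) {
        List<String> parts = new ArrayList<>();
        parts.add("{\"type\":\"text\",\"text\":" + quote(text) + "}");
        for (String url : imageUrls) {
            parts.add("{\"type\":\"image_url\",\"image_url\":{\"url\":" + quote(url) + "}}");
        }
        return "{\"role\":\"user\",\"content\":[" + String.join(",", parts) + "]}";
    }

    public static void main(String[] args) {
        // The URL is a placeholder, not a real image.
        System.out.println(userTurn("What is in this picture?",
                List.of("https://example.com/cat.png")));
    }
}
```

The point of the sketch is only that text and media travel together in a single user turn, which is what UserMessage models on the Spring AI side.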

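Note the new Media() constructor: the two examples above pass it two different kinds of argument. The first passes a byte[] array, the second a URL.

In fact, this constructor accepts three kinds of argument.

The first form, taking a byte[] array, is already flagged as deprecated, however.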
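The difference between the two forms is where the image lives: a URL points at a remote image, while raw bytes have to travel inside the request itself. OpenAI's image input accepts inline data as a base64 data URL, so a rough sketch of that conversion looks like the following. DataUrlSketch and toDataUrl are hypothetical names, and whether Spring AI performs exactly this step internally is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DataUrlSketch {

    // Turn raw bytes into a data URL of the form data:<mime>;base64,<payload>.
    static String toDataUrl(String mimeType, byte[] data) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real caller would read the image file instead.
        byte[] fake = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(toDataUrl("image/png", fake));
    }
}
```

Base64 inflates the payload by roughly a third, which is one practical reason to prefer a URL when the image is already hosted somewhere.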

Now that the principle is clear, all that remains is to adapt the streaming-output code from earlier:

/**
 * @author LiZhiAo
 * @date 2024/6/24 16:09
 */

@RestController
@RequestMapping("/multi")
@RequiredArgsConstructor
@CrossOrigin
public class MultiController {

    private final OpenAiConfig openAiConfig;

    private static final Integer MAX_MESSAGE = 10;

    private static Map<String, List<Message>> chatMessage = new ConcurrentHashMap<>();

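    /**
     * Builds the message list sent to the model.
     * @param id      conversation id
     * @param message text entered by the user
     * @param file    optional uploaded file, may be null
     * @return the conversation history including the new user message
     */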
    @SneakyThrows
    private List<Message> getMessages(String id, String message, MultipartFile file) {
        String systemPrompt = "{prompt}";
        SystemPromptTemplate systemPromptTemplate = new SystemPromptTemplate(systemPrompt);

        Message userMessage;
        if (file == null || file.isEmpty()) {
            // No usable upload: fall back to a text-only message
            userMessage = new UserMessage(message);
        } else {
            userMessage = new UserMessage(message, List.of(new Media(MimeType.valueOf(file.getContentType()), file.getResource())));
        }

        Message systemMessage = systemPromptTemplate.createMessage(MapUtil.of("prompt", "you are a helpful AI assistant"));

        List<Message> messages = chatMessage.get(id);


        // No history for this id yet: start a new list with the system prompt and the user's message
        if (messages == null){
            messages = new ArrayList<>();
            messages.add(systemMessage);
            messages.add(userMessage);
        } else {
            messages.add(userMessage);
        }

        return messages;
    }

    /**
     * Creates a connection (returns a new conversation id)
     */
    @SneakyThrows
    @GetMapping("/init/{message}")
    public String init() {
        return String.valueOf(UUID.randomUUID());
    }

    @PostMapping("/chat/{id}/{message}")
    @SneakyThrows
    public SseEmitter chat(@PathVariable String id, @PathVariable String message,
                       HttpServletResponse response, @RequestParam(value = "file", required = false) MultipartFile file ){

        response.setHeader("Content-type", "text/html;charset=UTF-8");
        response.setCharacterEncoding("UTF-8");

        SseEmitter emitter = SseEmitterUtils.connect(id);
        List<Message> messages = getMessages(id, message, file);
        System.err.println("chatMessage size: " + messages.size());
        System.err.println("chatMessage: " + chatMessage);


        if (messages.size() > MAX_MESSAGE){
            SseEmitterUtils.sendMessage(id, "Too many messages in this conversation, please try again later 🤔");
        } else {
            // Get the model's output stream
            Flux<ChatResponse> stream = openAiConfig.openAiChatClient().prompt(new Prompt(messages)).stream().chatResponse();

            // Send each message in the stream over SSE
            Mono<String> result = stream
                    .flatMap(it -> {
                        StringBuilder sb = new StringBuilder();
                        System.err.println(it.getResult().getOutput().getContent());
                        Optional.ofNullable(it.getResult().getOutput().getContent()).ifPresent(content -> {
                            SseEmitterUtils.sendMessage(id, content);
                            sb.append(content);
                        });

                        return Mono.just(sb.toString());
                    })
                    // Concatenate the chunks into one string
                    .reduce((a, b) -> a + b)
                    .defaultIfEmpty("");

            // Once the stream completes, append the full reply to the history as an AssistantMessage
            result.subscribe(finalContent -> messages.add(new AssistantMessage(finalContent)));

            // Store the conversation in chatMessage
            chatMessage.put(id, messages);

        }
        return emitter;

    }
}
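The history handling in the controller can be looked at in isolation. Below is a minimal sketch that uses plain strings in place of Spring AI Message objects; HistorySketch and addUserMessage are hypothetical names. Note that it is a variant, not a copy: it checks the cap before appending, whereas the controller appends first and checks afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HistorySketch {

    static final int MAX_MESSAGES = 10;

    // One message list per conversation id, as in the controller's chatMessage map.
    static final Map<String, List<String>> HISTORY = new ConcurrentHashMap<>();

    // Append a user message; the first call for an id also seeds the system prompt.
    // Returns false and leaves the history untouched once the cap is reached.
    static boolean addUserMessage(String id, String message) {
        List<String> messages = HISTORY.computeIfAbsent(id, key -> {
            List<String> fresh = new ArrayList<>();
            fresh.add("system: you are a helpful AI assistant");
            return fresh;
        });
        // ArrayList is not thread-safe, so guard the check-then-add per conversation.
        synchronized (messages) {
            if (messages.size() >= MAX_MESSAGES) {
                return false;
            }
            messages.add("user: " + message);
            return true;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 12; i++) {
            System.out.println(i + " accepted: " + addUserMessage("demo", "message " + i));
        }
    }
}
```

Checking before appending keeps a rejected request from growing the stored list, at the cost of behaving slightly differently from the code above.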

The final result looks like this:

I couldn't be bothered to rework the front end again, so this will have to do 🥱