spring AI (九) 多模態

首先解釋下什麼是多模態：想像一下人類的學習方式，有視覺，聽覺，觸覺。這其中最重要的就是視覺，機器可以看到嗎，當然可以。那么淺顯點說，讓 AI 看到，聽到，摸到就是多模態

Humans process knowledge, simultaneously across multiple modes of data inputs. The way we learn, our experiences are all multimodal. We don’t have just vision, just audio and just text.
人類同時跨多種數據輸入模式處理知識。我們學習的方式、我們的經歷都是多模式的。我們不僅有視覺，還有音頻和文本。

These foundational principles of learning were articulated by the father of modern education John Amos Comenius, in his work, "Orbis Sensualium Pictus", dating back to 1658.
現代教育之父約翰・阿莫斯・夸美紐斯 (John Amos Comenius) 在其 1658 年的著作《Orbis Sensualium Pictus》中闡明了這些學習的基本原則。

"All things that are naturally connected ought to be taught in combination"
“所有自然相關的事物都應該組合起來教授”

Multimodality API#

以 OpenAI 為例，現在支持多模態的模型還不多，基本只有最新的 gpt-4-visual-preview 和 gpt-4o 詳細的解釋在官方文檔

以下是官方的 2 種示例：

byte[] imageData = new ClassPathResource("/multimodal.test.png").getContentAsByteArray();

var userMessage = new UserMessage("Explain what do you see on this picture?",
        List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageData)));

ChatResponse response = chatModel.call(new Prompt(List.of(userMessage),
        OpenAiChatOptions.builder().withModel(OpenAiApi.ChatModel.GPT_4_VISION_PREVIEW.getValue()).build()));

var userMessage = new UserMessage("Explain what do you see on this picture?",
        List.of(new Media(MimeTypeUtils.IMAGE_PNG,
                "https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/_images/multimodal.test.png")));

ChatResponse response = chatModel.call(new Prompt(List.of(userMessage),
        OpenAiChatOptions.builder().withModel(OpenAiApi.ChatModel.GPT_4_O.getValue()).build()));

不難看出，媒體信息被設置在了 UserMessage 中，這代表的是用戶的輸入，我們之前只使用了文本的輸入，現在知道了它還支持文件的輸入。

其中還有 new Media() 方法，上面 2 中方式傳遞了 2 種不同的參數，第一個傳的是 byte[] 陣列，第二個傳的是 URL 鏈接。

事實上這個方法支持 3 種參數

不過第一種 byte[] 陣列的方式已經被提示過期了。

知道原理了，接著把之前流式輸出的代碼改一下就行了：

/**
 * @author LiZhiAo
 * @date 2024/6/24 16:09
 */

@RestController
@RequestMapping("/multi")
@RequiredArgsConstructor
@CrossOrigin
public class MultiController {

    private final OpenAiConfig openAiConfig;

    private static final Integer MAX_MESSAGE = 10;

    private static Map<String, List<Message>> chatMessage = new ConcurrentHashMap<>();

    /**
     * 返回提示詞
     * @param message 用戶輸入的消息
     * @return Prompt
     */
    @SneakyThrows
    private List<Message> getMessages(String id, String message, MultipartFile file) {
        String systemPrompt = "{prompt}";
        SystemPromptTemplate systemPromptTemplate = new SystemPromptTemplate(systemPrompt);

        Message userMessage = null;
        if (file == null){
             userMessage = new UserMessage(message);
        }else if (!file.isEmpty()){
            userMessage = new UserMessage(message, List.of(new Media(MimeType.valueOf(file.getContentType()), file.getResource())));
        }

        Message systemMessage = systemPromptTemplate.createMessage(MapUtil.of("prompt", "you are a helpful AI assistant"));

        List<Message> messages = chatMessage.get(id);


        // 如果未獲取到消息，則創建新的消息並將系統提示和用戶輸入的消息添加到消息列表中
        if (messages == null){
            messages = new ArrayList<>();
            messages.add(systemMessage);
            messages.add(userMessage);
        } else {
            messages.add(userMessage);
        }

        return messages;
    }

    /**
     * 創建連接
     */
    @SneakyThrows
    @GetMapping("/init/{message}")
    public String init() {
        return String.valueOf(UUID.randomUUID());
    }

    @PostMapping("/chat/{id}/{message}")
    @SneakyThrows
    public SseEmitter chat(@PathVariable String id, @PathVariable String message,
                       HttpServletResponse response, @RequestParam(value = "file", required = false) MultipartFile file ){

        response.setHeader("Content-type", "text/html;charset=UTF-8");
        response.setCharacterEncoding("UTF-8");

        SseEmitter emitter = SseEmitterUtils.connect(id);
        List<Message> messages = getMessages(id, message, file);
        System.err.println("chatMessage大小: " + messages.size());
        System.err.println("chatMessage: " + chatMessage);


        if (messages.size() > MAX_MESSAGE){
            SseEmitterUtils.sendMessage(id, "對話次數過多，請稍後重試🤔");
        } else {
            // 獲取模型的輸出流
            Flux<ChatResponse> stream = openAiConfig.openAiChatClient().prompt(new Prompt(messages)).stream().chatResponse();

            // 把流裡面的消息使用SSE發送
            Mono<String> result = stream
                    .flatMap(it -> {
                        StringBuilder sb = new StringBuilder();
                        System.err.println(it.getResult().getOutput().getContent());
                        Optional.ofNullable(it.getResult().getOutput().getContent()).ifPresent(content -> {
                            SseEmitterUtils.sendMessage(id, content);
                            sb.append(content);
                        });

                        return Mono.just(sb.toString());
                    })
                    // 將消息拼接成字符串
                    .reduce((a, b) -> a + b)
                    .defaultIfEmpty("");

            // 將消息存儲到chatMessage中的AssistantMessage
            result.subscribe(finalContent -> messages.add(new AssistantMessage(finalContent)));

            // 將消息存儲到chatMessage中
            chatMessage.put(id, messages);

        }
        return emitter;

    }
}

最後結果如下：

前端懶得再改了，就這樣吧🥱