Spring AI (9) Multimodal

First, let's explain what multimodality is: think about how humans learn, through the visual, auditory, and tactile senses, with vision being the most important. Can machines see? Of course they can. So, simply put, multimodality means letting AI see, hear, and touch as well.

Humans process knowledge, simultaneously across multiple modes of data inputs. The way we learn, our experiences are all multimodal. We don’t have just vision, just audio, and just text.

These foundational principles of learning were articulated by the father of modern education John Amos Comenius, in his work, "Orbis Sensualium Pictus", dating back to 1658.

image

"All things that are naturally connected ought to be taught in combination"

Multimodality API

Taking OpenAI as an example, not many models support multimodality at the moment; essentially only the recent gpt-4-vision-preview and gpt-4o do. Detailed explanations can be found in the official documentation.

Here are two official examples:

// Example 1: the image is passed as a byte[] array, using the gpt-4-vision-preview model
byte[] imageData = new ClassPathResource("/multimodal.test.png").getContentAsByteArray();

var userMessage = new UserMessage("Explain what do you see on this picture?",
        List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageData)));

ChatResponse response = chatModel.call(new Prompt(List.of(userMessage),
        OpenAiChatOptions.builder().withModel(OpenAiApi.ChatModel.GPT_4_VISION_PREVIEW.getValue()).build()));

// Example 2: the image is passed as a URL, using the gpt-4o model
var userMessage = new UserMessage("Explain what do you see on this picture?",
        List.of(new Media(MimeTypeUtils.IMAGE_PNG,
                "https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/_images/multimodal.test.png")));

ChatResponse response = chatModel.call(new Prompt(List.of(userMessage),
        OpenAiChatOptions.builder().withModel(OpenAiApi.ChatModel.GPT_4_O.getValue()).build()));

It is not difficult to see that the media is attached to the UserMessage, which represents the user's input. Previously we only passed in text, but now we can see that files are supported as well.

There is also the Media constructor, which is given two different kinds of data in the two examples above: the first passes a byte[] array, the second a URL string.

In fact, the constructor accepts three kinds of data.

image

However, the byte[] variant has been marked as deprecated.
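
Since the byte[] variant is deprecated, the examples above leave the URL form, and there is also a variant that takes a Spring Resource, which is what the controller below relies on via file.getResource(). Here is a minimal sketch (not from the official docs) of the Resource form, reusing the classpath image from the official example:

// Pass the image as a Spring Resource instead of the deprecated byte[] array
var imageResource = new ClassPathResource("/multimodal.test.png");

var userMessage = new UserMessage("Explain what do you see on this picture?",
        List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageResource)));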

Now that we understand how it works, let's modify the streaming output code from the earlier post:

@RestController
@RequestMapping("/multi")
@RequiredArgsConstructor
@CrossOrigin
public class MultiController {

    private final OpenAiConfig openAiConfig;

    // Maximum number of messages kept per conversation
    private static final Integer MAX_MESSAGE = 10;

    // Conversation history, keyed by session id
    private static final Map<String, List<Message>> chatMessage = new ConcurrentHashMap<>();

    @SneakyThrows
    private List<Message> getMessages(String id, String message, MultipartFile file) {
        String systemPrompt = "{prompt}";
        SystemPromptTemplate systemPromptTemplate = new SystemPromptTemplate(systemPrompt);

        Message userMessage;
        if (file == null || file.isEmpty()) {
            // Text-only input
            userMessage = new UserMessage(message);
        } else {
            // Attach the uploaded file as media, using the content type it was submitted with
            userMessage = new UserMessage(message, List.of(new Media(MimeType.valueOf(file.getContentType()), file.getResource())));
        }

        Message systemMessage = systemPromptTemplate.createMessage(MapUtil.of("prompt", "you are a helpful AI assistant"));

        List<Message> messages = chatMessage.get(id);

        if (messages == null){
            messages = new ArrayList<>();
            messages.add(systemMessage);
            messages.add(userMessage);
        } else {
            messages.add(userMessage);
        }

        return messages;
    }

    @SneakyThrows
    @GetMapping("/init/{message}")
    public String init() {
        return String.valueOf(UUID.randomUUID());
    }

    @PostMapping("/chat/{id}/{message}")
    @SneakyThrows
    public SseEmitter chat(@PathVariable String id, @PathVariable String message,
                       HttpServletResponse response, @RequestParam(value = "file", required = false) MultipartFile file ){

        response.setHeader("Content-type", "text/html;charset=UTF-8");
        response.setCharacterEncoding("UTF-8");

        SseEmitter emitter = SseEmitterUtils.connect(id);
        List<Message> messages = getMessages(id, message, file);
        System.err.println("chatMessage size: " + messages.size());
        System.err.println("chatMessage: " + chatMessage);

        if (messages.size() > MAX_MESSAGE){
            SseEmitterUtils.sendMessage(id, "Too many conversations, please try again later🤔");
        } else {
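            // Stream the model's reply and forward each chunk to the client over SSE as it arrives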
            Flux<ChatResponse> stream = openAiConfig.openAiChatClient().prompt(new Prompt(messages)).stream().chatResponse();

            Mono<String> result = stream
                    .flatMap(it -> {
                        StringBuilder sb = new StringBuilder();
                        System.err.println(it.getResult().getOutput().getContent());
                        Optional.ofNullable(it.getResult().getOutput().getContent()).ifPresent(content -> {
                            SseEmitterUtils.sendMessage(id, content);
                            sb.append(content);
                        });

                        return Mono.just(sb.toString());
                    })
                    .reduce((a, b) -> a + b)
                    .defaultIfEmpty("");

            // Once the stream completes, append the full assistant reply to the conversation history
            result.subscribe(finalContent -> messages.add(new AssistantMessage(finalContent)));

            chatMessage.put(id, messages);

        }
        return emitter;

    }
}
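
To exercise the endpoint end to end, a client first calls /multi/init to obtain a conversation id and then POSTs the question together with the file to /multi/chat, reading the streamed reply as it arrives. Below is a minimal, hypothetical client sketch using Spring's WebClient; it assumes the application is running on localhost:8080 with spring-webflux on the client's classpath, and the file name cat.png and the question text are placeholders:

import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.http.client.MultipartBodyBuilder;
import org.springframework.web.reactive.function.BodyInserters;
import org.springframework.web.reactive.function.client.WebClient;

public class MultiChatClientDemo {

    public static void main(String[] args) {
        WebClient client = WebClient.create("http://localhost:8080");

        // 1. Obtain a conversation id from the init endpoint
        String id = client.get().uri("/multi/init")
                .retrieve()
                .bodyToMono(String.class)
                .block();

        // 2. Send the question plus an image as multipart form data
        //    and print each chunk of the streamed reply as it arrives
        MultipartBodyBuilder body = new MultipartBodyBuilder();
        body.part("file", new FileSystemResource("cat.png")); // hypothetical local image

        client.post()
                .uri("/multi/chat/{id}/{message}", id, "What is in this picture?")
                .contentType(MediaType.MULTIPART_FORM_DATA)
                .body(BodyInserters.fromMultipartData(body.build()))
                .retrieve()
                .bodyToFlux(String.class)
                .doOnNext(System.out::println)
                .blockLast();
    }
}

The doOnNext callback just prints each decoded chunk; a real front end would append the chunks to the page incrementally instead.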

The final result is as follows:

image

image

I'm too lazy to make further changes to the front end, so let's leave it like this🥱
