Spring AI (version 1.0) and qwen2

After a period away from it, I found that Spring AI has reached version 1.0, which differs significantly from 0.8.1. Having some free time recently, I decided to experiment with it.


Project Configuration#

There are two ways to set up the project: one is to select the corresponding dependencies directly when creating the project.


The other is to configure it manually by adding the following repositories to the Maven pom.xml:

  <repositories>
    <repository>
      <id>spring-milestones</id>
      <name>Spring Milestones</name>
      <url>https://repo.spring.io/milestone</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>spring-snapshots</id>
      <name>Spring Snapshots</name>
      <url>https://repo.spring.io/snapshot</url>
      <releases>
        <enabled>false</enabled>
      </releases>
    </repository>
  </repositories>

Next, add the dependency management section:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0-SNAPSHOT</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

Finally, add the dependencies for the corresponding large language models:

<!-- OpenAI dependency -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

<!-- Ollama dependency -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>

Then, write the configuration file:

spring:
  ai:
    ollama:
      base-url: http://127.0.0.1:11434/
      chat:
        model: qwen2:7b
    openai:
      base-url: https://xxx
      api-key: sk-xxx
      chat:
        options:
          model: gpt-3.5-turbo
server:
  port: 8868

The configuration file above sets up two models, one for Ollama and one for OpenAI. Other models can be configured by referring to the documentation.

Invocation#

In version 1.0, the invocation method has changed, mainly due to changes in the instantiated objects. The latest version introduces a Chat Client API, while the Chat Model API from the previous version is still available. Their differences are as follows:

| API | Scope | Function |
| --- | --- | --- |
| Chat Client API | Suited to a single model and globally unique; configuring multiple models may cause conflicts | Top-level abstraction that can call all models, making it easy to switch between them quickly |
| Chat Model API | One instance per model; each model is unique | Each model has its own specific implementation |

Chat Client#

Since the Chat Client is globally unique by default, only a single model can be configured in the configuration file; otherwise, conflicts will occur during bean initialization. Here is the official example code:

@RestController
class MyController {

    private final ChatClient chatClient;

    public MyController(ChatClient.Builder chatClientBuilder) {
        this.chatClient = chatClientBuilder.build();
    }

    @GetMapping("/ai")
    String generation(String userInput) {
        return this.chatClient.prompt()
            .user(userInput)
            .call()
            .content();
    }
}
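The fluent API also supports streaming. As a minimal sketch (the /ai/stream path is my own example, and it assumes Reactor's Flux and Spring's MediaType are available), stream().content() returns the tokens as a Flux<String>:

    @GetMapping(value = "/ai/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    Flux<String> generationStream(String userInput) {
        // stream() returns the response incrementally instead of blocking on call()
        return this.chatClient.prompt()
                .user(userInput)
                .stream()
                .content();
    }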

Additionally, you can specify some default parameters for the model when creating it.

Create a configuration class:

@Configuration
class Config {

    @Bean
    ChatClient chatClient(ChatClient.Builder builder) {
        return builder.defaultSystem("You are a friendly chat bot that answers questions in the voice of a Pirate")
                .build();
    }

}

To use it, inject the ChatClient bean (for example with @Autowired or constructor injection), as sketched below.
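A minimal usage sketch of the bean defined above (the controller name and mapping are made up for illustration; Lombok's @RequiredArgsConstructor handles the constructor injection, as elsewhere in this post):

@RestController
@RequiredArgsConstructor
class PirateController {

    // the ChatClient bean defined in the Config class above
    private final ChatClient chatClient;

    @GetMapping("/pirate")
    String chat(String userInput) {
        return chatClient.prompt()
                .user(userInput)
                .call()
                .content();
    }
}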

To use multi-model configuration, you need to disable the automatic configuration of ChatClient.Builder:

spring:
  ai:
    chat:
      client:
        enabled: false

Then create the corresponding configuration class, taking OpenAI as an example:

/**
 * @author LiZhiAo
 * @date 2024/6/19 20:47
 */

@Component
@RequiredArgsConstructor
public class OpenAiConfig {

    private final OpenAiChatModel openAiChatModel;

    public ChatClient openAiChatClient() {
        // build the client from the auto-configured OpenAI chat model,
        // keeping the default system prompt
        return ChatClient.builder(openAiChatModel)
                .defaultSystem("You are a friendly AI that answers based on user questions")
                .build();
    }
}

Then you can specify the model to call:

// Inject
private final OpenAiConfig openAiConfig;
// Call
Flux<ChatResponse> stream = openAiConfig.openAiChatClient().prompt(new Prompt(messages)).stream().chatResponse();
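The controller in the Summary section below also calls ollamaConfig.ollamaChatClient(); that class is not shown in this post, but a sketch mirroring the OpenAI configuration above might look like this (OllamaChatModel is the auto-configured Ollama counterpart):

@Component
@RequiredArgsConstructor
public class OllamaConfig {

    // auto-configured from the spring.ai.ollama.* properties
    private final OllamaChatModel ollamaChatModel;

    public ChatClient ollamaChatClient() {
        return ChatClient.builder(ollamaChatModel)
                .defaultSystem("You are a friendly AI that answers based on user questions")
                .build();
    }
}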

Chat Model#

Each model has its corresponding Chat Model, which is also automatically configured based on the configuration file. Taking OpenAiChatModel as an example, you can see the assembly process through the source code.


Thus, the invocation is also straightforward:

// Inject
private final OpenAiChatModel openAiChatModel;
// Call
Flux<ChatResponse> stream = openAiChatModel.stream(new Prompt(messages));
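If streaming is not required, the same model can also be called synchronously; a minimal sketch using the same accessors as the controller in the Summary section:

// Inject
private final OpenAiChatModel openAiChatModel;
// Blocking call: returns the complete response in one go
ChatResponse response = openAiChatModel.call(new Prompt(messages));
String content = response.getResult().getOutput().getContent();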

qwen2#

Some time ago, I tried using LM Studio to install llama3 and enable the Local Inference Server for debugging.


Unfortunately, while simple calls were successful, there were always errors in streaming output.

In the end, I switched to ollama + Open WebUI to expose the local model API.


Installation Steps#

The steps below assume a Windows machine with an NVIDIA graphics card; for other setups, refer to the Open WebUI installation documentation.

  1. Install ollama (optional)
  2. Install Docker Desktop
  3. Run the image
    If you completed step 1 and installed ollama on your computer, run:
    docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda
    
    If you skipped step 1, you can choose the image that comes with ollama:
    docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
    
  4. Download the model
    After the container is running, open the web management page and pull the model; for example, entering qwen2:7b downloads the 7B version of qwen2. (A quick Java sanity check of the local Ollama API is sketched after this list.)
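Once the container is up and the model is pulled, a quick check from Java confirms that the local Ollama API is reachable. This is only a sketch and assumes Ollama is listening on its default port 11434, matching the base-url in the application configuration earlier:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OllamaPing {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // GET /api/tags lists the models available locally
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:11434/api/tags"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // expect 200
        System.out.println(response.body());       // should include qwen2:7b after the pull
    }
}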

During steps 2 and 3, CUDA issues may arise:

 Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found

It turns out that NVIDIA driver version 555.85 may be the cause.


The solution is simple: just update Docker Desktop to the latest version.

In practice, the Chinese responses from qwen2:7b are significantly better than those from llama3:8b, with the only downside being the lack of multimodal support. However, it seems the development team is already working on it 🎉

Summary#

Backend Code#

The complete Controller is as follows:

@RestController
@RequestMapping("/llama3")
@CrossOrigin
@RequiredArgsConstructor
public class llama3Controller {

    private final OpenAiConfig openAiConfig;
    private final OllamaConfig ollamaConfig;

    private static final Integer MAX_MESSAGE = 10;

    private static Map<String, List<Message>> chatMessage = new ConcurrentHashMap<>();

    /**
     * Builds the prompt message list for a session
     * @param id Session ID
     * @param message User input message
     * @return Prompt message list
     */
    private List<Message> getMessages(String id, String message) {
        String systemPrompt = "{prompt}";
        SystemPromptTemplate systemPromptTemplate = new SystemPromptTemplate(systemPrompt);

        Message userMessage = new UserMessage(message);

        Message systemMessage = systemPromptTemplate.createMessage(MapUtil.of("prompt", "you are a helpful AI assistant"));

        List<Message> messages = chatMessage.get(id);

        // If no messages are retrieved, create new messages and add the system prompt and user input to the message list
        if (messages == null){
            messages = new ArrayList<>();
            messages.add(systemMessage);
            messages.add(userMessage);
        } else {
            messages.add(userMessage);
        }

        return messages;
    }

    /**
     * Create a connection (the {message} path variable is accepted but not used)
     */
    @GetMapping("/init/{message}")
    public String init() {
        return String.valueOf(UUID.randomUUID());
    }

    @GetMapping("chat/{id}/{message}")
    public SseEmitter chat(@PathVariable String id, @PathVariable String message, HttpServletResponse response) {

        response.setHeader("Content-type", "text/html;charset=UTF-8");
        response.setCharacterEncoding("UTF-8");

        SseEmitter emitter = SseEmitterUtils.connect(id);
        List<Message> messages = getMessages(id, message);
        System.err.println("chatMessage size: " + messages.size());
        System.err.println("chatMessage: " + chatMessage);

        if (messages.size() > MAX_MESSAGE){
            SseEmitterUtils.sendMessage(id, "Too many conversations, please try again later 🤔");
        } else {
            // Get the model's output stream
            Flux<ChatResponse> stream = ollamaConfig.ollamaChatClient().prompt(new Prompt(messages)).stream().chatResponse();

            // Send messages in the stream using SSE
            Mono<String> result = stream
                    .flatMap(it -> {
                        StringBuilder sb = new StringBuilder();
                        Optional.ofNullable(it.getResult().getOutput().getContent()).ifPresent(content -> {
                            SseEmitterUtils.sendMessage(id, content);
                            sb.append(content);
                        });

                        return Mono.just(sb.toString());
                    })
                    // Concatenate messages into a string
                    .reduce((a, b) -> a + b)
                    .defaultIfEmpty("");

            // Store messages in chatMessage as AssistantMessage
            result.subscribe(finalContent -> messages.add(new AssistantMessage(finalContent)));

            // Store messages in chatMessage
            chatMessage.put(id, messages);
        }
        return emitter;
    }
}
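The controller relies on an SseEmitterUtils helper that is not included in this post. Based on the connect and sendMessage calls above, a minimal sketch of what it might look like (an assumed implementation, not the original):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;

public class SseEmitterUtils {

    // one emitter per session id
    private static final Map<String, SseEmitter> EMITTERS = new ConcurrentHashMap<>();

    /** Create an emitter for the given session id and remember it. */
    public static SseEmitter connect(String id) {
        SseEmitter emitter = new SseEmitter(0L); // 0 = never time out
        emitter.onCompletion(() -> EMITTERS.remove(id));
        emitter.onTimeout(() -> EMITTERS.remove(id));
        EMITTERS.put(id, emitter);
        return emitter;
    }

    /** Send a message to the client bound to the given session id. */
    public static void sendMessage(String id, String message) {
        SseEmitter emitter = EMITTERS.get(id);
        if (emitter == null) {
            return;
        }
        try {
            emitter.send(message);
        } catch (IOException e) {
            EMITTERS.remove(id);
        }
    }
}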

Frontend Code#

I had GPT modify the frontend page to support Markdown rendering and code highlighting.


<!doctype html>
<html>

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.3.1/styles/default.min.css">
    <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.3.1/highlight.min.js"></script>
</head>

<body class="bg-zinc-100 dark:bg-zinc-800 min-h-screen p-4">
<div class="flex flex-col h-full">
    <div id="messages" class="flex-1 overflow-y-auto p-4 space-y-4">
        <div class="flex items-end">
            <img src="https://placehold.co/40x40" alt="avatar" class="rounded-full">
            <div class="ml-2 p-2 bg-white dark:bg-zinc-700 rounded-lg w-auto max-w-full">Hi~(⁄ ⁄•⁄ω⁄•⁄ ⁄)⁄</div>
        </div>
    </div>
    <div class="p-2">
        <input type="text" id="messageInput" placeholder="Please enter a message..."
               class="w-full p-2 rounded-lg border-2 border-zinc-300 dark:border-zinc-600 focus:outline-none focus:border-blue-500 dark:focus:border-blue-400">
        <button onclick="sendMessage()"
                class="mt-2 w-full bg-blue-500 hover:bg-blue-600 dark:bg-blue-600 dark:hover:bg-blue-700 text-white p-2 rounded-lg">Send</button>
    </div>
</div>
<script>
    let sessionId; // Used to store session ID
    let markdownBuffer = ''; // Buffer

    // Initialize marked and highlight.js
    marked.setOptions({
        highlight: function (code, lang) {
            const language = hljs.getLanguage(lang) ? lang : 'plaintext';
            return hljs.highlight(code, { language }).value;
        }
    });

    // Send HTTP request and handle response
    function sendHTTPRequest(url, method = 'GET', body = null) {
        return new Promise((resolve, reject) => {
            const xhr = new XMLHttpRequest();
            xhr.open(method, url, true);
            xhr.onload = () => {
                if (xhr.status >= 200 && xhr.status < 300) {
                    resolve(xhr.response);
                } else {
                    reject(xhr.statusText);
                }
            };
            xhr.onerror = () => reject(xhr.statusText);
            if (body) {
                xhr.setRequestHeader('Content-Type', 'application/json');
                xhr.send(JSON.stringify(body));
            } else {
                xhr.send();
            }
        });
    }

    // Handle SSE stream returned by the server
    function handleSSEStream(stream) {
        console.log('Stream started');
        markdownBuffer = ''; // Reset the buffer so each response starts fresh
        const messagesContainer = document.getElementById('messages');
        const responseDiv = document.createElement('div');
        responseDiv.className = 'flex items-end';
        responseDiv.innerHTML = `
    <img src="https://placehold.co/40x40" alt="avatar" class="rounded-full">
    <div class="ml-2 p-2 bg-white dark:bg-zinc-700 rounded-lg w-auto max-w-full"></div>
  `;
        messagesContainer.appendChild(responseDiv);

        const messageContentDiv = responseDiv.querySelector('div');

        // Listen for 'message' events, triggered when the backend sends new data
        stream.onmessage = function (event) {
            const data = event.data;
            console.log('Received data:', data);

            // Append received data to the buffer
            markdownBuffer += data;

            // Attempt to parse the buffer as Markdown and display it
            messageContentDiv.innerHTML = marked.parse(markdownBuffer);

            // Use highlight.js for code highlighting
            document.querySelectorAll('pre code').forEach((block) => {
                hljs.highlightElement(block);
            });

            // Keep the scrollbar at the bottom
            messagesContainer.scrollTop = messagesContainer.scrollHeight;
        };
    }

    // Send message
    function sendMessage() {
        const input = document.getElementById('messageInput');
        const message = input.value.trim();
        if (message) {
            const messagesContainer = document.getElementById('messages');
            const newMessageDiv = document.createElement('div');
            newMessageDiv.className = 'flex items-end justify-end';
            newMessageDiv.innerHTML = `
          <div class="mr-2 p-2 bg-green-200 dark:bg-green-700 rounded-lg max-w-xs">
            ${message}
          </div>
          <img src="https://placehold.co/40x40" alt="avatar" class="rounded-full">
        `;
            messagesContainer.appendChild(newMessageDiv);
            input.value = '';
            messagesContainer.scrollTop = messagesContainer.scrollHeight;

            // On the first message sent, send an init request to get the session ID
            if (!sessionId) {
                console.log('init');
                sendHTTPRequest(`http://127.0.0.1:8868/llama3/init/${encodeURIComponent(message)}`, 'GET')
                    .then(response => {
                        sessionId = response; // Store session ID
                        return handleSSEStream(new EventSource(`http://127.0.0.1:8868/llama3/chat/${sessionId}/${encodeURIComponent(message)}`))
                    });

            } else {
                // Subsequent requests are sent directly to the chat interface
                handleSSEStream(new EventSource(`http://127.0.0.1:8868/llama3/chat/${sessionId}/${encodeURIComponent(message)}`))
            }
        }
    }
</script>
</body>

</html>

References#

Spring AI
Open WebUI
2024 Latest Spring AI Beginner to Expert Tutorial (a complete guide to AI large model application development)
The document accompanying the above video (password: wrp6)
