Spring AI (5): Local Model Deployment - Llama3

In the previous chapters we used large online models such as ChatGPT. Although API prices for the major models keep falling, heavy usage still adds up, and the services come with various restrictions. Deploying a smaller model locally often gives good results: it lowers costs, imposes fewer restrictions, and, most importantly, allows a degree of customization. For companies and organizations this is clearly attractive.

This section will introduce how to install and deploy the latest Llama3 model using Ollama and call it with Spring AI.

Installing Ollama#

Installing Ollama is straightforward; just download the installation package for your operating system from the official website and install it. After installation and startup, an icon will appear in the system tray.


Then open the terminal and enter ollama -v to verify the version.


At this point Ollama is installed, but it only supports command-line interaction. If you want a graphical interface, you can install a web UI (you can skip this step if you only plan to call it from code).

For Windows systems, if you have already installed WSL and Docker, you can run it using the following commands:

(1) Run on CPU only:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

(2) Run with GPU support:

docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

After installation, access it via the local address: http://127.0.0.1:3000

Known Issues

When the image is installed inside WSL, it cannot download models from the internet and only sees locally installed models; even after selecting a model, the conversation still fails. It is unclear whether this is caused by port mapping between WSL and the Windows host.
Installing Docker Desktop on the Windows host and running the image there works fine. If you want to run it with a GPU, you may also need to set up a Conda environment.

Llama3 Model#

To install the model, you can first search on the Ollama official website.


The native Llama3 is trained primarily on English. You can force it to answer in Chinese through prompts, but the instruction is often ignored, so I recommend using one of the Llama3 models fine-tuned on Chinese.

You can search for llama3-chinese to find the Chinese fine-tuned versions.


Take the model I used, wangshenzhi/llama3-8b-chinese-chat-ollama-q8, as an example.


Copy the run command from the model page and execute it in the terminal.

In most cases only the 8B model is worth installing locally; larger variants are best run on dedicated compute cards.

ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q8

ollama run starts the model. If the model has not been downloaded yet, it is pulled first; if it is already present, it starts right away.


At this point, you can directly have a conversation in the terminal.


Framework Invocation#

Building on the previous section, we can reuse the code as-is and only need to change the model configuration.

private static OllamaChatClient getClient(){
    var ollamaApi = new OllamaApi();

    return new OllamaChatClient(ollamaApi).withDefaultOptions(OllamaOptions.create()
            .withModel("wangshenzhi/llama3-8b-chinese-chat-ollama-q8")
            .withTemperature(0.4f));
}

Here, OllamaApi has a default baseUrl of http://localhost:11434. If your model is not deployed locally, you need to modify this address.
.withModel("wangshenzhi/llama3-8b-chinese-chat-ollama-q8") specifies the model; make sure to use the full name.

Other parts do not need to be changed; just run it directly.
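
If you only want a quick sanity check before wiring everything up to the SSE endpoint, a blocking call is enough. A minimal sketch, assuming the ChatClient interface in this Spring AI version exposes call(String) returning the reply text:

OllamaChatClient client = getClient();
// One-shot blocking call; the prompt text is only an example
String answer = client.call("Briefly introduce yourself");
System.out.println(answer);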


Known issues with Llama3: first, the model is small, so answers are often wrong or hallucinated; second, it does not appear to support Function Calling.

The complete code is as follows:

/**
 * @author lza
 * @date 2024/04/22-10:31
 **/

@RestController
@RequestMapping("ollama")
@RequiredArgsConstructor
@CrossOrigin
public class OllamaController {


    // Maximum number of messages kept per conversation; further input is rejected beyond this
    private static final Integer MAX_MESSAGE = 10;

    // In-memory conversation history, keyed by the session id returned from /init
    private static final Map<String, List<Message>> chatMessage = new ConcurrentHashMap<>();


    /**
     * Create OllamaChatClient
     * @return OllamaChatClient
     */
    private static OllamaChatClient getClient(){
        var ollamaApi = new OllamaApi();

        return new OllamaChatClient(ollamaApi).withDefaultOptions(OllamaOptions.create()
                        .withModel("wangshenzhi/llama3-8b-chinese-chat-ollama-q8")
                        .withTemperature(0.4f));
    }


    /**
     * Build the prompt message list for a conversation
     * @param id Conversation id
     * @param message User input message
     * @return Message list
     */
    private List<Message> getMessages(String id, String message) {
        String systemPrompt = "{prompt}";
        SystemPromptTemplate systemPromptTemplate = new SystemPromptTemplate(systemPrompt);

        Message userMessage = new UserMessage(message);

        Message systemMessage = systemPromptTemplate.createMessage(MapUtil.of("prompt", "you are a helpful AI assistant"));

        List<Message> messages = chatMessage.get(id);


        // If no messages are retrieved, create new messages and add the system prompt and user input to the message list
        if (messages == null){
            messages = new ArrayList<>();
            messages.add(systemMessage);
            messages.add(userMessage);
        } else {
            messages.add(userMessage);
        }

        return messages;
    }

    /**
     * Initialize function call options (reused from the previous chapter; not used in this controller, since Llama3 does not appear to support Function Calling)
     * @return ChatOptions
     */
    private ChatOptions initFunc(){
        return OpenAiChatOptions.builder().withFunctionCallbacks(List.of(
                FunctionCallbackWrapper.builder(new MockWeatherService()).withName("weather").withDescription("Get the weather in location").build(),
                FunctionCallbackWrapper.builder(new WbHotService()).withName("wbHot").withDescription("Get the hot list of Weibo").build(),
                FunctionCallbackWrapper.builder(new TodayNews()).withName("todayNews").withDescription("60s watch world news").build(),
                FunctionCallbackWrapper.builder(new DailyEnglishFunc()).withName("dailyEnglish").withDescription("A daily inspirational sentence in English").build())).build();
    }

    /**
     * Create connection
     */
    @SneakyThrows
    @GetMapping("/init/{message}")
    public String init() {
        return String.valueOf(UUID.randomUUID());
    }

    @GetMapping("chat/{id}/{message}")
    public SseEmitter chat(@PathVariable String id, @PathVariable String message, HttpServletResponse response) {

        response.setHeader("Content-type", "text/html;charset=UTF-8");
        response.setCharacterEncoding("UTF-8");

        OllamaChatClient client = getClient();
        SseEmitter emitter = SseEmitterUtils.connect(id);
        List<Message> messages = getMessages(id, message);
        System.err.println("chatMessage size: " + messages.size());
        System.err.println("chatMessage: " + chatMessage);

        if (messages.size() > MAX_MESSAGE){
            SseEmitterUtils.sendMessage(id, "Too many conversation attempts, please try again later🤔");
        }else {
            // Get the model's output stream
            Flux<ChatResponse> stream = client.stream(new Prompt(messages));

            // Send the messages in the stream using SSE
            Mono<String> result = stream
                    .flatMap(it -> {
                        StringBuilder sb = new StringBuilder();
                        String content = it.getResult().getOutput().getContent();
                        Optional.ofNullable(content).ifPresent(r -> {
                            SseEmitterUtils.sendMessage(id, content);
                            sb.append(content);
                        });
                        return Mono.just(sb.toString());
                    })
                    // Concatenate the messages into a string
                    .reduce((a, b) -> a + b)
                    .defaultIfEmpty("");

            // Store the messages in chatMessage as AssistantMessage
            result.subscribe(finalContent -> messages.add(new AssistantMessage(finalContent)));

            // Store the messages in chatMessage
            chatMessage.put(id, messages);

        }
        return emitter;

    }
}
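
To try the controller, first request a conversation id from /init and then consume the SSE stream from /chat. Below is a minimal sketch using Spring's WebClient; it assumes the application runs on port 8080 and that spring-webflux is on the classpath, so adjust both to your setup:

import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

public class OllamaChatSmokeTest {

    public static void main(String[] args) {
        WebClient webClient = WebClient.create("http://localhost:8080");

        // Obtain a conversation id (the trailing path segment is not used by the server)
        String id = webClient.get().uri("/ollama/init/{message}", "hi")
                .retrieve().bodyToMono(String.class).block();

        // Stream the SSE response and print each chunk as it arrives
        webClient.get().uri("/ollama/chat/{id}/{message}", id, "你好,请介绍一下你自己")
                .accept(MediaType.TEXT_EVENT_STREAM)
                .retrieve()
                .bodyToFlux(String.class)
                .doOnNext(System.out::print)
                .blockLast();
    }
}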

Free Resources#

Here are some free Llama3 models and APIs:

Groq reportedly uses its own dedicated inference chips instead of NVIDIA GPUs and is very fast (not to be confused with Musk's Grok; see the sketch after this list).
NVIDIA offers many free models for use.
Cloudflare has many models available for deployment with free quotas.
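
Some of these services expose OpenAI-compatible endpoints, so the OpenAI client from the earlier chapters can often be reused by pointing it at a different base URL. A rough sketch for Groq (the base URL, the model id llama3-8b-8192, and the constructor usage are assumptions; check the provider's documentation and your Spring AI version):

private static OpenAiChatClient getGroqClient(String apiKey){
    // Assumption: Groq's OpenAI-compatible API is served under /openai relative to this base URL
    var openAiApi = new OpenAiApi("https://api.groq.com/openai", apiKey);

    return new OpenAiChatClient(openAiApi, OpenAiChatOptions.builder()
            .withModel("llama3-8b-8192")   // assumed Groq model id for Llama3 8B
            .withTemperature(0.4f)
            .build());
}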

