Run Llama.cpp with DeepSeek on RISC-V (Licheepi 4A)
This article explains how to run Llama.cpp on the RISC-V-based Licheepi 4A and use DeepSeek models for local inference.
Overview
Llama.cpp is an efficient large language model inference library implemented in C/C++ that can run on resource-constrained devices. DeepSeek is a high-performance open-source large language model. This tutorial guides you through deploying both on the RISC-V-based Licheepi 4A.
Environment Preparation
- Licheepi 4A development board
- RevyOS system
- Necessary dependencies
The C910 series chips use XTheadVector (RVV0p7), while llama.cpp officially only supports RVV1p0, so we use OpenBLAS as the BLAS backend in order to take advantage of the RVV0p7 instruction set.
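As an optional sanity check, you can look at the ISA string the kernel reports for the CPU; the exact output depends on the kernel version, so treat this only as a quick way to confirm what the board advertises:
grep isa /proc/cpuinfo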
First, install the relevant dependencies:
sudo apt install pkg-config
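Depending on how your RevyOS image was set up, the later steps also assume common build tools such as git, cmake, and a native compiler are present. If any of them are missing, they can likely be installed the same way (package names assumed to match the standard Debian/RevyOS repositories):
sudo apt install git cmake build-essential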
Obtaining the Toolchain
Although the new GCC 14 has support for XTheadVector, some instructions do not correspond to RVV0p7, and forcing XTheadVector there results in compilation errors. Therefore, we need a toolchain that targets RVV0p7.
You can download the source code of the T-Head toolchain and compile it yourself, but compiling it on the Licheepi 4A takes quite a long time. Alternatively, you can use the XThead toolchain provided by PLCT's RuyiSDK, as follows:
cd ~
wget https://mirror.iscas.ac.cn/ruyisdk/dist/RuyiSDK-20240222-T-Head-Sources-T-Head-2.8.0-HOST-riscv64-linux-gnu-riscv64-plctxthead-linux-gnu.tar.xz
tar -xvf RuyiSDK-20240222-T-Head-Sources-T-Head-2.8.0-HOST-riscv64-linux-gnu-riscv64-plctxthead-linux-gnu.tar.xz
cd RuyiSDK-20240222-T-Head-Sources-T-Head-2.8.0-HOST-riscv64-linux-gnu-riscv64-plctxthead-linux-gnu/bin
export PATH=$(pwd):$PATH
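To confirm the toolchain is now on your PATH, you can check that the cross compiler resolves and prints its version (a sanity check only, not a required step):
riscv64-plctxthead-linux-gnu-gcc --version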
Compiling OpenBLAS
Support for RVV0p7 is currently only available through OpenBLAS; other options, such as BLIS and llama.cpp's built-in CPU backend, only support RVV1p0.
cd ~
git clone https://github.com/OpenMathLib/OpenBLAS
cd OpenBLAS
make HOSTCC=gcc TARGET=C910V CC=riscv64-plctxthead-linux-gnu-gcc FC=riscv64-plctxthead-linux-gnu-gfortran
sudo make install PREFIX=/usr
sudo make install PREFIX=~/RuyiSDK-20240222-T-Head-Sources-T-Head-2.8.0-HOST-riscv64-linux-gnu-riscv64-plctxthead-linux-gnu/riscv64-plctxthead-linux-gnu/sysroot/usr
If you compiled the T-Head toolchain yourself, please replace riscv64-plctxthead-linux-gnu-gcc and riscv64-plctxthead-linux-gnu-gfortran with your own toolchain's compilers.
Please note that if you use the toolchain provided by PLCT, OpenBLAS must be installed under the toolchain's sysroot path, i.e., RuyiSDK-20240222-T-Head-Sources-T-Head-2.8.0-HOST-riscv64-linux-gnu-riscv64-plctxthead-linux-gnu/riscv64-plctxthead-linux-gnu/sysroot/usr. If you use a self-compiled toolchain, you can install normally to /usr/local.
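To verify that the library ended up where the cross compiler will look for it, you can list the installed files (this assumes the PLCT toolchain layout used above, with make install placing the libraries under the lib subdirectory of the prefix):
ls ~/RuyiSDK-20240222-T-Head-Sources-T-Head-2.8.0-HOST-riscv64-linux-gnu-riscv64-plctxthead-linux-gnu/riscv64-plctxthead-linux-gnu/sysroot/usr/lib/libopenblas*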
Compiling Llama.cpp
Llama.cpp needs to be configured to use the OpenBLAS backend:
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Since the upstream build defaults to the RVV1p0 instruction set, we need to modify the llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt file and change -march=rv64gcv to -march=rv64gcv0p7:
diff --git a/ggml/src/ggml-cpu/CMakeLists.txt b/ggml/src/ggml-cpu/CMakeLists.txt
index 98fd18e..0e6f302 100644
--- a/ggml/src/ggml-cpu/CMakeLists.txt
+++ b/ggml/src/ggml-cpu/CMakeLists.txt
@@ -306,7 +306,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
     elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "riscv64")
         message(STATUS "RISC-V detected")
         if (GGML_RVV)
-            list(APPEND ARCH_FLAGS -march=rv64gcv -mabi=lp64d)
+            list(APPEND ARCH_FLAGS -march=rv64gcv0p7 -mabi=lp64d)
         endif()
     else()
         message(STATUS "Unknown architecture")
After making this change, configure and build llama.cpp with the OpenBLAS backend enabled:
CC=riscv64-plctxthead-linux-gnu-gcc FC=riscv64-plctxthead-linux-gnu-gfortran cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j4
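If you want to confirm that the resulting binaries were actually built against OpenBLAS, one possible check (assuming OpenBLAS was linked as a shared library) is to inspect the shared-library dependencies from inside the llama.cpp directory:
ldd build/bin/llama-cli | grep openblas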
If you encounter the following issues:
- Cannot find libopenblas.so: check that the OpenBLAS installation path from the previous step is correct.
- Cannot find riscv64-plctxthead-linux-gnu-gcc or riscv64-plctxthead-linux-gnu-gfortran: check that the toolchain has been added to the PATH correctly.
- Illegal instruction error at runtime: check that you are using the RVV0p7 toolchain and that you modified the CMakeLists.txt correctly.
Obtaining the Model
Directly download the DeepSeek-R1-Distill-Qwen-1.5B-GGUF Q2_K model, which is smaller and more suitable for Licheepi 4A.
More models can be found on Hugging Face.
Note that these are not the original DeepSeek-R1 but distilled models. The original DeepSeek-R1 has 671B parameters, roughly 100-600 times the parameter count of the models above.
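For example, one way to fetch the Q2_K file is to download it with wget directly from a Hugging Face GGUF repository (the repository name below is an assumption; adjust the URL to whichever GGUF repository you actually use):
cd ~
wget https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf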
Running
Running Directly in CLI
cd ~
llama.cpp/build/bin/llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf -t 4 --prompt '你好!' -no-cnv
Please replace 你好! ("Hello!") with your own prompt. -t 4 means using 4 threads and can be adjusted to your needs; -m is the model path, so adjust it to match your model.
Interactive Running
cd ~
llama.cpp/build/bin/llama-server -m DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf -t 4
Please replace the path after -m with your model path. -t 4 means using 4 threads and can be adjusted to your needs.
Interactive mode requires a client. The server listens on 127.0.0.1:8080 by default and the API uses an OpenAI-API-compatible format, so you can use any client you like to interact with it.
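As a quick test of the OpenAI-compatible interface without writing any client code, you can send a chat completion request with curl; a minimal request like the following should work with recent llama-server builds:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'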
Here's an example client program in Python (the role markers at the top follow the DeepSeek-R1 chat template; adjust them if your model expects a different prompt format):
import requests

# Role markers for the DeepSeek-R1 chat template (assumed; change them if your
# model uses a different prompt format).
start_mask = "<｜begin▁of▁sentence｜>"
system_mask = ""
user_mask = "<｜User｜>"
assistant_mask = "<｜Assistant｜>"

class Message:
    SYSTEM = 1
    USER = 2
    ASSISTANT = 3

    role: int
    text: str

    def __init__(self, text, role=2):  # role defaults to USER
        self.text = text
        self.role = role

    def __str__(self):
        # Prefix the text with the marker of the role that produced it.
        res = ""
        if self.role == Message.SYSTEM:
            res += system_mask
        elif self.role == Message.USER:
            res += user_mask
        elif self.role == Message.ASSISTANT:
            res += assistant_mask
        res += self.text
        res += "\n"
        return res

class Conversation:
    messages: list[Message]
    url: str

    def __init__(self, system_prompt=None, url="http://127.0.0.1:8080/completion"):
        self.messages = [] if system_prompt is None else [
            Message(system_prompt, Message.SYSTEM)
        ]
        self.url = url

    def _post_chat(self):
        # Build the full prompt from the conversation history and ask the
        # llama-server /completion endpoint for the next reply.
        prompt = start_mask + "\n"
        for msg in self.messages:
            prompt += str(msg) + "\n"
        prompt += assistant_mask
        res = requests.post(self.url, json={"prompt": prompt})
        content = res.json()["content"]
        self.messages.append(Message(content, Message.ASSISTANT))
        return content

    def chat(self, text):
        if len(text) > 0:
            self.messages.append(Message(text, Message.USER))
        return self._post_chat()

def main():
    conv = Conversation()
    while True:
        text = input("You: ")
        if text == "exit":
            break
        print("Assistant:", conv.chat(text))

if __name__ == "__main__":
    main()
Enter exit to quit.
You may notice that the interactive client takes a long time to respond; this is because DeepSeek-R1 goes through a long "thinking" phase before the whole reply is returned at once. Please be patient.
Save the script above as client.py, then create a virtual environment, install the requests dependency, and run it:
cd ~
python3 -m venv venv
source venv/bin/activate
pip install requests
python3 client.py
Reference document: https://github.com/wychlw/plct/blob/main/memo/deepseek_on_llama.cpp.md