Does Ollama have parallelism internally?

0 votes
1 answer
105 views
Asked on 2025-04-13 17:42

I wrote a Python program to translate a large amount of English text into French. What I do is feed a batch of reports to Ollama through a loop:

from functools import cached_property

from ollama import Client


class TestOllama:

    @cached_property
    def ollama_client(self) -> Client:
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
            prompt=f"translate this English text into French: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        reports = ["reports_text_1", "reports_text_2"]  # ...; average text size per report is between 750 and 1000 tokens.
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text:{translated_report}, Time taken:{total_duration}")
            except Exception as e:
                # Don't silently swallow failures; at least log what went wrong.
                print(f"Translation failed: {e}")


if __name__ == '__main__':
    job = TestOllama()
    job.run()

Docker command used to run Ollama:

docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama

My question is this: when I run this script on a V100 and on an H100, there is no noticeable difference in execution time. I also avoided parallel processing because I assumed Ollama might already parallelize internally, but when I watch with htop I only see one core working. Is my understanding correct?

I am a beginner in NLP; any help or guidance on structuring my code (e.g., using multithreading to send Ollama requests) would be greatly appreciated.

1 Answer

0

As far as I know, as of now (March 29, 2024), ollama does not yet support parallel processing.

Since you have two GPUs, you can try running two or more (not recommended) ollama containers on different ports. Here is an example.

OK, I'll make my answer more complete. Here are some steps for your reference:

  1. Pull the official ollama image from Docker Hub: docker pull ollama/ollama:latest

  2. Run the first docker container with:

    docker run -d \
    --gpus device=0 \
    -v {the_dir_u_save_models}:/root/.ollama \
    -p {port1}:11434 \
    --name ollama \
    ollama/ollama
    
  3. Run the second docker container with:

    docker run -d \
    --gpus device=1 \
    -v {the_dir_u_save_models}:/root/.ollama \
    -p {port2}:11434 \
    --name ollama2 \
    ollama/ollama
    

    Note the differences in the --gpus, -p, and --name parameters between the two commands.

  4. Use async in Python or another language to manage your parallel-processing logic. Here is some pseudocode describing the process:

    function processQuestionsAsync(questions) {
        // Sort questions based on a specific criterion, e.g., LLM
        questions.sort((a, b) => {
            // Example: sorting by 'LLM'. Adjust comparison logic as needed.
            return a.LLM.localeCompare(b.LLM);
        });
    
        // Divide questions into two batches
        let batch1 = questions.slice(0, questions.length / 2);
        let batch2 = questions.slice(questions.length / 2);
    
        // Initialize an array to collect responses
        let responses = [];
    
        // Function to send questions to a server asynchronously
        async function sendToServer(batch, serverURL) {
            for (let question of batch) {
                let response = await sendQuestionToServer(question, serverURL);
                // Once a response is received, send it to the front-end
                sendResponseToFrontEnd(response);
                // Store response in the responses array
                responses.push(response);
            }
        }
    
        // Define server URLs for each batch
        let serverURL1 = "http://server1.com/api";
        let serverURL2 = "http://server2.com/api";
    
        // Use Promise.all to handle both batches in parallel
        Promise.all([
            sendToServer(batch1, serverURL1),
            sendToServer(batch2, serverURL2)
        ]).then(() => {
            // All questions have been processed, and all responses have been sent to the front-end
            console.log("All questions processed");
        }).catch(error => {
            // Handle errors
            console.error("An error occurred:", error);
        });
    }
    
    // Function to simulate sending a question to a server
    async function sendQuestionToServer(question, serverURL) {
        // Simulate network request
        let response = await fetch(serverURL, {
            method: 'POST',
            body: JSON.stringify(question),
            headers: {
                'Content-Type': 'application/json'
            },
        });
    
        // Return the response data
        return await response.json();
    }
    
    // Function to simulate sending a response back to the front-end
    function sendResponseToFrontEnd(response) {
        console.log("Sending response to front-end:", response);
    }
    
    // Example usage with questions as objects including 'contentType'
    let questions = [
        { question: "What is 2+2?", contentType: "Math", LLM: "GPT-3" },
        { question: "What is the capital of France?", contentType: "Geography", LLM: "GPT-3" },
        { question: "What is the largest ocean?", contentType: "Geography", LLM: "GPT-4" },
        { question: "What is the speed of light?", contentType: "Physics", LLM: "GPT-3" }
    ];
    processQuestionsAsync(questions);
    

The pseudocode above is not the important part. The key point is that you may receive requests from the front end asking for different LLMs, and a single ollama instance would then have to keep loading and unloading models to switch between them. Try to avoid that.
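To tie this back to the asker's Python code, here is a minimal sketch of the same fan-out idea using the two containers from steps 2 and 3. This is an illustration, not the original answer's code: the ports 11435 and 11436 stand in for {port1}/{port2}, the model and prompt are copied from the question, and a ThreadPoolExecutor is used instead of async because Client.generate is a blocking call.

from concurrent.futures import ThreadPoolExecutor

from ollama import Client

# Placeholder host ports; adjust to the {port1}/{port2} mappings you chose above.
OLLAMA_HOSTS = ["http://127.0.0.1:11435", "http://127.0.0.1:11436"]
CLIENTS = [Client(host=host) for host in OLLAMA_HOSTS]


def translate(client: Client, text_to_translate: str) -> str:
    # Same request as the asker's translate(), just parameterised by client.
    response = client.generate(
        model="mistral",
        prompt=f"translate this English text into French: {text_to_translate}",
    )
    return response["response"].lstrip()


def translate_batch(client: Client, batch: list[str]) -> list[str]:
    # Each ollama instance works through its own batch sequentially,
    # mirroring sendToServer() in the pseudocode above.
    return [translate(client, text) for text in batch]


def run_parallel(reports: list[str]) -> list[str]:
    # Split the reports in half and let each container process one half concurrently.
    mid = len(reports) // 2
    batches = [reports[:mid], reports[mid:]]
    with ThreadPoolExecutor(max_workers=len(CLIENTS)) as pool:
        futures = [pool.submit(translate_batch, client, batch)
                   for client, batch in zip(CLIENTS, batches)]
        return [result for future in futures for result in future.result()]


if __name__ == "__main__":
    reports = ["reports_text_1", "reports_text_2"]  # placeholder data
    for translated in run_parallel(reports):
        print(translated)

Splitting into two fixed batches mirrors the batch1/batch2 logic in the pseudocode; if your reports vary a lot in length, having both workers pull from a shared queue would balance the load better.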

By the way, from my experience running ollama on an A100, parallel processing may not be a good choice. VRAM is indeed plentiful, but compute is always the limiting factor.
