Does Ollama have parallelism internally?

0 votes
1 answer
105 views
Asked on 2025-04-13 17:42

I wrote a Python program to translate a large amount of English text into French. What I do is feed a batch of reports to Ollama through a loop:

from functools import cached_property

from ollama import Client


class TestOllama:

    @cached_property
    def ollama_client(self) -> Client:
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
            prompt=f"translate this English text into French: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        reports = ["reports_text_1", "reports_text_2"]  # ...; average text size per report is between 750 and 1000 tokens.
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text:{translated_report}, Time taken:{total_duration}")
            except Exception as e:
                # Don't silently swallow failures; at least log what went wrong.
                print(f"Translation failed: {e}")


if __name__ == '__main__':
    job = TestOllama()
    job.run()

Docker command used to run Ollama:

docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama

My question is this: when I run this script on a V100 and on an H100, there is no noticeable difference in execution time. I also avoided parallel processing because I assumed Ollama might already parallelize internally, but when I watch with htop I only see one core working. Is my understanding correct?

I am a beginner in NLP; any help or guidance on structuring my code (e.g., using multithreading to send Ollama requests) would be greatly appreciated.

1 Answer

0

As far as I know, as of now (March 29, 2024), ollama does not yet support parallel processing.

Since you have two GPUs, you can try running two or more (not recommended) ollama containers on different ports. Here is an example.

OK, I'll make my answer more complete. Here are some steps for your reference:

  1. Pull the official ollama image from Docker Hub: docker pull ollama/ollama:latest

  2. Run the first docker container with:

    docker run -d \
    --gpus device=0 \
    -v {the_dir_u_save_models}:/root/.ollama \
    -p {port1}:11434 \
    --name ollama \
    ollama/ollama
    
  3. Run the second docker container with:

    docker run -d \
    --gpus device=1 \
    -v {the_dir_u_save_models}:/root/.ollama \
    -p {port2}:11434 \
    --name ollama2 \
    ollama/ollama
    

    Note the differences in the --gpus, -p, and --name parameters between the two commands.

  4. Use async in Python or another language to manage your parallel-processing logic. Here is some pseudocode describing the process:

    function processQuestionsAsync(questions) {
        // Sort questions based on a specific criterion, e.g., LLM
        questions.sort((a, b) => {
            // Example: sorting by 'LLM'. Adjust comparison logic as needed.
            return a.LLM.localeCompare(b.LLM);
        });
    
        // Divide questions into two batches
        let batch1 = questions.slice(0, questions.length / 2);
        let batch2 = questions.slice(questions.length / 2);
    
        // Initialize an array to collect responses
        let responses = [];
    
        // Function to send questions to a server asynchronously
        async function sendToServer(batch, serverURL) {
            for (let question of batch) {
                let response = await sendQuestionToServer(question, serverURL);
                // Once a response is received, send it to the front-end
                sendResponseToFrontEnd(response);
                // Store response in the responses array
                responses.push(response);
            }
        }
    
        // Define server URLs for each batch
        let serverURL1 = "http://server1.com/api";
        let serverURL2 = "http://server2.com/api";
    
        // Use Promise.all to handle both batches in parallel
        Promise.all([
            sendToServer(batch1, serverURL1),
            sendToServer(batch2, serverURL2)
        ]).then(() => {
            // All questions have been processed, and all responses have been sent to the front-end
            console.log("All questions processed");
        }).catch(error => {
            // Handle errors
            console.error("An error occurred:", error);
        });
    }
    
    // Function to simulate sending a question to a server
    async function sendQuestionToServer(question, serverURL) {
        // Simulate network request
        let response = await fetch(serverURL, {
            method: 'POST',
            body: JSON.stringify(question),
            headers: {
                'Content-Type': 'application/json'
            },
        });
    
        // Return the response data
        return await response.json();
    }
    
    // Function to simulate sending a response back to the front-end
    function sendResponseToFrontEnd(response) {
        console.log("Sending response to front-end:", response);
    }
    
    // Example usage with questions as objects including 'contentType'
    let questions = [
        { question: "What is 2+2?", contentType: "Math", LLM: "GPT-3" },
        { question: "What is the capital of France?", contentType: "Geography", LLM: "GPT-3" },
        { question: "What is the largest ocean?", contentType: "Geography", LLM: "GPT-4" },
        { question: "What is the speed of light?", contentType: "Physics", LLM: "GPT-3" }
    ];
    processQuestionsAsync(questions);
    

The pseudocode above is not the important part. The key point is that you may receive requests from the front end asking for different LLMs, and a single ollama instance would then have to keep loading and unloading models to switch between them. Try to avoid that.
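To tie this back to the asker's Python code, here is a minimal sketch of the same fan-out idea using the two containers from steps 2 and 3. This is an illustration, not the original answer's code: the ports 11435 and 11436 stand in for {port1}/{port2}, the model and prompt are copied from the question, and a ThreadPoolExecutor is used instead of async because Client.generate is a blocking call.

from concurrent.futures import ThreadPoolExecutor

from ollama import Client

# Placeholder host ports; adjust to the {port1}/{port2} mappings you chose above.
OLLAMA_HOSTS = ["http://127.0.0.1:11435", "http://127.0.0.1:11436"]
CLIENTS = [Client(host=host) for host in OLLAMA_HOSTS]


def translate(client: Client, text_to_translate: str) -> str:
    # Same request as the asker's translate(), just parameterised by client.
    response = client.generate(
        model="mistral",
        prompt=f"translate this English text into French: {text_to_translate}",
    )
    return response["response"].lstrip()


def translate_batch(client: Client, batch: list[str]) -> list[str]:
    # Each ollama instance works through its own batch sequentially,
    # mirroring sendToServer() in the pseudocode above.
    return [translate(client, text) for text in batch]


def run_parallel(reports: list[str]) -> list[str]:
    # Split the reports in half and let each container process one half concurrently.
    mid = len(reports) // 2
    batches = [reports[:mid], reports[mid:]]
    with ThreadPoolExecutor(max_workers=len(CLIENTS)) as pool:
        futures = [pool.submit(translate_batch, client, batch)
                   for client, batch in zip(CLIENTS, batches)]
        return [result for future in futures for result in future.result()]


if __name__ == "__main__":
    reports = ["reports_text_1", "reports_text_2"]  # placeholder data
    for translated in run_parallel(reports):
        print(translated)

Splitting into two fixed batches mirrors the batch1/batch2 logic in the pseudocode; if your reports vary a lot in length, having both workers pull from a shared queue would balance the load better.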

By the way, from my experience running ollama on an A100, parallel processing may not be a good choice. VRAM is indeed plentiful, but compute is always the limiting factor.
