Does Ollama parallelize internally?
I wrote a Python program to translate a large amount of French text into English. All it does is loop over a set of reports and feed them to Ollama one at a time.
from functools import cached_property

from ollama import Client


class TestOllama:
    @cached_property
    def ollama_client(self) -> Client:
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
            prompt=f"translate this French text into English: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        # average text size per report is between 750-1000 tokens
        reports = ["reports_text_1", "reports_text_2", ...]
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text: {translated_report}, Time taken: {total_duration}")
            except Exception as e:
                pass


if __name__ == '__main__':
    job = TestOllama()
    job.run()
Docker command used to run Ollama:
docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama
My question: when I run this script on a V100 and on an H100, I see no noticeable difference in execution time. I also avoided parallelizing on my side because I assumed Ollama might already do parallel processing internally, but when I watch htop I only see a single core doing any work. Is my understanding correct?
I am a beginner in NLP; I would really appreciate any help or guidance on structuring my code (for example, using multithreading to send requests to Ollama, etc.).
1 Answer
As far as I know, as of today (March 29, 2024), ollama does not support parallel processing.
Since you have two GPUs, you can try running two (or, not recommended, more) ollama containers on different ports. Here is an example.
OK, I will make my answer more complete. Here are some steps for your reference:
Pull the official ollama image from Docker Hub:
docker pull ollama/ollama:latest
Run the first docker container with the following command:
docker run -d \
  --gpus device=0 \
  -v {the_dir_u_save_models}:/root/.ollama \
  -p {port1}:11434 \
  --name ollama1 \
  ollama/ollama
Run the second docker container with the following command:
docker run -d \
  --gpus device=1 \
  -v {the_dir_u_save_models}:/root/.ollama \
  -p {port2}:11434 \
  --name ollama2 \
  ollama/ollama
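Since both containers mount the same {the_dir_u_save_models} directory, a model only needs to be pulled once and will then be visible to both instances. Assuming the container names used above, for example:
docker exec ollama1 ollama pull mistral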
Note how the --gpus, -p, and --name parameters differ between the two commands. Then use async in Python (or whatever language you prefer) to manage the parallel-processing logic yourself. Here is some pseudocode describing the process:
function processQuestionsAsync(questions) {
    // Sort questions based on a specific criterion, e.g., LLM
    questions.sort((a, b) => {
        // Example: sorting by 'LLM'. Adjust comparison logic as needed.
        return a.LLM.localeCompare(b.LLM);
    });

    // Divide questions into two batches
    let batch1 = questions.slice(0, questions.length / 2);
    let batch2 = questions.slice(questions.length / 2);

    // Initialize an array to collect responses
    let responses = [];

    // Function to send questions to a server asynchronously
    async function sendToServer(batch, serverURL) {
        for (let question of batch) {
            let response = await sendQuestionToServer(question, serverURL);
            // Once a response is received, send it to the front-end
            sendResponseToFrontEnd(response);
            // Store response in the responses array
            responses.push(response);
        }
    }

    // Define server URLs for each batch
    let serverURL1 = "http://server1.com/api";
    let serverURL2 = "http://server2.com/api";

    // Use Promise.all to handle both batches in parallel
    Promise.all([
        sendToServer(batch1, serverURL1),
        sendToServer(batch2, serverURL2)
    ]).then(() => {
        // All questions have been processed and all responses sent to the front-end
        console.log("All questions processed");
    }).catch(error => {
        // Handle errors
        console.error("An error occurred:", error);
    });
}

// Function to simulate sending a question to a server
async function sendQuestionToServer(question, serverURL) {
    // Simulate network request
    let response = await fetch(serverURL, {
        method: 'POST',
        body: JSON.stringify(question),
        headers: { 'Content-Type': 'application/json' },
    });
    // Return the response data
    return await response.json();
}

// Function to simulate sending a response back to the front-end
function sendResponseToFrontEnd(response) {
    console.log("Sending response to front-end:", response);
}

// Example usage with questions as objects including 'contentType'
let questions = [
    { question: "What is 2+2?", contentType: "Math", LLM: "GPT-3" },
    { question: "What is the capital of France?", contentType: "Geography", LLM: "GPT-3" },
    { question: "What is the largest ocean?", contentType: "Geography", LLM: "GPT-4" },
    { question: "What is the speed of light?", contentType: "Physics", LLM: "GPT-3" }
];
processQuestionsAsync(questions);
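Since your original script is already in Python, here is a rough sketch of the same idea using asyncio and the ollama AsyncClient. The two host ports (11435 and 11436) are placeholders and should match whatever you chose for {port1} and {port2} above.

import asyncio

from ollama import AsyncClient


async def translate_batch(client: AsyncClient, reports: list[str]) -> list[str]:
    # Send one batch of reports, sequentially, to a single ollama instance.
    translations = []
    for report in reports:
        response = await client.generate(
            model="mistral",
            prompt=f"translate this French text into English: {report}",
        )
        translations.append(response["response"].lstrip())
    return translations


async def main():
    reports = ["reports_text_1", "reports_text_2"]  # your real reports go here

    # One client per container; the ports stand in for {port1} and {port2}.
    client1 = AsyncClient(host="http://127.0.0.1:11435")
    client2 = AsyncClient(host="http://127.0.0.1:11436")

    # Split the reports in half and run the two halves concurrently,
    # one half per ollama container / GPU.
    mid = len(reports) // 2
    results = await asyncio.gather(
        translate_batch(client1, reports[:mid]),
        translate_batch(client2, reports[mid:]),
    )
    for translation in results[0] + results[1]:
        print(translation)


if __name__ == "__main__":
    asyncio.run(main())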
The pseudocode above is not the important part. The key point is that you may receive requests from the front-end asking for different LLMs, and a single ollama instance would then have to keep loading and unloading models to switch between them. Try to avoid that.
By the way, in my experience running ollama on an A100, parallel processing may not be a great option anyway: VRAM is plentiful, but compute is always the limiting factor.