How can I optimize a function that loops over a list in a pandas DataFrame?

0 votes

2 answers

80 views

Asked 2025-04-14 16:11

I am applying a function to a pandas DataFrame. The code is:

import spacy
from collections import Counter


# Load English language model
nlp = spacy.load("en_core_web_sm")

# Function to filter out only nouns from a list of words
def filter_nouns(words):
    SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
    filtered_nouns = []
    
    # Preprocess the text by removing symbols and splitting into words
    words = [word.translate({ord(SYM): None for SYM in SYMBOLS}).strip() for word in words.split()]
    
    # Process each word and filter only nouns
    filtered_nouns = [token.text for token in nlp(" ".join(words)) if token.pos_ == "NOUN"]
    
    return filtered_nouns



# Apply filtering logic to all rows in the 'NOTE' column
df['filtered_nouns'] = df['NOTE'].apply(lambda x: filter_nouns(x))

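My dataset has 6,400 rows, and each value in df['NOTE'] is a long paragraph converted from an Oracle CLOB column.

The function runs quickly on 5 to 10 rows, but it is very slow on all 6,400.

Is there any way to make it run faster?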

2 Answers

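One simple approach is the built-in multiprocessing module: split the data into several parts and process each part independently.
For more details and examples, see the documentation: https://docs.python.org/3/library/multiprocessing.html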
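
As a rough sketch of that idea (not taken from the linked documentation): split the list of texts into chunks and hand each chunk to a worker process with `multiprocessing.Pool`. The helper names `split_into_chunks`, `process_chunk` and `parallel_apply` are made up for illustration, and the symbol-stripping in `process_chunk` is only a stand-in for the real per-row work. With spaCy, each worker would also need its own loaded model.

```python
from multiprocessing import Pool

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
TABLE = str.maketrans('', '', SYMBOLS)


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(texts):
    """Hypothetical stand-in for the per-row work: strip symbols, then split."""
    return [text.translate(TABLE).split() for text in texts]


def parallel_apply(texts, n_workers=4):
    """Process each chunk in a worker process; results keep the input order."""
    chunks = split_into_chunks(texts, n_workers)
    with Pool(processes=n_workers) as pool:
        results = pool.map(process_chunk, chunks)  # map preserves chunk order
    # Flatten the per-chunk results back into one list, one entry per text
    return [row for chunk in results for row in chunk]
```

In a script, the call (for example `parallel_apply(df['NOTE'].tolist())`) should sit under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on import. Whether this pays off depends on how expensive the per-row work is relative to the cost of starting processes and copying data.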

Answered 2025-04-14 by Python大师

First, remove all the repeated work from your function. In this line:

words = [word.translate({ord(SYM): None for SYM in SYMBOLS}).strip() for word in words.split()]

you rebuild the translation dictionary for every word, and you call translate once per word in the text. It is more efficient to do each of these just once:

tr = str.maketrans('', '', SYMBOLS)
words = words.strip().translate(tr).split()

On my machine this is roughly 50 times faster for a 1,000-word string.
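
A small timing sketch for checking this on your own machine. The sample text and the function names `clean_per_word` and `clean_once` are made up for illustration, and the ratio you see will vary with hardware and Python version.

```python
import timeit

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'


def clean_per_word(text):
    # Approach from the question: a new deletion dict is built for every word
    return [w.translate({ord(s): None for s in SYMBOLS}).strip() for w in text.split()]


# Build the deletion table once, outside the function
TABLE = str.maketrans('', '', SYMBOLS)


def clean_once(text):
    # One translate call over the whole string, then one split
    return text.strip().translate(TABLE).split()


# Hypothetical sample text: 8 words repeated 125 times = 1,000 words
sample = "patient(s) report pain; dosage: 20mg/day, follow-up in week_2 " * 125

# Both versions agree when no word consists solely of symbols; such a word
# becomes '' in clean_per_word but disappears entirely in clean_once.
assert clean_per_word(sample) == clean_once(sample)

slow = timeit.timeit(lambda: clean_per_word(sample), number=20)
fast = timeit.timeit(lambda: clean_once(sample), number=20)
print(f"per-word: {slow:.4f}s, once: {fast:.4f}s, ratio: {slow / fast:.1f}x")
```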

Next, you join all the words inline in each call to nlp. You only need to do that once:

text = ' '.join(words)
filtered_nouns = [token.text for token in nlp(text) if token.pos_ == "NOUN"]

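Note, though, that you are only splitting on whitespace, so you can skip the split and join entirely and pass the cleaned string straight to nlp. Putting it all together: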

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
# Build the deletion table once, at import time, rather than on every call
TR = str.maketrans('', '', SYMBOLS)

def filter_nouns(text):
    # Remove the unwanted symbols from the whole text in a single pass
    cleaned = text.strip().translate(TR)

    # Tag the cleaned text with spaCy (nlp is loaded as in the question)
    # and keep only the tokens tagged as nouns
    return [token.text for token in nlp(cleaned) if token.pos_ == "NOUN"]

Finally, note that .apply(lambda x: filter_nouns(x)) is equivalent to .apply(filter_nouns).
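
One possible further step, offered as an unmeasured sketch: spaCy's `nlp.pipe` processes an iterable of texts in batches, instead of making one `nlp(...)` call per row through `.apply`. The function name `filter_nouns_batched` and the `batch_size` default are assumptions for illustration. `nlp` is passed in as an argument so the sketch does not depend on a particular model.

```python
SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
TABLE = str.maketrans('', '', SYMBOLS)


def filter_nouns_batched(nlp, texts, batch_size=50):
    """Return one list of noun strings per input text, in input order.

    nlp is a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm").
    The batch_size default is an arbitrary assumption; tune it for your data.
    """
    # Lazily strip symbols from each text before it reaches the pipeline
    cleaned = (text.strip().translate(TABLE) for text in texts)
    return [
        [token.text for token in doc if token.pos_ == "NOUN"]
        for doc in nlp.pipe(cleaned, batch_size=batch_size)
    ]
```

Usage would look something like `df['filtered_nouns'] = filter_nouns_batched(nlp, df['NOTE'].tolist())`. `nlp.pipe` also accepts an `n_process` argument, which may be a simpler route to the multiprocessing idea from the other answer.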

Answered 2025-04-14 by Python大师
