Using the spaCy parser on a Pandas DataFrame with multiprocessing

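from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd


def parallelize_dataframe(df, func, num_partitions):
    # Split the DataFrame into chunks, one per worker process
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    # Apply func to each chunk in parallel and stitch the results back together
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df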

   sepal_length species  length_of_word species_parsed
0           5.1  setosa               6             ()
1           4.9  setosa               6             ()
2           4.7  setosa               6             ()

1 Answer

User

#1 · Posted 2024-04-20 13:05:31

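spaCy is highly optimized and can handle the multiprocessing for you. So I think your best option is to take the data out of the DataFrame and pass it to the spaCy pipeline as a list, rather than trying to use .apply directly.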

You then need to collate the parse results and put them back into the DataFrame.

So, in your example, you could use something like the following:

# Assumes `nlp` is an already loaded spaCy pipeline and `df` is the
# DataFrame from the question.
tokens = []
lemma = []
pos = []

# n_process sets the number of worker processes spaCy uses
for doc in nlp.pipe(df['species'].astype(str).values, batch_size=50,
                    n_process=3):
    # Check that the dependency parse is present on this doc
    if doc.has_annotation("DEP"):
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # Keep the result lists the same length as the original DataFrame
        # by adding placeholders in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['species_tokens'] = tokens
df['species_lemma'] = lemma
df['species_pos'] = pos

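This approach works well on small datasets, but it holds every parse result in memory, so it is not a great fit if you want to process large amounts of text.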
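One way to ease the memory pressure, offered here only as a rough sketch and not as part of the original answer, is to parse the column a chunk at a time and hand each chunk's results to the caller as soon as they are ready. The helper name `parse_in_chunks`, its parameters, and the sample data are all hypothetical. `spacy.blank("en")` is used so the example runs without downloading a model, which means only tokenization is performed.

```python
import pandas as pd
import spacy


def parse_in_chunks(df, column, nlp, chunk_size=1000, batch_size=50):
    """Yield one small DataFrame of token lists per chunk of `df`.

    Hypothetical helper: only one chunk's results are built at a time,
    so the caller can write each piece to disk and discard it.
    """
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        texts = chunk[column].astype(str).tolist()
        tokens = [[t.text for t in doc]
                  for doc in nlp.pipe(texts, batch_size=batch_size)]
        # Reuse the chunk's index so the results line up with the source rows
        yield pd.DataFrame({column + "_tokens": tokens}, index=chunk.index)


# Tokenizer-only pipeline (assumption: swap in a full model for lemma/POS)
nlp = spacy.blank("en")
df = pd.DataFrame({"species": ["setosa", "Iris versicolor", "virginica"]})

# For the demo the parts are collected and joined back; on a large dataset
# each part would be written out instead of being kept in a list.
parts = list(parse_in_chunks(df, "species", nlp, chunk_size=2))
result = df.join(pd.concat(parts))
```

In real use you would probably loop over the generator and append each part to a file, so that memory use depends on `chunk_size` and not on the size of the whole DataFrame.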
