Removing stop words from a dataset

Posted 2024-06-17 12:07:31


I am trying to remove stopwords from a pandas dataset in which each row holds a tokenized list of words. The word lists look like this:

['Uno', ',', 'dos', 'One', ',', 'two', ',', 'tres', ',', 'quatro', 'Yes', ',', 'Wooly', 'Bully', 'Watch', 'it', 'now', ',', 'watch', 'it', 'Here', 'he', 'come', ',', 'here', 'he', 'come', 'Watch', 'it', 'now', ',', 'he', 'git', 'ya', 'Matty', 'told', 'Hattie', 'about', 'a', 'thing', 'she', 'saw', 'Had', 'two', 'big', 'horns', 'and', 'a', 'wooly', 'jaw', 'Wooly', 'Bully', ',', 'Wooly', 'Bully', ',', 'yes', 'drive', 'Wooly', 'Bully', ',', 'Wooly', 'Bully', ',', 'Wooly', 'Bully', 'Hattie', 'told', 'Matty', '``', 'Let', "'s", 'do', "n't", 'take', 'no', 'chance', 'Let', "'s", 'not', 'be', 'L-seven', ',', 'come', 'and', 'learn', 'to', 'dance', "''", 'Wooly', 'Bully', ',', 'Wooly', 'Bully', 'Wooly', 'Bully', ',', 'Wooly', 'Bully', ',', 'Wooly', 'Bully', 'Watch', 'it', 'now', ',', 'watch', 'it', ',', 'watch', 'it', ',', 'watch', 'it', 'Yeah', 'Yeah', ',', 'drive', ',', 'drive', ',', 'drive', 'Matty', 'told', 'Hattie', '``', 'That', "'s", 'the', 'thing', 'to', 'do', 'Get', 'you', 'someone', 'really', 'pull', 'the', 'wool', 'with', 'you', "''", 'Wooly', 'Bully', ',', 'Wooly', 'Bully', 'Wooly', 'Bully', ',', 'Wooly', 'Bully', ',', 'Wooly', 'Bully', 'Watch', 'it', 'now', ',', 'watch', 'it', ',', 'here', 'he', 'come', 'You', 'got', 'it', ',', 'you', 'got', 'it']

To do this, I use the following code.
ret = df['tokenized_lyric'].apply(lambda x: [item for item in x if item.lower() not in stops])

print(ret)

This gives me lists like the following:

e0       [n,  ,  ,  , n, e,  ,  , w,  ,  , r, e,  ,  , ...
2165    [ , n, r,  ,  , r,  , r,  , l,  , p, r,  ,  , ...

It seems to remove almost all of the characters. How can I make it remove only the stop words I have defined?


2 answers
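from nltk.corpus import stopwords

# stop words from the nltk library (run nltk.download('stopwords') once beforehand)
# use a separate name so the imported stopwords module is not overwritten
nltk_stopwords = stopwords.words('english')

# user defined stop words
custom_stopwords = ['hey', 'hello']

# complete list of stop words
complete_stopwords = nltk_stopwords + custom_stopwords

# split each lyric string into words and keep only the words that are not stop words
df['lyrics_clean'] = df['lyrics'].apply(lambda x: [word for word in x.split() if word not in complete_stopwords])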
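
A small variation on the snippet above, sketched under a couple of assumptions: the stop words are kept in a set so that each membership test is a hash lookup, and the comparison uses the lowercased word so that 'The' is dropped as well as 'the'. The short stop list is only a stand-in for the NLTK list so that the example runs without any download, and `remove_stop_words` is a hypothetical helper name.

```python
# Stand-in stop list (assumption); in practice this would be stopwords.words('english').
base_stop_words = ['the', 'and', 'a', 'it']
custom_stop_words = ['hey', 'hello']

# A set makes each "in" test a hash lookup instead of a scan through a list.
stop_set = set(base_stop_words) | set(custom_stop_words)

def remove_stop_words(text):
    """Split a string on whitespace and drop stop words, ignoring case."""
    return [word for word in text.split() if word.lower() not in stop_set]

cleaned = remove_stop_words('Hey watch it now The Wooly Bully')
print(cleaned)  # ['watch', 'now', 'Wooly', 'Bully']
```

With a DataFrame, the same function could be passed to apply, for example df['lyrics'].apply(remove_stop_words).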

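Your list comprehension is iterating over the characters of a string. Instead, split the string into words with split() and iterate over those word tokens, lowercasing each one for the comparison, as shown here: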

print([i for i in 'hi there']) #iterating over the characters
print([i for i in 'hi there'.split()]) #iterating over the words

['h', 'i', ' ', 't', 'h', 'e', 'r', 'e']
['hi', 'there']

Try this lambda function:

s = 'Hello World And Underworld'

stops = ['and','or','the']

f = lambda x: [item for item in x.split() if item.lower() not in stops]
f(s)
['Hello', 'World', 'Underworld']

Applied to your code, this becomes:

df['tokenized_lyric'].apply(lambda x: [item for item in x.split() if item.lower() not in stops])
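
One caveat, offered as a guess from the output in the question: single characters come out when each cell is a string rather than a list, which can happen when a column of lists is saved to CSV and read back. If the cell is the text of a Python list literal, split() would produce pieces such as "['Uno'," with the brackets and quotes still attached, so parsing the cell with ast.literal_eval first may be a better fit. A minimal sketch, assuming that situation; `parse_tokens` and `drop_stops` are hypothetical helper names and the stop list is illustrative.

```python
import ast

stops = {'it', 'he', 'a', 'and'}  # illustrative stop list (assumption)

def parse_tokens(cell):
    """Return a list of tokens whether the cell is a list or its string form."""
    if isinstance(cell, str):
        # literal_eval parses the text "['a', 'b']" into a real list
        # without executing arbitrary code.
        return ast.literal_eval(cell)
    return list(cell)

def drop_stops(cell):
    """Remove stop words (case-insensitively) from one cell."""
    return [tok for tok in parse_tokens(cell) if tok.lower() not in stops]

cell = "['Watch', 'it', 'now', ',', 'here', 'he', 'come']"
result = drop_stops(cell)
print(result)  # ['Watch', 'now', ',', 'here', 'come']
```

With pandas this could be applied as df['tokenized_lyric'].apply(drop_stops); checking type(df['tokenized_lyric'].iloc[0]) first would show whether the cells are strings or lists.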
