Removing stop words from a list of paragraphs


I have a list of paragraphs and I want to remove the stop words from every paragraph. Any ideas?

I first split each paragraph into words, then check every word against the stopword set and keep it only if it is not a stop word. This works for a single paragraph, but when I run it over the whole list of paragraphs it produces one list containing the words of all paragraphs lumped together, instead of one list per paragraph.

import nltk as npl   # assuming nltk is imported under this alias

g = []
h = []
for i in f[0:2]:                       # f is the list of paragraphs
    word_token = npl.tokenize.word_tokenize(i)
    for j in word_token:
        if j not in z:                 # z is the stopword set (built below)
            g.append(j)
        h.append(g)                    # appends the same, never-reset list on every token

Example

Y="'Take a low budget, inexperienced actors doubling as production staff\x97 as well as limited facilities\x97and you can\'t expect much more than "Time Chasers" gives you, but you can absolutely expect a lot less. This film represents a bunch of good natured friends and neighbors coming together to collaborate on an interesting project. If your cousin had been one of those involved, you would probably think to yourself, "ok, this movie is terrible... but a really good effort." For all the poorly delivered dialog and ham-fisted editing, "Time Chasers" has great scope and ambition... and one can imagine it was necessary to shoot every scene in only one or two takes. So, I\'m suggesting people cut "Time Chasers" some slack before they cut in the jugular. That said, I\'m not sure I can ever forgive the pseudo-old lady from the grocery store for the worst delivery every wrenched from the jaws of a problematic script.'"

import numpy as np   # assuming np is numpy

z = set(npl.corpus.stopwords.words("english"))   # English stopword set
x = []
word_token = npl.tokenize.word_tokenize(y)
for i in word_token:
    if i not in z:
        x.append(i)

print(np.array(x))

Output

['Take' 'low' 'budget' ',' 'inexperienced' 'actors' 'doubling'
 'production' 'staff\x97' 'well' 'limited' 'facilities\x97and' 'ca' "n't"
 'expect' 'much' '``' 'Time' 'Chasers' "''" 'gives' ',' 'absolutely'
 'expect' 'lot' 'less' '.' 'This' 'film' 'represents' 'bunch' 'good'
 'natured' 'friends' 'neighbors' 'coming' 'together' 'collaborate'
 'interesting' 'project' '.' 'If' 'cousin' 'one' 'involved' ',' 'would'
 'probably' 'think' ',' '``' 'ok' ',' 'movie' 'terrible' '...' 'really'
 'good' 'effort' '.' "''" 'For' 'poorly' 'delivered' 'dialog' 'ham-fisted'
 'editing' ',' '``' 'Time' 'Chasers' "''" 'great' 'scope' 'ambition' '...'
 'one' 'imagine' 'necessary' 'shoot' 'every' 'scene' 'one' 'two' 'takes'
 '.' 'So' ',' 'I' "'m" 'suggesting' 'people' 'cut' '``' 'Time' 'Chasers'
 "''" 'slack' 'cut' 'jugular' '.' 'That' 'said' ',' 'I' "'m" 'sure' 'I'
 'ever' 'forgive' 'pseudo-old' 'lady' 'grocery' 'store' 'worst' 'delivery'
 'every' 'wrenched' 'jaws' 'problematic' 'script' '.']

Like this. I want the same kind of output for the whole list of paragraphs, i.e. one cleaned list per paragraph.
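For reference, a minimal sketch of one way to get that (assuming npl is nltk imported under that alias, f is the list of paragraphs and z is the stopword set from the example above) is to reset g inside the outer loop and append it once per paragraph:

h = []
for paragraph in f[0:2]:
    g = []                                        # fresh token list for this paragraph
    for token in npl.tokenize.word_tokenize(paragraph):
        if token not in z:
            g.append(token)
    h.append(g)                                   # one cleaned list per paragraph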


1 Answer

Given a list:

doc_set = ['my name is omprakash', 'my name is rajesh']

Run:

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')          # word characters only, so punctuation is dropped
en_stop = set(stopwords.words('english'))

cleaned_texts = []

for doc in doc_set:
    tokens = tokenizer.tokenize(doc)
    stopped_tokens = [t for t in tokens if t not in en_stop]
    cleaned_texts.append(stopped_tokens)

Output:

[['name', 'omprakash'], ['name', 'rajesh']]
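For what it's worth, the same loop can also be written as a nested list comprehension (reusing the tokenizer and en_stop defined above):

# Equivalent to the loop above: one cleaned token list per document.
cleaned_texts = [[t for t in tokenizer.tokenize(doc) if t not in en_stop]
                 for doc in doc_set]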

If you put them into a dataframe, you can see:

import pandas as pd
df = pd.DataFrame()
df['unclean_text'] = doc_set
df['clean_text'] = cleaned_texts

Output:

           unclean_text         clean_text
0  my name is omprakash  [name, omprakash]
1     my name is rajesh     [name, rajesh]

Note: "my" is a stop word, so it is excluded.
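One caveat: the membership check is case-sensitive, and NLTK's English stopword list is all lowercase, which is why capitalized words such as 'This', 'If' and 'I' survive in the question's output above. A small sketch of a case-insensitive variant (again reusing tokenizer and en_stop):

cleaned_texts_ci = []
for doc in doc_set:
    tokens = tokenizer.tokenize(doc)
    # lowercase each token only for the stopword check, so "My" is treated like "my"
    cleaned_texts_ci.append([t for t in tokens if t.lower() not in en_stop])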
