How to remove blank words from a DataFrame after cleaning the text of Yelp reviews

Posted 2024-06-17 15:46:26


This is what I use to clean the text:

# reference: https://github.com/GongtingPeng/Spark
import re
import string

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# remove punctuation, digits, and \r, \t, \n characters (spaces are left as they are)
def remove_punct(text):
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    nopunct = regex.sub("", text)
    return nopunct

# binarize rating: 4 stars or more -> 1, otherwise 0
def convert_rating(rating):
    rating = int(rating)
    if rating >= 4:
        return 1
    else:
        return 0

# wrap the helpers as UDFs
punct_remover = udf(lambda x: remove_punct(x))
rating_convert = udf(lambda x: convert_rating(x))

# apply to the raw review data
review_df = review.select('review_id', punct_remover('text'), rating_convert('stars'))

# rename the auto-generated UDF columns and cast the label to an integer
review_df = review_df.withColumnRenamed('<lambda>(text)', 'text')\
                     .withColumn('label', review_df["<lambda>(stars)"].cast(IntegerType()))\
                     .drop('<lambda>(stars)')\
                     .limit(1000000)
review_df.show(5)

This is what I use to remove stopwords:

from pyspark.ml.feature import Tokenizer, StopWordsRemover

# split the cleaned text into words
tok = Tokenizer(inputCol="text", outputCol="words")
review_tokenized = tok.transform(review_df)

# remove stop words
stopword_rm = StopWordsRemover(inputCol='words', outputCol='words_nsw')
review_tokenized = stopword_rm.transform(review_tokenized)

review_tokenized.show(5)
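
A likely source of blank tokens at this step, sketched in plain Python so it runs without Spark: as far as I know, `Tokenizer` lowercases the text and splits on single whitespace characters, so two adjacent spaces produce an empty token. The helper `split_like_tokenizer` and the sample sentence are made up for illustration, and `re.split` only approximates what Spark does.

```python
import re
import string

# Same character class as remove_punct in the question.
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')


def remove_punct(text):
    # Deletes punctuation and digits but leaves the surrounding spaces in place.
    return PUNCT_RE.sub("", text)


def split_like_tokenizer(text):
    # Hypothetical stand-in for Tokenizer: lowercase, then split on ONE
    # whitespace character at a time (an assumption about its behavior).
    return re.split(r"\s", text.lower())


sample = "Great food - 5 stars!"
cleaned = remove_punct(sample)
tokens = split_like_tokenizer(cleaned)

print(repr(cleaned))  # the removed "-" and "5" leave a run of spaces behind
print(tokens)         # the run of spaces shows up as empty strings
```

Under that assumption the regex itself is not wrong; it simply leaves runs of spaces where standalone punctuation or numbers used to be, and those runs become empty tokens.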

However, after exploding the words I still get counts for blank values:

from pyspark.sql.functions import explode

dfwords_exploded = review_tokenized.withColumn('words', explode('words_nsw'))
dfwords_exploded.show(50)


When I return the word counts, the blank value has the highest count. I want to remove it so that I only count actual words:
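
One way to keep blanks out of the counts, sketched in plain Python with made-up tokens: drop empty or whitespace-only strings before counting. The helper `count_real_words` is hypothetical. On the Spark side the analogous step would presumably be a filter on the exploded column, for example `dfwords_exploded.filter(dfwords_exploded.words != '')`, which has not been run against this data.

```python
from collections import Counter


def count_real_words(tokens):
    # Keep only tokens that contain at least one non-whitespace character.
    return Counter(t for t in tokens if t.strip())


# Made-up tokens standing in for the exploded 'words' column.
tokens = ["great", "food", "", "", "stars", "great", ""]
counts = count_real_words(tokens)

print(counts.most_common(2))
```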


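I think the problem lies in the regular expression in my initial text-cleaning code, but I'm not sure where. It takes quite a long time to run, so any help would be greatly appreciated.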
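
A minimal sketch of one possible fix in the cleaning step, assuming the blanks come from runs of spaces left behind by the regex: replace the unwanted characters with a space instead of deleting them (so words separated only by a newline are not glued together), then collapse whitespace runs and trim the ends. `remove_punct_normalized` is a hypothetical name, not part of the original code.

```python
import re
import string

# Same character class as in the question, plus a pattern for whitespace runs.
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
SPACE_RE = re.compile(r"\s+")


def remove_punct_normalized(text):
    # Hypothetical variant of remove_punct: substitute a space for each
    # unwanted character, then squeeze repeated whitespace and trim.
    no_punct = PUNCT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", no_punct).strip()


sample = "Great food - 5 stars!\nWould return."
cleaned = remove_punct_normalized(sample)

print(cleaned)
print(cleaned.split(" "))  # no empty strings remain
```

It could be wrapped the same way as before, e.g. `udf(lambda x: remove_punct_normalized(x))`. One trade-off: contractions such as "don't" become "don t" rather than "dont", so substituting a space only for `\r`, `\t`, and `\n` may be preferable. Swapping `Tokenizer` for `RegexTokenizer`, whose default pattern splits on runs of whitespace, may be another option that avoids rerunning the cleaning step.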