Removing paywall language from text (pandas)

-1 votes
1 answer
78 views
Asked 2025-04-14 15:54

I'm doing some preprocessing on my dataset. Specifically, I want to strip paywall content from the text (shown in bold below), but the output I get is always an empty string.

Here is a sample text:

To curb the invasive bush honeysuckle, or Amur honeysuckle, currently taking over forests in Missouri and Kansas, Debbie Neff of Excelsior Springs organized a… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.

Here is my custom function:

import re
import string
import nltk
from nltk.corpus import stopwords

# function to detect paywall-related text
def detect_paywall(text):
    paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
    for keyword in paywall_keywords:
        if re.search(r'\b{}\b'.format(keyword), text, flags=re.IGNORECASE):
            return True
    return False

# function for text preprocessing
def preprocess_text(text):
    # Check if the text contains paywall-related content
    if detect_paywall(text):
        # Remove paywall-related sentences or language from the text
        sentences = nltk.sent_tokenize(text)
        cleaned_sentences = [sentence for sentence in sentences if not detect_paywall(sentence)]
        cleaned_text = ' '.join(cleaned_sentences)
        return cleaned_text.strip()  # Remove leading/trailing whitespace

    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in stripped if word.isalpha() and word not in stop_words]
    return ' '.join(words)

I tried changing the list of keywords to detect, without success. I did find that removing the word "subscribers" strips the second paywall sentence, but that is not an acceptable fix, since other parts of the paywall text remain.

The function is also unreliable: it works on the following text (the paywall content is removed), but not on the text above.

Of the hundreds of thousands of high school wrestlers, only a small percentage know what it's like to win a state title. This person is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.

1 Answer

3

This approach avoids a for loop by:

  • first splitting text into phrases (a list of sentences),
  • then filtering on all the keywords at once with a regex filter,
  • and finally reassembling text, dropping any sentence that contains at least one of the keywords.
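The steps above can be sketched as a single helper (the function name strip_paywall is my own, and re.escape is an addition beyond the answer's code, so that keywords containing regex metacharacters stay literal):

```python
import re

def strip_paywall(text, keywords):
    # One alternation pattern covering every keyword; '.*' lets the
    # keyword appear anywhere in the phrase, since match() anchors at the start.
    patt = re.compile('|'.join('.*' + re.escape(k) for k in keywords))
    # Split into phrases on '.', drop every phrase matching a keyword, rejoin.
    phrases = text.split(sep='.')
    return '.'.join(p for p in phrases if not patt.match(p))

keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
text = ("Of the hundreds of thousands of high school wrestlers, only a small "
        "percentage know what it's like to win a state title. {{Elided}} is part "
        "of that percentage. The Richmond junior joined that group by winning… "
        "Premium Content is available to subscribers only. Please login here to "
        "access content or go here to purchase a subscription.")
print(strip_paywall(text, keywords))
```

The same steps are walked through one at a time below.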

For now this approach ignores the bold formatting, and it uses a plain str.split() rather than re.split() or nltk, so it cannot split on the '…' ellipsis character.

The input:

import re

text = "Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription."
paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]

The filter pattern:

patt = re.compile('|'.join(['.*' + k for k in paywall_keywords]))

'.*login|.*subscription|.*purchase a subscription|.*subscribers'

Split the text into sentences:

phrases = text.split(sep='.')

['Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title',
 ' {{Elided}} is part of that percentage',
 ' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
 ' Please login here to access content or go here to purchase a subscription',
 '']

Find the matches:

found = list(filter(patt.match, phrases))

[' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
 ' Please login here to access content or go here to purchase a subscription']

Remove those sentences and reassemble the text:

'.'.join([p for p in phrases if p not in found])

'Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage.'
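For the first example in the question, str.split('.') fails because the paywall text is glued on after the one-character '…' ellipsis, leaving the keywords inside the same phrase as the real content. Swapping in re.split with a character class that treats both '.' and '…' as boundaries is one way around that. Note this is a sketch, not a drop-in fix: rejoining with '.' replaces the original '…' delimiter.

```python
import re

paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
patt = re.compile('|'.join('.*' + re.escape(k) for k in paywall_keywords))

text = ("To curb the invasive bush honeysuckle, or Amur honeysuckle, currently "
        "taking over forests in Missouri and Kansas, Debbie Neff of Excelsior "
        "Springs organized a… Premium Content is available to subscribers only. "
        "Please login here to access content or go here to purchase a subscription.")

# Split on either '.' or the one-character ellipsis '…', so the paywall
# sentence attached after '…' becomes its own phrase and can be filtered out.
phrases = re.split(r'[.…]', text)
cleaned = '.'.join(p for p in phrases if not patt.match(p))
print(cleaned)
```

If the source text instead uses three literal dots ('...'), the character class would need adjusting, e.g. splitting on the pattern r'\.{3}|[.…]'.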
