Removing paywall language from text (pandas)

-1 votes
1 answer
78 views
Asked 2025-04-14 15:54

I'm doing some preprocessing on my dataset. Specifically, I want to strip paywall content from the text (shown in bold below), but the output I get is always an empty string.

Here is a sample text:

To curb the invasive bush honeysuckle, or Amur honeysuckle, currently taking over forests in Missouri and Kansas, Debbie Neff of Excelsior Springs organized a… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.

Here is my custom function:

import re
import string
import nltk
from nltk.corpus import stopwords

# function to detect paywall-related text
def detect_paywall(text):
    paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
    for keyword in paywall_keywords:
        if re.search(r'\b{}\b'.format(keyword), text, flags=re.IGNORECASE):
            return True
    return False

# function for text preprocessing
def preprocess_text(text):
    # Check if the text contains paywall-related content
    if detect_paywall(text):
        # Remove paywall-related sentences or language from the text
        sentences = nltk.sent_tokenize(text)
        cleaned_sentences = [sentence for sentence in sentences if not detect_paywall(sentence)]
        cleaned_text = ' '.join(cleaned_sentences)
        return cleaned_text.strip()  # Remove leading/trailing whitespace

    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in stripped if word.isalpha() and word not in stop_words]
    return ' '.join(words)

I tried changing the list of keywords to detect, without success. I did find that removing the word "subscribers" strips the second paywall sentence, but that is not an acceptable fix, since other parts of the paywall text remain.

The function is also unreliable: it works on the following text (the paywall content is removed), but not on the text above.

Of the hundreds of thousands of high school wrestlers, only a small percentage know what it's like to win a state title. This person is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.

1 Answer

3

This approach avoids a for loop by:

  • first splitting text into phrases (a list of sentences),
  • then filtering on all the keywords at once with a regex filter,
  • and finally reassembling text, dropping any sentence that contains at least one of the keywords.
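The steps above can be sketched as a single helper (the function name strip_paywall is my own, and re.escape is an addition beyond the answer's code, so that keywords containing regex metacharacters stay literal):

```python
import re

def strip_paywall(text, keywords):
    # One alternation pattern covering every keyword; '.*' lets the
    # keyword appear anywhere in the phrase, since match() anchors at the start.
    patt = re.compile('|'.join('.*' + re.escape(k) for k in keywords))
    # Split into phrases on '.', drop every phrase matching a keyword, rejoin.
    phrases = text.split(sep='.')
    return '.'.join(p for p in phrases if not patt.match(p))

keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
text = ("Of the hundreds of thousands of high school wrestlers, only a small "
        "percentage know what it's like to win a state title. {{Elided}} is part "
        "of that percentage. The Richmond junior joined that group by winning… "
        "Premium Content is available to subscribers only. Please login here to "
        "access content or go here to purchase a subscription.")
print(strip_paywall(text, keywords))
```

The same steps are walked through one at a time below.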

For now this approach ignores the bold formatting, and it uses a plain str.split() rather than re.split() or nltk, so it cannot split on the '…' ellipsis character.

The input:

import re

text = "Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription."
paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]

The filter pattern:

patt = re.compile('|'.join(['.*' + k for k in paywall_keywords]))

'.*login|.*subscription|.*purchase a subscription|.*subscribers'

Split the text into sentences:

phrases = text.split(sep='.')

['Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title',
 ' {{Elided}} is part of that percentage',
 ' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
 ' Please login here to access content or go here to purchase a subscription',
 '']

Find the matches:

found = list(filter(patt.match, phrases))

[' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
 ' Please login here to access content or go here to purchase a subscription']

Remove those sentences and reassemble the text:

'.'.join([p for p in phrases if p not in found])

'Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage.'
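For the first example in the question, str.split('.') fails because the paywall text is glued on after the one-character '…' ellipsis, leaving the keywords inside the same phrase as the real content. Swapping in re.split with a character class that treats both '.' and '…' as boundaries is one way around that. Note this is a sketch, not a drop-in fix: rejoining with '.' replaces the original '…' delimiter.

```python
import re

paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
patt = re.compile('|'.join('.*' + re.escape(k) for k in paywall_keywords))

text = ("To curb the invasive bush honeysuckle, or Amur honeysuckle, currently "
        "taking over forests in Missouri and Kansas, Debbie Neff of Excelsior "
        "Springs organized a… Premium Content is available to subscribers only. "
        "Please login here to access content or go here to purchase a subscription.")

# Split on either '.' or the one-character ellipsis '…', so the paywall
# sentence attached after '…' becomes its own phrase and can be filtered out.
phrases = re.split(r'[.…]', text)
cleaned = '.'.join(p for p in phrases if not patt.match(p))
print(cleaned)
```

If the source text instead uses three literal dots ('...'), the character class would need adjusting, e.g. splitting on the pattern r'\.{3}|[.…]'.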
