How to remove hashtags, @user mentions, and links from a tweet using regular expressions

23 votes
4 answers
64987 views
Asked 2025-04-17 07:40

I need to preprocess tweets in Python. How can I write regular expressions that remove all hashtags, @user mentions, and links from a tweet, each one separately?

For example:

  1. Original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
    • Processed tweet: I really love that shirt at Macy
  2. Original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
    • Processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
  3. Original tweet: I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
    • Processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn

I only need the meaningful words of each tweet. I don't need usernames, links, or any punctuation.

4 Answers

4

This works for your examples. It will fail badly, though, on links embedded without surrounding whitespace:

result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)

Edit:

It also works on embedded links, as long as they are separated from the surrounding text by whitespace.
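For reference, here is the one-liner above as a runnable snippet; `subject` is just the raw tweet text (name taken from the answer):

```python
import re

subject = "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"
result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)
print(" ".join(result.split()))  # collapse the leftover whitespace
```

Note that because `#\S*` consumes the whole token, the hashtag word is dropped entirely (`Macy` does not survive), which differs from the output the question asks for.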

Why reinvent the wheel? Just use the API directly.
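If the tweets come from the Twitter API, each tweet object already carries an `entities` field with the exact character offsets of hashtags, mentions, and URLs (v1.1 format), so nothing has to be guessed with a regex. A minimal sketch; the function name and the sample dict below are made up for illustration:

```python
def strip_by_entities(tweet):
    """Remove hashtags, user mentions, and URLs using API-provided offsets."""
    spans = []
    entities = tweet.get('entities', {})
    for key in ('hashtags', 'user_mentions', 'urls'):
        for ent in entities.get(key, []):
            spans.append(tuple(ent['indices']))
    text = tweet['text']
    # Cut from the end of the string so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return ' '.join(text.split())

# Made-up sample in the v1.1 entities format:
sample = {
    'text': '@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4',
    'entities': {
        'user_mentions': [{'screen_name': 'peter', 'indices': [0, 6]}],
        'hashtags': [{'text': 'Macy', 'indices': [35, 40]}],
        'urls': [{'url': 'http://bet.ly//WjdiW4', 'indices': [42, 63]}],
    },
}
```

This only removes the entities themselves; stray punctuation such as the trailing period would still need a separate pass.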

17

A bit late, but this solution prevents punctuation mistakes on hashtags written without spaces, like #hashtag1,#hashtag2, and it is very simple to implement.

import re
import string

def strip_links(text):
    # Replace every http/https link with ', ' so adjacent words stay separated.
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    # Turn all punctuation except '@' and '#' into spaces, then drop any
    # word that still starts with one of those entity prefixes.
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word and word[0] not in entity_prefixes:
            words.append(word)
    return ' '.join(words)


tests = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]
for t in tests:
    print(strip_all_entities(strip_links(t)))


# I really love that shirt at
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
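As an aside, the punctuation loop in strip_all_entities can be collapsed into a single translation table built once with str.maketrans. A behavior-equivalent sketch; the names here are illustrative, not from the answer:

```python
import string

_KEEP = {'@', '#'}
# Map every punctuation character except '@' and '#' to a space.
_PUNCT_TABLE = str.maketrans({c: ' ' for c in string.punctuation if c not in _KEEP})

def strip_entities_translate(text):
    text = text.translate(_PUNCT_TABLE)  # punctuation -> spaces, @ and # kept
    return ' '.join(w for w in text.split() if w[0] not in _KEEP)
```

Building the table once avoids re-scanning the string for each of the 32 punctuation characters.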
41

The following comes close. Unfortunately, there is no single fully correct way to do this with regular expressions alone. This regex removes URLs (not just http), strips punctuation, usernames, and any non-alphanumeric characters, and separates words with a single space. If you want to parse tweets the way you describe, you need the system to be smarter; that is, some algorithm trained in advance, because tweets have no standard format.

Here is my proposed solution:

' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())

Here are the results for the examples you provided:

>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>> 

And here are some examples showing that it is not perfect:

>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>> 
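The worst failure modes above (the lost hashtag word and the leftover RT marker) can be patched with a few extra passes, though this is still regex guesswork rather than a real tweet parser. The function name and the exact sequence of passes below are my own choices, not part of the answer:

```python
import re

def clean(tweet):
    tweet = re.sub(r"\w+://\S+", " ", tweet)        # URLs with any scheme
    tweet = re.sub(r"@[A-Za-z0-9_]+", " ", tweet)   # user mentions
    tweet = tweet.replace("#", " ")                 # keep the hashtag word itself
    tweet = re.sub(r"[^0-9A-Za-z \t]", " ", tweet)  # remaining punctuation
    words = [w for w in tweet.split() if w != "RT"]  # drop the retweet marker
    return " ".join(words)
```

On the first example this now keeps "Macy", matching the output the question asks for; embedded links like Telegraph.co.ukTitanic still slip through, which is exactly the case a regex alone cannot decide.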
