How to remove hashtags, @user mentions, and links from a tweet using regular expressions

23 votes
4 answers
64987 views
Asked 2025-04-17 07:40

I need to preprocess tweets in Python. How can I write regular expressions that remove all hashtags, @user mentions, and links from a tweet, each one separately?

For example:

  1. Original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
    • Processed tweet: I really love that shirt at Macy
  2. Original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
    • Processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
  3. Original tweet: I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
    • Processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn

I only need the meaningful words of each tweet. I don't need usernames, links, or any punctuation.

4 Answers

4

This works for your examples. It will fail badly, though, on links embedded without surrounding whitespace:

result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)

Edit:

It also works on embedded links, as long as they are separated from the surrounding text by whitespace.
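For reference, here is the one-liner above as a runnable snippet; `subject` is just the raw tweet text (name taken from the answer):

```python
import re

subject = "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"
result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)
print(" ".join(result.split()))  # collapse the leftover whitespace
```

Note that because `#\S*` consumes the whole token, the hashtag word is dropped entirely (`Macy` does not survive), which differs from the output the question asks for.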

Why reinvent the wheel? Just use the API directly.
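If the tweets come from the Twitter API, each tweet object already carries an `entities` field with the exact character offsets of hashtags, mentions, and URLs (v1.1 format), so nothing has to be guessed with a regex. A minimal sketch; the function name and the sample dict below are made up for illustration:

```python
def strip_by_entities(tweet):
    """Remove hashtags, user mentions, and URLs using API-provided offsets."""
    spans = []
    entities = tweet.get('entities', {})
    for key in ('hashtags', 'user_mentions', 'urls'):
        for ent in entities.get(key, []):
            spans.append(tuple(ent['indices']))
    text = tweet['text']
    # Cut from the end of the string so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return ' '.join(text.split())

# Made-up sample in the v1.1 entities format:
sample = {
    'text': '@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4',
    'entities': {
        'user_mentions': [{'screen_name': 'peter', 'indices': [0, 6]}],
        'hashtags': [{'text': 'Macy', 'indices': [35, 40]}],
        'urls': [{'url': 'http://bet.ly//WjdiW4', 'indices': [42, 63]}],
    },
}
```

This only removes the entities themselves; stray punctuation such as the trailing period would still need a separate pass.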

17

A bit late, but this solution prevents punctuation mistakes on hashtags written without spaces, like #hashtag1,#hashtag2, and it is very simple to implement.

import re
import string

def strip_links(text):
    # Replace every http/https link with ', ' so adjacent words stay separated.
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    # Turn all punctuation except '@' and '#' into spaces, then drop any
    # word that still starts with one of those entity prefixes.
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word and word[0] not in entity_prefixes:
            words.append(word)
    return ' '.join(words)


tests = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]
for t in tests:
    print(strip_all_entities(strip_links(t)))


# I really love that shirt at
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
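As an aside, the punctuation loop in strip_all_entities can be collapsed into a single translation table built once with str.maketrans. A behavior-equivalent sketch; the names here are illustrative, not from the answer:

```python
import string

_KEEP = {'@', '#'}
# Map every punctuation character except '@' and '#' to a space.
_PUNCT_TABLE = str.maketrans({c: ' ' for c in string.punctuation if c not in _KEEP})

def strip_entities_translate(text):
    text = text.translate(_PUNCT_TABLE)  # punctuation -> spaces, @ and # kept
    return ' '.join(w for w in text.split() if w[0] not in _KEEP)
```

Building the table once avoids re-scanning the string for each of the 32 punctuation characters.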
41

The following comes close. Unfortunately, there is no single fully correct way to do this with regular expressions alone. This regex removes URLs (not just http), strips punctuation, usernames, and any non-alphanumeric characters, and separates words with a single space. If you want to parse tweets the way you describe, you need the system to be smarter; that is, some algorithm trained in advance, because tweets have no standard format.

Here is my proposed solution:

' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())

Here are the results for the examples you provided:

>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>> 

And here are some examples showing that it is not perfect:

>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>> 
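The worst failure modes above (the lost hashtag word and the leftover RT marker) can be patched with a few extra passes, though this is still regex guesswork rather than a real tweet parser. The function name and the exact sequence of passes below are my own choices, not part of the answer:

```python
import re

def clean(tweet):
    tweet = re.sub(r"\w+://\S+", " ", tweet)        # URLs with any scheme
    tweet = re.sub(r"@[A-Za-z0-9_]+", " ", tweet)   # user mentions
    tweet = tweet.replace("#", " ")                 # keep the hashtag word itself
    tweet = re.sub(r"[^0-9A-Za-z \t]", " ", tweet)  # remaining punctuation
    words = [w for w in tweet.split() if w != "RT"]  # drop the retweet marker
    return " ".join(words)
```

On the first example this now keeps "Macy", matching the output the question asks for; embedded links like Telegraph.co.ukTitanic still slip through, which is exactly the case a regex alone cannot decide.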
