How to remove all the hashtags, @usernames and links from a tweet using regular expressions
I need to process tweets in Python. How do I write regular expressions that remove, respectively, all the hashtags, @usernames, and links from a tweet?
For example:
- Original tweet:
@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
- Processed tweet:
I really love that shirt at Macy
- Original tweet:
@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
- Processed tweet:
Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
- Original tweet:
I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
- Processed tweet:
I am at Starbucks 7419 3rd ave at 75th Brooklyn
I just need the meaningful words from each tweet; I don't want the usernames, the links, or any punctuation.
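For what it's worth, this is roughly the shape I have in mind, one substitution per entity type (just a rough sketch; the exact patterns are what I am unsure about):

import re

tweet = "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"
no_users = re.sub(r"@\w+", "", tweet)        # drop @usernames
no_links = re.sub(r"http\S+", "", no_users)  # drop links
no_tags = re.sub(r"#", "", no_links)         # drop the '#' but keep the word
# no_tags == ' I really love that shirt at Macy. '  (punctuation still left over)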
4 Answers
4
This works on your examples. It will only fail, and fail badly, when a link runs directly into the surrounding text with no whitespace.
result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)
Addition:
It also works on links embedded in the text, as long as there is whitespace around them.
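For instance, on the first sample tweet (a quick sketch; the second pass at the end is just one way to meet the "no punctuation" requirement and is not part of the one-liner itself):

import re

subject = "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"
result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)
# result == ' I really love that shirt at  '
clean = ' '.join(re.sub(r"[^\w\s]", ' ', result).split())
# clean == 'I really love that shirt at'

Note that #\S* eats the whole hashtag; substitute a bare # for that alternative if you would rather keep "Macy".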
Or just use the API directly; why reinvent the wheel?
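If you do have full tweet payloads from the API, this becomes trivial: the classic v1.1 JSON carries an entities block with character indices for every hashtag, mention, and URL, so nothing needs to be re-parsed. A sketch of that route, assuming the v1.1 payload shape (strip_entities_api is an illustrative name):

def strip_entities_api(tweet):
    # Collect the [start, end) index pairs the API reports for each entity.
    text = tweet['text']
    spans = []
    for key in ('hashtags', 'user_mentions', 'urls'):
        for entity in tweet.get('entities', {}).get(key, []):
            spans.append(entity['indices'])
    # Cut spans out from right to left so earlier indices stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return ' '.join(text.split())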
17
A bit late to the party, but this solution avoids the punctuation mistakes you get with things like #hashtag1,#hashtag2 (no spaces in between), and it is very simple to implement:
import re
import string

def strip_links(text):
    # Find every http/https URL and replace it with ', ' so that the words
    # on either side of the link stay separated.
    link_regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    # Turn every punctuation character except the entity prefixes into a
    # space, then drop any word that still starts with '@' or '#'.
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    return ' '.join(words)
tests = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]

for t in tests:
    print(strip_all_entities(strip_links(t)))
# I really love that shirt at
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
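A side note on the design: strip_links substitutes ', ' rather than an empty string, presumably so that words flanking a URL do not get glued together; strip_all_entities then sweeps the comma away. If you run strip_all_entities over many tweets, the character-by-character replace loop can also be collapsed into a single str.translate pass. A sketch of that variant, behaviorally equivalent for the ASCII punctuation in string.punctuation (strip_all_entities_fast is my name for it, not part of the answer above):

import string

# Build the translation table once: every punctuation character except the
# entity prefixes '@' and '#' maps to a space.
_PUNCT_TO_SPACE = str.maketrans({c: ' ' for c in string.punctuation if c not in '@#'})

def strip_all_entities_fast(text):
    text = text.translate(_PUNCT_TO_SPACE)
    return ' '.join(w for w in text.split() if w[0] not in '@#')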
41
The example below comes fairly close. Unfortunately, there is no fully correct way to do this with a regular expression alone. The regex below removes URLs (not just http ones) along with punctuation, usernames, and any other non-alphanumeric characters, and it leaves exactly one space between words. If you want to parse tweets the way you describe, the system needs to be smarter than this, i.e. some pre-trained algorithm, because there is no standard tweet format.
Here is what I propose:
' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
Here are the results on the examples you provided:
>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>>
And here are some examples showing that it is not perfect:
>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>>
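If you end up keeping this approach, the same pattern (with the # prefix folded in, as in the improved filter above) is easier to maintain written with re.VERBOSE, which lets each alternative carry its own comment. A sketch with the same behavior as the one-liner; TWEET_NOISE and clean_tweet are just illustrative names:

import re

TWEET_NOISE = re.compile(r"""
      ([@#][A-Za-z0-9]+)    # @usernames and #hashtags
    | ([^0-9A-Za-z \t])     # anything that is not alphanumeric, space, or tab
    | (\w+:\/\/\S+)         # URLs with any scheme (http, https, ftp, ...)
""", re.VERBOSE)

def clean_tweet(x):
    return ' '.join(TWEET_NOISE.sub(' ', x).split())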