Extracting a URL from an email in Python
Thank you for submitting your information to our directory site ourdirectory.com
Link: http://myurlok.us
Please click the link below to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076
Once we receive your comfirmation, your site will be included for process!
regards,
http://www.ourdirectory.com
Thank you!
It should be obvious which link I need to extract.
6 Answers
1
@OP, if your email always has the same format:
with open("emailfile") as f:
    for line in f:
        if "confirm your submission" in line:
            # The confirmation link sits on the line right after the prompt
            print(next(f, "").strip())
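If the link is not guaranteed to be on the very next line (a blank line in between, or the link on the same line as the prompt), the same idea can be loosened a little. This is only a minimal sketch: it assumes the email body is already in a string and that the marker phrase is the one from the sample email. The helper name `link_after_marker` is made up for illustration.

```python
def link_after_marker(text, marker="confirm your submission"):
    """Return the first http(s) word found after the marker phrase, or None."""
    # partition() gives everything after the first occurrence of the marker
    _, found, rest = text.partition(marker)
    if not found:
        return None
    # Walk the remaining whitespace-separated words and stop at the first URL
    for word in rest.split():
        if word.startswith(("http://", "https://")):
            return word
    return None
```

Because it only looks at text after the marker, the earlier `http://myurlok.us` link is never considered.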
2
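If you are dealing with HTML email that contains hyperlinks, you can use Python's built-in HTMLParser class, which makes this simpler and quicker.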
from html.parser import HTMLParser

class parseLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the hyperlinks we are after
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)
                    print(self.get_starttag_text())

someHtmlContainingLinks = ""
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)
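To pick out one particular link rather than print all of them, the parser can collect the `href` values into a list that is filtered afterwards. The following is a rough sketch, not the answerer's code: the class name `LinkCollector`, the sample HTML, and the "contains `confirm`" filter are all assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # value is None for a bare attribute such as <a href>
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical HTML version of the sample email
sample_html = (
    '<p>Link: <a href="http://myurlok.us">your site</a></p>'
    '<p><a href="http://www.ourdirectory.com/confirm.aspx?id=1247778154270076">'
    'confirm</a></p>'
)

collector = LinkCollector()
collector.feed(sample_html)

# Assumed filter: keep only links whose URL mentions "confirm"
confirm_links = [url for url in collector.links if "confirm" in url]
```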
1
This solution only works when the source content is not HTML.
def extractURL(self, fileName):
    urlList = []
    # Open the file containing the email
    with open(fileName) as emailFile:
        for line in emailFile:
            # split() with no argument also drops the trailing newline
            for word in line.split():
                # Split on the first colon only, so URLs with a port still match
                tempWord = word.split(":", 1)
                # Check whether the word is a URL
                if len(tempWord) == 2 and tempWord[0] in ("http", "https"):
                    urlList.append(word)
    return urlList
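The same word-by-word scan can also let the standard library decide whether a word looks like a URL. This is only a sketch under some assumptions: the email body is already in a string, a "URL" means an http or https scheme plus a host, and `extract_urls` is a made-up name rather than part of the answer above.

```python
from urllib.parse import urlsplit

def extract_urls(text):
    """Return the whitespace-separated words that parse as http(s) URLs."""
    urls = []
    for word in text.split():
        try:
            parts = urlsplit(word)
        except ValueError:
            # urlsplit rejects some malformed inputs; treat those as non-URLs
            continue
        # Require both a web scheme and a host part
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(word)
    return urls
```

A word such as `Link:` is skipped because it has no host, while a URL with a port such as `http://host:8080/x` is kept.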