Extracting URLs from an email in Python

1 vote
6 answers
10410 views
Asked 2025-04-15 16:21

Thank you for submitting your information to our directory site ourdirectory.com
Link: http://myurlok.us
Please click the link below to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076

Once we receive your comfirmation, your site will be included for process!
regards,

http://www.ourdirectory.com

Thank you!

It should be obvious which link I need to extract.

6 Answers

1

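@OP, if your emails always have the same format, you can print the line that follows the "confirm your submission" sentence: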

with open("emailfile") as f:
    for line in f:
        # The confirmation URL sits on the line right after this sentence.
        if "confirm your submission" in line:
            print(next(f, "").strip())
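If the layout varies a little (a blank line before the link, or the link on the same line as the sentence), a slightly more forgiving variant may help. This is only a sketch: the function name `url_after_marker` and the `sample` text are made up for illustration, and it assumes the wanted URL is the first `http://` or `https://` token after the marker sentence.

```python
def url_after_marker(lines, marker="confirm your submission"):
    """Return the first http(s) token found after the marker text, or None."""
    seen_marker = False
    for line in lines:
        if not seen_marker:
            pos = line.find(marker)
            if pos == -1:
                continue
            seen_marker = True
            # Only look at the part of this line that follows the marker.
            line = line[pos + len(marker):]
        for word in line.split():
            if word.startswith(("http://", "https://")):
                return word
    return None


# Hypothetical sample modelled on the email in the question.
sample = """Please click the link below to confirm your submission.

http://www.ourdirectory.com/confirm.aspx?id=1247778154270076
"""
print(url_after_marker(sample.splitlines()))
```

The same function accepts an open file object, since it only iterates over lines.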
2

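If you are dealing with HTML email that contains hyperlinks, you can use the HTMLParser library (`html.parser` in Python 3), which makes this simpler and quicker.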

from html.parser import HTMLParser

class parseLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we care about
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)
                    print(self.get_starttag_text())

someHtmlContainingLinks = ""
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)
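To pick out one particular link rather than print all of them, the parser can collect the `href` values into a list and filter afterwards. A rough sketch, assuming Python 3 and assuming the confirmation link can be recognised by the word "confirm" in its URL; the class name `LinkCollector` and the `html_body` sample are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical HTML body modelled on the email in the question.
html_body = (
    '<p>Please <a href="http://www.ourdirectory.com/confirm.aspx'
    '?id=1247778154270076">confirm</a> your submission or visit '
    '<a href="http://www.ourdirectory.com">our site</a>.</p>'
)

collector = LinkCollector()
collector.feed(html_body)

# Assumption: the wanted link has "confirm" somewhere in its URL.
confirm_links = [url for url in collector.links if "confirm" in url]
print(confirm_links)
```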
1

This solution only works if the source content is not HTML.

def extractURL(fileName):
    urlList = []

    # Open the file containing the email
    with open(fileName) as file:
        for line in file:
            # Split on any whitespace so the trailing newline is dropped
            for word in line.split():
                # Split off the scheme at the first ":" only,
                # so URLs that contain a port are still recognised
                tempWord = word.split(":", 1)
                # Keep the word if its scheme is http or https
                if len(tempWord) == 2 and tempWord[0] in ("http", "https"):
                    urlList.append(word)

    return urlList
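A regular expression is another common way to pull URLs out of plain text. This is a different technique from the word splitting above, shown only as a sketch: the pattern is deliberately loose (it is not a full URL grammar), and the names `URL_PATTERN`, `extract_urls` and `email_text` are made up for the example.

```python
import re

# Loose pattern: "http://" or "https://" followed by anything that is not
# whitespace, an angle bracket or a quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return all http(s) URLs in text, minus trailing punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]


# Hypothetical sample modelled on the email in the question.
email_text = (
    "Link: http://myurlok.us Please click the link below to confirm "
    "your submission. "
    "http://www.ourdirectory.com/confirm.aspx?id=1247778154270076\n"
    "regards,\nhttp://www.ourdirectory.com\n"
)

urls = extract_urls(email_text)
# Assumption: the confirmation link is the one pointing at confirm.aspx.
confirm_urls = [url for url in urls if "confirm.aspx" in url]
print(confirm_urls)
```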
