Extract the domain and strip http:// and www., keeping only domain.com

Asked 2025-04-17 14:18

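I'm new to Python and want to extract domain names from a file that contains URLs.

Some of the URLs in my log file start with http://, some start with www., and some start with both.

This is the part of my code that strips the http:// prefix. What do I need to add so that it removes both http and www?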

line = re.findall(r'(https?://\S+)', line)

When I run the code as it is, only the http:// part is removed. If I change it to this:

line = re.findall(r'(https?://www.\S+)', line)

then only the URLs that start with both prefixes are affected. I'd like the code to be more flexible.

Thanks, everyone!

Edit: here is my complete code.

import re
import sys
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse

with open(sys.argv[1], "r") as f:
    for line in f:
        # Collect every http:// or https:// URL on this line
        urls = re.findall(r'(https?://\S+)', line)
        if urls:
            parsed = urlparse(urls[0])
            print(parsed.hostname)

I tagged this post incorrectly at first: I'm actually using urlparse, not just regular expressions.

6 Answers

I ran into the same problem. Here is a regex-based solution:

>>> import re
>>> rec = re.compile(r"https?://(www\.)?")

>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'https://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://www.domain.com/bla/    ').strip().strip('/')
'domain.com/bla'
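
Note that the pattern above requires a scheme, so a bare www.domain.com is left unchanged, and the path is kept. A minimal sketch of a variant that makes both prefixes optional and keeps only the host part follows. The names PREFIX and strip_prefixes are hypothetical and not part of the original answer.

```python
import re

# Optional scheme, then optional "www.", anchored at the start of the string.
PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)

def strip_prefixes(url):
    """Drop a leading http(s):// and/or www., then cut at the first '/'."""
    rest = PREFIX.sub("", url.strip())
    return rest.split("/", 1)[0]
```

Because the pattern is anchored with ^, a "www." in the middle of a name (as in foobarwww.com) is not touched.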

Answered 2025-04-17 by Python大师

It may be overkill for this particular case, but I generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org'

# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid

if not re.match(r'http(s?)\:', url):
    url = 'http://' + url

# url is now 'http://www.python.org'

parsed = urlsplit(url)

# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (an empty string), since (strictly speaking) the path was not defined

host = parsed.netloc  # www.python.org

# Removing www.
# This is a bad idea, because www.python.org could 
# resolve to something different than python.org

if host.startswith('www.'):
    host = host[4:]
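
The steps above can be wrapped into one helper. This is only a sketch for Python 3: the name host_without_www is hypothetical, and it uses the hostname attribute instead of netloc, which also lowercases the host and drops any port.

```python
from urllib.parse import urlsplit  # Python 3 only

def host_without_www(url):
    """Return the hostname of url with a leading 'www.' removed."""
    url = url.strip()
    # urlsplit only fills in the network location when a scheme is present,
    # so prepend one for inputs such as 'www.python.org'.
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

The same caveat applies as in the comments above: www.python.org and python.org are not guaranteed to resolve to the same site.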

Answered 2025-04-17 by Python大师

You don't actually need a regular expression here.

with open("file_path","r") as f:
    lines = f.read()
    lines = lines.replace("http://","")
    lines = lines.replace("www.", "") # May replace some false positives ('www.com')
    urls = [url.split('/')[0] for url in lines.split()]
    print('\n'.join(urls))

Sample input file:

http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com

Output:

foo.com
foobar.com
bar.com
foobar.com

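Addendum:

You may run into a tricky URL such as foobarwww.com, where the approach above would also strip the www. and leave foobarcom. In that case we have to go back to regular expressions.

Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'www\.(?!com)', '', lines) (this requires import re). Of course, every possible top-level domain (TLD) would have to be listed in the negative lookahead.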
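
One way to avoid enumerating TLDs is to strip the prefixes only at the start of each whitespace-separated token, using plain string methods. This is a sketch under that assumption; the function name domains_from_text is hypothetical, and it also handles https://, which the code above does not.

```python
def domains_from_text(text):
    """Yield one bare domain per whitespace-separated URL in text."""
    for token in text.split():
        # Remove a leading scheme, if any.
        for prefix in ("http://", "https://"):
            if token.startswith(prefix):
                token = token[len(prefix):]
                break
        # Remove "www." only when it is the very first label.
        if token.startswith("www."):
            token = token[len("www."):]
        # Keep everything before the first '/' (drops path and query).
        yield token.split("/", 1)[0]
```

For example, list(domains_from_text(open("file_path").read())) returns the domains for the sample file, and an input like foobarwww.com is left intact.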

Answered 2025-04-17 by Python大师
