Extracting a domain name with a regular expression

1 vote

4 answers

9100 views

Asked 2025-04-16 14:46

Suppose I have these URLs:

http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu

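I only want to extract "domainname.com", "domainname.org", or "domainname.edu". How can I do that?

I think I need to find the last dot, the one just before "com|org|edu...", and then print everything between the dot before that one and the dot after it (if there is one).

I need help with the regular expression. Thanks a lot! I am using Python.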


4 Answers

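In addition to Jase's answer, there are other ways to do this.

If you don't want to use urlparse, you can simply split the URL yourself.

First strip off the protocol part (http:// or https://, for example). Then split the string at the first '/'. That leaves you with something like 'mail.domainname.org'. Next, split that on '.' and take the last two items of the resulting list with [-2:].

This always gives you a domain such as domainname.org, provided you stripped the protocol correctly and the URL is valid.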
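The steps above can be sketched roughly as follows. The function name `domain_by_splitting` is made up for illustration, and the sketch assumes the URL has no port, query string, or fragment attached directly to the host (for example `http://domainname.com:80` would come back as `domainname.com:80`).

```python
def domain_by_splitting(url):
    """Return the last two dot-separated labels of the host in `url`.

    Hypothetical helper: assumes a valid URL whose host is not followed
    directly by a port, query, or fragment.
    """
    # Drop the scheme ("http://", "https://", ...) if one is present.
    rest = url.split("://", 1)[-1]
    # Keep everything before the first "/", i.e. the host part.
    host = rest.split("/", 1)[0]
    # Join the last two labels: "mail.domainname.org" -> "domainname.org".
    return ".".join(host.split(".")[-2:])


for url in (
    "http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    "http://mail.domainname.org/abc/abc/aaa",
    "http://domainname.edu",
):
    print(domain_by_splitting(url))
```

Note the slice `[-2:]`: a plain `[-2]` would return only the second-to-last label ("domainname"), not the pair.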

Personally I would still use urlparse, but this works too. I don't know much about regular expressions, so this is the approach I would take.

Answered 2025-04-16 by Python大师


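If you do want to solve this with a regular expression...

RFC-3986 is the authoritative standard for URIs (Uniform Resource Identifiers). Appendix B provides a regular expression that splits a URI into its components: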

re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme    = $2
# authority = $4
# path      = $5
# query     = $7
# fragment  = $9
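As a quick illustration of how the Appendix B pattern might be applied in Python, here is a minimal sketch. The helper name `split_uri` and the dictionary layout are assumptions made for this example; only the pattern itself comes from the RFC.

```python
import re

# Pattern copied from RFC 3986, Appendix B.
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"


def split_uri(uri):
    """Split `uri` into the five generic components (hypothetical helper)."""
    # Every part of the pattern is optional, so match() always succeeds.
    m = re.match(re_3986, uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://mail.domainname.org/abc/abc/aaa")
print(parts["authority"])  # mail.domainname.org
```

Components that are absent from the URI come back as `None`, except the path, which is an empty string when missing.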

Here is an enhanced, Python-friendly version that uses named capture groups. It is wrapped in a function and shown as part of a complete script:

import re

def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
        (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain =  re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})  # $domain: top two domain levels.
        (?::[0-9]*)?                      # Optional port number.
        $                                 # Anchor to end of string.
        """, 
        re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain")
    return result

data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
    ]
# Number the results from 1, matching the original manual counter.
for cnt, data in enumerate(data_list, start=1):
    print('Data[%d] domain = "%s"' % (cnt, get_domain(data)))

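If you want to learn more about decomposing and validating URIs according to RFC-3986, take a look at an article I am working on: Regular Expression URI Validation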

Answered 2025-04-16 by Python大师


Why use a regular expression at all?

http://docs.python.org/library/urlparse.html
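For reference, a minimal sketch of the parser-based route. It assumes Python 3, where the old `urlparse` module lives on as `urllib.parse`; the function name `domain_from_url` is made up for this example. Taking the last two labels is the same heuristic used elsewhere in this thread, so a host such as `example.co.uk` would come back as `co.uk`.

```python
from urllib.parse import urlparse  # Python 3 home of the old urlparse module


def domain_from_url(url):
    """Return the last two labels of the hostname, or "" if there is none.

    Hypothetical helper for illustration only.
    """
    # .hostname is the lowercased host without any port, or None.
    host = urlparse(url).hostname
    if not host:
        return ""
    return ".".join(host.split(".")[-2:])


for url in (
    "http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    "http://domainname.com:80",
    "http://domainname.com?query=one",
):
    print(domain_from_url(url))
```

Because the parser separates the port, query, and fragment from the host, no extra cleanup is needed before splitting on the dots.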

Answered 2025-04-16 by Python大师

