Extracting the second-level domain from a domain name? - Python

数据工程师

Asked 2025-04-16 11:21

I have a list of domain names, for example:

site.co.uk
site.com
site.me.uk
site.jpn.com
site.org.uk
site.it

The names may also include third- and fourth-level domains, for example:

test.example.site.org.uk
test2.site.com

I need to extract the second-level domain from each of these names. In every case above, the second-level domain is site.

Any good ideas? :)


6 Answers

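Use Python's tld library

https://pypi.python.org/pypi/tld

You can install it with:

$ pip install tld

from tld import get_tld, get_fld

print(get_tld("http://www.google.co.uk"))
# co.uk

print(get_fld("http://www.google.co.uk"))
# google.co.uk

# as_object=True returns a result object; its .domain attribute is the second-level label.
# For bare names such as site.co.uk (no scheme), also pass fix_protocol=True.
res = get_tld("http://www.google.co.uk", as_object=True)
print(res.domain)
# google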

Answered 2025-04-16 by Python大师


Following @kohlehydrat's suggestion:

from urllib.request import urlopen


class TldMatcher(object):
    # use class vars for lazy loading
    # The same list is also published at https://publicsuffix.org/list/public_suffix_list.dat;
    # pass that to loadTlds(url=...) if the URL below is unreachable.
    MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
    TLDS = None

    @classmethod
    def loadTlds(cls, url=None):
        url = url or cls.MASTERURL

        # grab master list (urlopen yields bytes in Python 3, so decode to text)
        with urlopen(url) as response:
            lines = response.read().decode('utf-8').splitlines()

        # strip comments and blank lines
        lines = [ln for ln in (ln.strip() for ln in lines) if ln and not ln.startswith('//')]

        cls.TLDS = set(lines)

    def __init__(self):
        if TldMatcher.TLDS is None:
            TldMatcher.loadTlds()

    def getTld(self, url):
        best_match = None
        chunks = url.split('.')

        # walk from the shortest suffix to the longest, keeping the longest match
        for start in range(len(chunks)-1, -1, -1):
            test = '.'.join(chunks[start:])
            startest = '.'.join(['*']+chunks[start+1:])

            if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
                best_match = test

        return best_match

    def get2ld(self, url):
        urls = url.split('.')
        tld = self.getTld(url)
        if tld is None:
            return None  # no known suffix matched
        tlds = tld.split('.')
        if len(urls) <= len(tlds):
            return None  # the name is itself a suffix
        return urls[-1 - len(tlds)]


def test_TldMatcher():
    matcher = TldMatcher()

    test_urls = [
        'site.co.uk',
        'site.com',
        'site.me.uk',
        'site.jpn.com',
        'site.org.uk',
        'site.it'
    ]

    errors = 0
    for u in test_urls:
        res = matcher.get2ld(u)
        if res != 'site':
            print("Error: found '{0}', should be 'site'".format(res))
            errors += 1

    if errors == 0:
        print("Passed!")
    return (errors == 0)


if __name__ == '__main__':
    test_TldMatcher()

Answered 2025-04-16 by Python大师


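There is no reliable way to work this out from the name alone. Subdomains are arbitrary, and new domain suffixes keep appearing, so the set of suffixes is large and always changing. The best approach is to check each name against the full suffix list and to keep your own copy of that list up to date.

The list: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1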
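
To make the "check against the list" idea concrete, here is a minimal offline sketch of the lookup. It is not the official algorithm implementation. SAMPLE_LIST is a tiny hand-picked set of rules written in the list's syntax, and the names parse_rules, public_suffix_length and second_level_label are made up for this illustration. It assumes plain ASCII names and handles the three rule kinds found in the list: normal rules, wildcard rules (*.ck) and exception rules (!www.ck).

```python
# Hypothetical sample in Public Suffix List syntax; the real list is far longer.
SAMPLE_LIST = """\
// comment lines start with two slashes
com
it
uk
co.uk
me.uk
org.uk
jpn.com
*.ck
!www.ck
"""


def parse_rules(text):
    """Collect rules from list text, skipping comments and blank lines."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line.split()[0])  # a rule ends at the first whitespace
    return rules


def public_suffix_length(labels, rules):
    """Return how many trailing labels make up the public suffix."""
    best = 1  # fallback when nothing matches: the last label alone
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if "!" + candidate in rules:
            # Exception rule wins: the suffix is the candidate minus its first label.
            return len(labels) - start - 1
        if candidate in rules or wildcard in rules:
            best = max(best, len(labels) - start)
    return best


def second_level_label(domain, rules=None):
    """Return the label directly left of the public suffix, or None."""
    if rules is None:
        rules = parse_rules(SAMPLE_LIST)
    labels = domain.lower().rstrip(".").split(".")
    n = public_suffix_length(labels, rules)
    if len(labels) <= n:
        return None  # the name is itself a public suffix
    return labels[-n - 1]


if __name__ == "__main__":
    for name in ["site.co.uk", "site.jpn.com", "test.example.site.org.uk", "test2.site.com"]:
        print(name, "->", second_level_label(name))
```

In practice you would feed parse_rules the text of a downloaded copy of the list rather than the sample, and refresh that copy periodically.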

Answered 2025-04-16 by Python大师

