Get the root domain of a link

Posted 2024-04-25 04:41:59


I have a link such as http://www.techcrunch.com/ and I want to get only the techcrunch.com part of it. How do I do this in Python?


3 answers

General structure of a URL:

scheme://netloc/path;parameters?query#fragment

As the TIMTOWTDI motto goes, there is more than one way to do it:

Using urlparse

>>> from urllib.parse import urlparse  # python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '')  # as per your case
>>> print(result)
stackoverflow.com/
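Note that `replace('www.', '')` removes that substring wherever it occurs in the host name, not only at the start. A minimal sketch that strips just a leading `www.` label; the helper name `strip_www` is made up for illustration and is not part of the answer above:

```python
from urllib.parse import urlparse

def strip_www(url):
    """Return the host of `url` without a leading 'www.' label (hypothetical helper)."""
    # .hostname is lowercased and excludes any port; it is None when no host is found.
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host

print(strip_www('http://www.techcrunch.com/'))
print(strip_www('http://awww.example.com/'))
```

This still only drops `www.`; any other subdomain is left in place, which is what the tldextract approach addresses.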

Using tldextract

>>> import tldextract  # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

In your case:

>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'

tldextract, on the other hand, knows what all gTLDs (generic top-level domains) and ccTLDs (country-code top-level domains) look like, because it looks up the currently active ones in the Public Suffix List. So, given a URL, it can tell the subdomain from the domain, and the domain from its suffix.
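To make the idea concrete, here is a rough sketch of a suffix-list lookup using only the standard library. It is not how tldextract is implemented; `TOY_SUFFIXES` is a tiny hand-written stand-in for the Public Suffix List, and `split_host` is a hypothetical name. The real list also has wildcard and exception rules that this sketch ignores.

```python
# Tiny hand-written stand-in for the Public Suffix List (illustration only).
TOY_SUFFIXES = {'com', 'org', 'uk', 'co.uk', 'io', 'github.io'}

def split_host(host, suffixes=TOY_SUFFIXES):
    """Split a host name into (subdomain, domain, suffix) by longest known suffix."""
    host = host.lower()
    labels = host.split('.')
    # Starting from the leftmost label tries the longest candidate suffix first.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return '', '', candidate  # the host itself is a public suffix
            return '.'.join(labels[:i - 1]), labels[i - 1], candidate
    return '', host, ''  # no known suffix, e.g. a bare intranet name

print(split_host('forums.news.cnn.com'))
print(split_host('www.theregister.co.uk'))
```

Because `co.uk` is tried before `uk`, the second call treats `theregister` as the domain rather than `co`.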

Cheers! :)

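Getting the hostname with urlparse is easy:

from urllib.parse import urlparse  # Python 3; in Python 2 this was the urlparse module
hostname = urlparse("http://www.techcrunch.com/").hostname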

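Getting the "root domain", however, is a bigger problem, because it is not defined in a syntactic sense. What is the root domain of "www.theregister.co.uk"? What about networks that use a default domain? "devbox12" could be a valid hostname.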
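A short sketch of why a purely syntactic rule falls short, using the hostnames mentioned above. The function `last_two_labels` is a made-up name for the naive "keep the last two labels" guess:

```python
def last_two_labels(hostname):
    """Naive guess: keep only the last two dot-separated labels."""
    return '.'.join(hostname.split('.')[-2:])

for host in ('www.techcrunch.com', 'www.theregister.co.uk', 'devbox12'):
    print(host, '->', last_two_labels(host))
```

The guess works for `www.techcrunch.com` but yields `co.uk` for `www.theregister.co.uk`, and it has nothing useful to say about a single-label name such as `devbox12`.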

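One way to handle this is to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and private domains that are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library: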

import publicsuffix2
from urllib.parse import urlparse  # Python 3

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself.  Make sure you
    # update frequently, though!
    psl_file = publicsuffix2.fetch()

    hostname = urlparse(url).hostname

    return publicsuffix2.get_public_suffix(hostname, psl_file)
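The comment in `get_base_domain` suggests caching the downloaded list yourself. One possible shape for that, as a library-agnostic sketch: `load_cached` and its parameters are hypothetical names, and `fetch_text` stands for whatever callable you use to download the list as a string.

```python
import os
import time

def load_cached(path, max_age_seconds, fetch_text):
    """Return the cached text at `path`, refreshing it via fetch_text() when stale."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        age = None  # no cache file yet
    if age is None or age > max_age_seconds:
        text = fetch_text()  # assumed to return the list contents as a str
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        return text
    with open(path, encoding='utf-8') as f:
        return f.read()
```

With a `max_age_seconds` of one day, for example, the network is hit at most once per day while the list still gets refreshed regularly.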

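The following script isn't perfect, but it is good enough for display or shortening purposes. If you really want or need to avoid any third-party dependencies, and in particular remotely fetching and caching TLD data, I can suggest the script I use in my projects. It keeps the last two parts of the domain for the most common domain extensions and the last three parts for the remaining, less well-known extensions. In the worst case, the domain will have three parts instead of two:

from urllib.parse import urlparse  # Python 3; in Python 2: from urlparse import urlparse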

def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path # Just in case, for urls without scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
            'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
    return domain

extract_domain('google.com')          # google.com
extract_domain('www.google.com')      # google.com
extract_domain('sub.sub2.google.com') # google.com
extract_domain('google.co.uk')        # google.co.uk
extract_domain('sub.google.co.uk')    # google.co.uk
extract_domain('sub.sub2.voila.fr')   # sub2.voila.fr
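One caveat with the `netloc or path` fallback: for a scheme-less URL that also carries a path, such as `www.google.com/search`, urlparse puts the whole string in `path`. A minimal sketch of one way to get just the host before splitting it into parts; `host_of` is a hypothetical helper, and the `'://'` check is a simplifying assumption that can misfire when a scheme-less URL contains `://` in its query string:

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercased host name of `url`, with or without a scheme."""
    if '://' not in url:
        # A leading '//' makes urlparse treat what follows as the network location.
        url = '//' + url
    return urlparse(url).hostname or ''

print(host_of('www.google.com/search?q=1'))
print(host_of('http://Sub.Google.co.uk/'))
```

The result can then be passed to `extract_domain` in place of the raw URL.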
