>>> from urllib.parse import urlparse # python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever') # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '') # as per your case
>>> print(result)
stackoverflow.com/
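Note that netloc is not always a clean hostname: it may also carry userinfo and a port. If you only want the host, the parse result's hostname attribute strips both (and lowercases). A small stdlib-only illustration:

```python
from urllib.parse import urlparse

# netloc keeps userinfo and the port verbatim;
# hostname strips both and lowercases the host.
uri = urlparse('http://User@www.StackOverflow.com:8080/questions/41899120')
print(uri.netloc)    # User@www.StackOverflow.com:8080
print(uri.hostname)  # www.stackoverflow.com
```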
>>> import tldextract # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
tldextract, on the other hand, knows what all gTLDs (generic top-level domains)
and ccTLDs (country-code top-level domains) look like
by looking up the currently live ones in the Public Suffix
List. So, given a URL, it knows its subdomain from its domain, and its
domain from its country code.
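To see why that suffix knowledge matters: a naive "take the last two labels" split works for .com but fails for multi-part suffixes like .co.uk. A stdlib-only sketch of the naive approach and where it breaks:

```python
from urllib.parse import urlparse

def naive_root(url):
    # Naive heuristic: keep only the last two dot-separated labels.
    return '.'.join(urlparse(url).hostname.split('.')[-2:])

print(naive_root('http://forums.news.cnn.com/'))    # cnn.com -- correct
print(naive_root('http://www.theregister.co.uk/'))  # co.uk   -- wrong, drops 'theregister'
```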
import publicsuffix
from urllib.parse import urlparse  # Python 3 (was "import urlparse" in Python 2)

def get_base_domain(url):
    # fetch() causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache the list yourself. Make sure you
    # update it frequently, though!
    psl = publicsuffix.fetch()
    hostname = urlparse(url).hostname
    return publicsuffix.get_public_suffix(hostname, psl)
The general structure of a URL is scheme://netloc/path;parameters?query#fragment (the six components urlparse returns). And as the "TIMTOWTDI" motto ("There Is More Than One Way To Do It") goes, you can use urlparse, or you can use tldextract, whichever fits your case. Cheers! :)
Getting the hostname with urlparse is straightforward:
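For instance (stdlib only, using the hostname discussed below as the example):

```python
from urllib.parse import urlparse

# .hostname gives the bare host, without scheme, port, or path.
hostname = urlparse('http://www.theregister.co.uk/some/article').hostname
print(hostname)  # www.theregister.co.uk
```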
Getting the "root domain", however, is a much bigger problem, because it isn't defined in any syntactic sense. What is the root domain of "www.theregister.co.uk"? And what about networks that use default domains, where "devbox12" could be a valid hostname?
One way to handle this is to use the Public Suffix List, which tries to catalog both the real top-level domains (e.g. ".com", ".net", ".org") and private domains that are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python with the publicsuffix2 library:
The following script isn't perfect, but it can serve for display/shortening purposes. If you really want/need to avoid any third-party dependency - in particular remotely fetching and caching TLD data - I can suggest the script I use in my projects. It uses the last two parts of the domain for the most common domain extensions, and the last three parts for the rest of the lesser-known ones. In the worst case, the domain will end up with three parts instead of two:
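The script that answer refers to was not preserved in this extract; the following is a hedged reconstruction of the heuristic it describes (the COMMON_TLDS set here is my placeholder, not the original list): keep the last two labels for common extensions and the last three for everything else.

```python
from urllib.parse import urlparse

# Placeholder set of "most common" extensions -- the original script's
# list was not preserved, so treat this as an assumption.
COMMON_TLDS = {'com', 'net', 'org', 'edu', 'gov', 'info', 'io'}

def shorten_domain(url):
    # Heuristic only: no Public Suffix List lookup, so rare multi-part
    # suffixes may still be split incorrectly.
    hostname = (urlparse(url).hostname or url).lower()
    parts = hostname.split('.')
    if len(parts) <= 2:
        return hostname
    # Common extensions: keep the last two labels; otherwise the last three.
    keep = 2 if parts[-1] in COMMON_TLDS else 3
    return '.'.join(parts[-keep:])

print(shorten_domain('http://www.stackoverflow.com/questions'))  # stackoverflow.com
print(shorten_domain('http://www.theregister.co.uk/'))           # theregister.co.uk
```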