Extracting domains/subdomains in Python

Posted 2024-05-16 23:16:10


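I want to find the domains and subdomains in this string. I used a regular expression to strip the non-ASCII characters, but nothing changed.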

data = [{
    "data": "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com",
    "name": "subjectAltName"
}]

# data is a list, so the dict has to be reached through data[0]
text = ''.join([i if ord(i) < 128 else ' ' for i in data[0]["data"]])

1 answer
User
#1 · Posted 2024-05-16 23:16:10
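  • You need to split your text on the right delimiter.
  • Strip the non-ASCII characters safely. In your case they are not single characters but escape sequences written out as several characters each.

Note: double-check whether \x1e in your data has a length of one character or four.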
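The check suggested in the note can be done with `len()`. The sketch below is only an illustration; the names `real_char`, `literal_text` and `sample` are made up here, and `sample` is a shortened copy of the string from the question.

```python
# One real control character versus the four-character text "\x1e".
real_char = "\x1e"       # a single character with code point 0x1e
literal_text = "\\x1e"   # backslash, "x", "1", "e"

print(len(real_char))     # 1
print(len(literal_text))  # 4

# Shortened copy of the question's string (literal escape text).
sample = "0\\x1e\\x82*.extractdomain.com"

# An ord()-based filter leaves literal escape text untouched,
# because backslash, "x" and the hex digits are all plain ASCII.
filtered = "".join(c for c in sample if ord(c) < 128)
print(filtered == sample)  # True
```

If the data holds the four-character form, a filter based on `ord(c) < 128` has nothing to remove, which would explain why the attempt in the question changed nothing.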


import re

def extract_url(url):
    # Treat the last two labels as the domain and everything before them as the subdomain
    chunks = url.split(".")
    subdomain, domain = ".".join(chunks[:-2]), ".".join(chunks[-2:])
    return (subdomain, domain)

# Split the text after every ".com" (zero-width split, needs Python 3.7+)
sites = re.split(r"(?<=\.com)", data[0]["data"])

# Remove escape sequences written out as text (backslash, "x", two hex digits)
extracted_sites = [re.sub(r'\\x[0-9a-f]{2}', '', site) for site in sites if site]
# Remove any real single-character non-ASCII characters
extracted_sites = ["".join([c for c in site if ord(c) < 128]) for site in extracted_sites]

print([extract_url(url) for url in extracted_sites])

Output (subdomain, domain):

[('0*', 'extractdomain.com'), ('test', 'extractdomain.com')]
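Splitting on ".com" only works when every name ends in .com, and it leaves the stray leading "0" in the first result. As a possible alternative, and not the method used in the answer above, the sketch below replaces the junk with spaces and then searches for hostname-like tokens. The names `ESCAPE`, `HOSTNAME`, `find_hostnames` and `split_host` are hypothetical, and the hostname pattern is a rough assumption rather than a full validator.

```python
import re

# Literal escape text such as "\x1e" (backslash, "x", two hex digits).
ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")
# Rough hostname pattern with an optional leading wildcard label (assumption).
HOSTNAME = re.compile(r"(?:\*\.)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_hostnames(raw):
    # Turn escapes and real non-ASCII characters into spaces so that
    # neighbouring names stay separated instead of running together.
    text = ESCAPE.sub(" ", raw)
    text = "".join(c if ord(c) < 128 else " " for c in text)
    return HOSTNAME.findall(text)


def split_host(host):
    # Naive split: the last two labels are treated as the domain.
    labels = host.split(".")
    return ".".join(labels[:-2]), ".".join(labels[-2:])


raw = "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com"
hosts = find_hostnames(raw)
print(hosts)
# ['*.extractdomain.com', 'test.extractdomain.com']
print([split_host(h) for h in hosts])
# [('*', 'extractdomain.com'), ('test', 'extractdomain.com')]
```

Like `extract_url`, `split_host` assumes the domain is always the last two labels, so a name such as example.co.uk would be split incorrectly.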
