Python编辑文本以仅保留给定tex中URL中的域名

2024-04-26 06:29:31 发布

您现在位置:Python中文网/ 问答频道 /正文

我已经设法从某些网站上获取评论,但是它会删除HTML,因此有很多完整的URL,我想为NLTK分析清理这些URL,而不需要常量“https://wwww.”干扰单词结果的频率

例如,如果我有以下文本:

In the world of selective animal breeding (https://en.wikipedia.org/wiki/Selective_breeding), to "breed true" means that specimens of an animal breed will breed true-to-type when mated like-to-like; that is, that the progeny of any two individuals in the same breed will show consistent, replicable and predictable characteristics. A puppy from two purebred dogs of the same breed, for example, will exhibit the traits of its parents, and not the traits of all breeds in the subject breed's ancestry.

However, breeding from too small a gene pool, especially direct inbreeding (https://en.wikipedia.org/wiki/Inbreeding), can lead to the passing on of undesirable characteristics or even a collapse of a breed population due to inbreeding depression. Therefore, there is a question, and often heated controversy, as to when or if a breed may need to allow "outside" stock in for the purpose of improving the overall health and vigor of the breed.

Because pure-breeding creates a limited gene pool, purebred animal breeds are also susceptible to a wide range of congenital health problems.[1] This problem is especially prevalent in competitive dog breeding and dog show circles due to the singular emphasis on aesthetics rather than health or function. Such problems also occur within certain segments of the horse industry for similar reasons. The problem is further compounded when breeders practice

我希望()中的URL只是域名,在本例中是(wikipedia)。我在使用regex时发现了一个类似的问题,但是它将URL(在本例中)改为(wikipedia.com)。我不需要域名后缀,只需要网站/域名的名称。如果是en.wikipedia.com->;维基百科

我找到的正则表达式代码是re.sub('https*://[\w\.]+\.com[\w/\-]+|https*://[\w\.]+\.com|[\w\.]+\.com/[\w/\-]+', lambda x:re.findall('(?<=\://)[\w\.]+\.com|[\w\.]+\.com', x.group())[0], s)


Tags: andofthetoinhttpscomurl
1条回答
网友
1楼 · 发布于 2024-04-26 06:29:31

您希望使用标准库的urllib.parse模块(Python2中的urlparse)

一旦有了netloc,只需删除端口号(如果有)并在点上拆分域字符串即可

相关问题 更多 >