Regex not working as required

Posted 2024-04-19 20:05:45

Here is my HTML:

<ul class="hide menuSearchType">
    <li><a href="../../dynamic/city_select.aspx">Search by city</a></li>
    <li><a href="../../searchbyphone.aspx">Search by phone</a></li>
    <li><a href="../searchbyaddress.aspx">Search by address</a></li>
    <li><a href="../searchbybrand.aspx">Search by brand</a></li>
    <li><a href="/advertisement-center/">Advertise with us</a></li>
    <li><a href="/advertisement-center/">Advertise with us</a></li>
    <li><a href="//fonts.googleapis.com/css?family=Open+Sans">Find a Person</a></li>
    <li><a href="//fonts.googleapis.com/css?family=Open+Sans">Find a Person</a></li>
    <li><a href="dynamic/city_select.aspx">Search by city</a></li>
    <li><a href="searchbybrand.aspx">Search by brand</a></li>
</ul>

Here is my Python code:

import re, os
from urllib.parse import urlparse

url = "http://www.phonebook.com.pk/dynamic/search.aspx?searchtype=cat&class_id=2566" 

path = urlparse(url)
lpath = os.path.dirname(path.path)

html = u"<ul class=\"hide menuSearchType\">\n    <li><a href=\"../../dynamic/city_select.aspx\">Search by city</a></li>\n    <li><a href=\"../../searchbyphone.aspx\">Search by phone</a></li>\n    <li><a href=\"../searchbyaddress.aspx\">Search by address</a></li>\n    <li><a href=\"../searchbybrand.aspx\">Search by brand</a></li>\n    <li><a href=\"/advertisement-center/\">Advertise with us</a></li>\n    <li><a href=\"/advertisement-center/\">Advertise with us</a></li>\n    <li><a href=\"//fonts.googleapis.com/css?family=Open+Sans\">Find a Person</a></li>\n    <li><a href=\"//fonts.googleapis.com/css?family=Open+Sans\">Find a Person</a></li>\n    <li><a href=\"dynamic/city_select.aspx\">Search by city</a></li>\n    <li><a href=\"searchbybrand.aspx\">Search by brand</a></li>\n</ul>"

linkList1 = re.findall(re.compile(u'(?<=href=")../.*?(?=")'), str(html))

for link1 in linkList1:
    html = re.sub(link1, path.scheme + "://" + os.path.normpath(path.netloc + os.path.abspath(lpath + "/" + link1)), str(html))

print (html)

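The problem is that it detects links starting with "../", but links starting with "../../" get changed as well. Is there a way to restrict my regex so that it selects only URLs with a single "../"?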

Expected output:

<ul class="hide menuSearchType">
    <li><a href="../../dynamic/city_select.aspx">Search by city</a></li>
    <li><a href="../../searchbyphone.aspx">Search by phone</a></li>
    <li><a href="http://www.phonebook.com.pk/searchbyaddress.aspx">Search by address</a></li>
    <li><a href="http://www.phonebook.com.pk/searchbybrand.aspx">Search by brand</a></li>
    <li><a href="/advertisement-center/">Advertise with us</a></li>
    <li><a href="/advertisement-center/">Advertise with us</a></li>
    <li><a href="//fonts.googleapis.com/css?family=Open+Sans">Find a Person</a></li>
    <li><a href="//fonts.googleapis.com/css?family=Open+Sans">Find a Person</a></li>
    <li><a href="dynamic/city_select.aspx">Search by city</a></li>
    <li><a href="searchbybrand.aspx">Search by brand</a></li>
</ul>

3 answers

Using BeautifulSoup, as requested:

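import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Rewrite only hrefs that start with a single "../"; "[^.]" rejects "../../".
for item in soup.select('li'):
    # Group 1 keeps href=" in the output; group 2 is the first path character.
    output = re.sub(r'(?is)(href=")\.\./([^.])',
                    r'\1http://www.phonebook.com.pk/\2', str(item))
    print(output)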

Try the following:

linkList1 = re.findall(r'(?<=href=")\.\./\w.*?(?=")', str(html))

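This guarantees that the slash is followed by a word character, so a second "../" (which begins with a period) cannot match.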
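To see the difference, here is a minimal sketch that runs the question's pattern and the stricter one side by side. The `html` sample is a shortened stand-in for the markup in the question, and the names `loose` and `strict` are only illustrative. The dots are escaped here so they match literal periods.

```python
import re

# Shortened sample of the links from the question (an assumption for brevity).
html = (
    '<a href="../../searchbyphone.aspx">Search by phone</a>\n'
    '<a href="../searchbyaddress.aspx">Search by address</a>\n'
    '<a href="/advertisement-center/">Advertise with us</a>\n'
    '<a href="searchbybrand.aspx">Search by brand</a>\n'
)

# Question's idea: after "../" the lazy .*? may swallow a second "../".
loose = re.findall(r'(?<=href=")\.\./.*?(?=")', html)

# Stricter idea: the character right after "../" must be a word character.
strict = re.findall(r'(?<=href=")\.\./\w.*?(?=")', html)

print(loose)   # ['../../searchbyphone.aspx', '../searchbyaddress.aspx']
print(strict)  # ['../searchbyaddress.aspx']
```

Note that `\w` also rejects paths whose first character is not a letter, digit, or underscore, which may or may not be what you want.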

You can do the replacement directly with a regex:

output = re.sub(r'(?is)(href=")\.\./([^.])', r'\1http://www.phonebook.com.pk/\2', str(html))
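If you would rather not hard-code the host, one possible variation is to let `urllib.parse.urljoin` resolve each matched link against the page URL. This is only a sketch, not the method from any of the answers: `base_url` is the URL from the question, while `SINGLE_PARENT`, `absolutize`, and the shortened `html` sample are names and data made up for illustration.

```python
import re
from urllib.parse import urljoin

# Page URL taken from the question.
base_url = "http://www.phonebook.com.pk/dynamic/search.aspx?searchtype=cat&class_id=2566"

# Shortened sample of the links from the question (an assumption for brevity).
html = (
    '<li><a href="../../searchbyphone.aspx">Search by phone</a></li>\n'
    '<li><a href="../searchbyaddress.aspx">Search by address</a></li>\n'
    '<li><a href="/advertisement-center/">Advertise with us</a></li>\n'
    '<li><a href="searchbybrand.aspx">Search by brand</a></li>\n'
)

# Match href="../..." only when "../" is not followed by another "../".
SINGLE_PARENT = re.compile(r'(href=")(\.\./(?!\.\./)[^"]*)(")')

def absolutize(match):
    """Hypothetical helper: resolve the relative link against base_url."""
    return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)

output = SINGLE_PARENT.sub(absolutize, html)
print(output)
```

The negative lookahead `(?!\.\./)` is what keeps "../../" links untouched, and `urljoin` takes care of walking up one directory from `/dynamic/`.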
