Cannot find a combination of keywords on an XML page using Python and Beautiful Soup

elif len(keywords) == 2:
    keyword1 = keywords[0]
    keyword2 = keywords[1]
    print("Searching for product...")
    keywordLinkFound = False
    while keywordLinkFound is False:
        html = self.driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        try:
            keywordLink = soup.find('loc', text=re.compile(keyword1 + keyword2)).text
            return keywordLink
        except AttributeError:
            print("Product not found on site, retrying...")
            time.sleep(monitorDelay)
            self.driver.refresh()
    break

<url>
  <loc>
    https://packershoes.com/products/copy-of-382-new-balance-m999jtc-1
  </loc>
  <lastmod>2018-12-04T21:49:25-05:00</lastmod>
  <changefreq>daily</changefreq>
  <image:image>
    <image:loc>
      https://cdn.shopify.com/s/files/1/0208/5268/products/NB999JTC-2_4391df07-a3a2-4c82-87b3-49d776096473.jpg?v=1543851653
    </image:loc>
    <image:title>NEW BALANCE M999JTC "MADE IN USA"</image:title>
  </image:image>
</url>
<url>
  <loc>
    https://packershoes.com/products/copy-of-382-packer-x-new-era-new-york-yankee-duck-canvas-1
  </loc>
  <lastmod>2018-12-06T14:39:37-05:00</lastmod>
  <changefreq>daily</changefreq>
  <image:image>
    <image:title>
      NEW ERA JAPAN 59FIFTY NEW YORK YANKEES "DUCK CANVAS"
    </image:title>
  </image:image>
</url>

1 Answer

User

#1 · Posted 2024-04-26 04:52:21

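keyword1 + keyword2 is the string yankeeduck, so that is the string you are searching for, and it will not match unless the two words appear joined exactly like that. You need to allow anything between them, and also recognize them in the reverse order. So the regexp should be: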

yankee.*duck|duck.*yankee
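To make the difference concrete, here is a minimal sketch using only the standard `re` module. The `loc_text` value is copied from the sitemap in the question; the variable names are just for illustration.

```python
import re

# <loc> text copied from the sitemap in the question.
loc_text = (
    "https://packershoes.com/products/"
    "copy-of-382-packer-x-new-era-new-york-yankee-duck-canvas-1"
)

# What the question's code builds: the two keywords glued together.
concatenated = re.compile("yankee" + "duck")  # pattern is "yankeeduck"

# The pattern from this answer: either order, anything in between.
either_order = re.compile("yankee.*duck|duck.*yankee")

# The URL contains "yankee-duck", so the glued pattern finds nothing.
print(concatenated.search(loc_text))

# The alternation matches, and would also match "duck ... yankee".
print(either_order.search(loc_text) is not None)
print(either_order.search("duck-canvas-yankee") is not None)
```

The first `print` shows `None`, which is why `.text` raised `AttributeError` in the original loop.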

So the code should be:

regexp = "%s.*%s|%s.*%s" % (keyword1, keyword2, keyword2, keyword1)
keywordLink = soup.find('loc', text=re.compile(regexp)).text

If the keywords contain characters that are special in a regexp, they should be escaped:

keyword1 = re.escape(keywords[0])
keyword2 = re.escape(keywords[1])
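Putting the pieces together, a small helper might look like the sketch below. The name `build_keyword_pattern` is hypothetical (it is not part of the original code), and the test strings other than the sitemap URL are made up to show what escaping changes.

```python
import re


def build_keyword_pattern(keyword1, keyword2):
    """Hypothetical helper: match both keywords in either order.

    Each keyword is escaped first so characters such as "." or "+"
    are matched literally instead of being treated as regexp syntax.
    """
    k1 = re.escape(keyword1)
    k2 = re.escape(keyword2)
    return re.compile("%s.*%s|%s.*%s" % (k1, k2, k2, k1))


# <loc> text copied from the sitemap in the question.
loc_text = (
    "https://packershoes.com/products/"
    "copy-of-382-packer-x-new-era-new-york-yankee-duck-canvas-1"
)

# Keyword order does not matter.
print(build_keyword_pattern("yankee", "duck").search(loc_text) is not None)
print(build_keyword_pattern("duck", "yankee").search(loc_text) is not None)

# Made-up strings: without escaping, "a.c" would also match "abc".
dotted = build_keyword_pattern("a.c", "x")
print(dotted.search("a.c-x") is not None)  # literal dot is present
print(dotted.search("abc-x") is not None)  # no literal dot
```

The compiled pattern can then be passed to `soup.find('loc', text=...)` in place of `re.compile(regexp)` above.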
