Extracting a substring with a regular expression in Python

Posted 2024-03-29 14:32:46


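I am trying to extract a substring that starts at a precise point and runs up to a special character, the double quote ("). This is the string: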

element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'

The part I want to extract is the keyword: everything from data-keyword=" up to the next " character, so in this case: aa battery plus

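But the result I get is a single letter, even though I used delimiters and square brackets to bound the left and right sides of the string.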

I tried using the re.findall() method:

import re
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa batteries plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
z = re.search(r'[\bdata-keyword="\b,'""']',element).group(0)
print(z)

This is what I get:

d
Process finished with exit code 0

How can I extract only the keyword, i.e. aa battery plus?


3 answers

Parsing HTML with regex is not a good idea. Instead, you can use an HTML parser such as BeautifulSoup.

For example:

from bs4 import BeautifulSoup

element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
soup = BeautifulSoup(element, "html.parser")
print(soup.find("div", class_="s-suggestion")["data-keyword"])

Output:

aa battery plus
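
If installing BeautifulSoup is not an option, roughly the same attribute lookup can be sketched with the standard library's html.parser module. The class name KeywordParser, the shortened element string, and the choice to collect every data-keyword value are assumptions made for illustration; they are not part of the answer above.

```python
from html.parser import HTMLParser


class KeywordParser(HTMLParser):
    """Hypothetical helper: collect the data-keyword attribute of every start tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        for name, value in attrs:
            if name == "data-keyword":
                self.keywords.append(value)


# Shortened version of the element from the question (assumption for brevity)
element = '<div class="s-suggestion" data-keyword="aa battery plus" id="issDiv5"><span class="s-heavy"></span>ab</div>'

parser = KeywordParser()
parser.feed(element)
print(parser.keywords)
```

Because the parser walks every start tag, the same class would also pick up the attribute from several suggestion elements in one document.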

You can use the re.findall() function:

import re
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
output = re.findall(r'data-keyword="(.*?)"', element)[0]
print(output)

Output:

aa battery plus
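
For context on why the pattern in the question returned a single letter: square brackets in a regex define a character class, which matches exactly one character from the listed set, so the search stops at the first such character in the string. The sketch below contrasts that with a capture group. It uses a shortened element string and a simplified character class, both assumptions made for brevity.

```python
import re

# Shortened version of the element from the question (assumption for brevity)
element = '<div class="s-suggestion" data-keyword="aa battery plus" id="issDiv5">ab</div>'

# Square brackets form a character class: exactly ONE character is matched,
# here the first character of the string that belongs to the set.
one_char = re.search(r'[data-keyword="]', element).group(0)
print(one_char)

# Parentheses form a capture group: group(1) is the text between the quotes.
# [^"]* means "any run of characters that are not a double quote".
keyword = re.search(r'data-keyword="([^"]*)"', element).group(1)
print(keyword)
```

The negated class [^"]* is an alternative to the non-greedy .*? used above; both stop at the first closing quote.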

If you want the text that sits between two strings, you need to use this regex form:

import re

element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa batteries plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'

z = re.search(r'data-keyword="(.*?)"', element).group(1)
print(z)
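
One caveat with this form: re.search() returns None when the attribute is absent, so calling .group(1) directly raises AttributeError. A minimal sketch of a guarded version follows; the function name extract_keyword and the sample strings are hypothetical.

```python
import re
from typing import Optional


def extract_keyword(html: str) -> Optional[str]:
    """Return the data-keyword value, or None if the attribute is missing."""
    match = re.search(r'data-keyword="(.*?)"', html)
    # Only touch the capture group when a match was actually found
    return match.group(1) if match else None


print(extract_keyword('<div data-keyword="aa batteries plus"></div>'))
print(extract_keyword('<div class="s-suggestion"></div>'))
```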
