Python regular expression Tokeniz

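3 answers

User

Answer 1 · edited 2024-05-17 14:11:00

Regex is a very powerful tool, but it is not the right tool for every situation (as others have already suggested). That said, here is a minimal as-per-request regex example in the console: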

>>> import re
>>> s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'
>>> re.findall(r'a href="(.*?)"', s)
['example.com', 'subdomain.example2.net']

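Focus on r'a href="(.*?)"'. In plain English it means: "find a string that starts with a href=", then save every character as the result until you reach the next "". The syntax is:

() means "capture only what is in here"
. means "any character" (except a newline, by default)
* means "any number of times", including zero
? after * means "non-greedy", in other words: find the shortest string that satisfies the pattern (try it without the question mark and you will see the difference).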
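To make the "try it without the question mark" suggestion concrete, here is a small sketch that runs both variants on the same sample string. The variable names lazy and greedy are only illustrative.

```python
import re

# Sample text reused from the console example above.
s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'

# Non-greedy: each match stops at the first closing quote it can reach.
lazy = re.findall(r'a href="(.*?)"', s)

# Greedy: a single match that runs on to the last closing quote in the string.
greedy = re.findall(r'a href="(.*)"', s)

print(lazy)
print(greedy)
```

The non-greedy pattern yields one entry per link, while the greedy one swallows everything between the first opening quote and the last closing quote.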


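User

Answer 2 · edited 2024-05-17 14:11:00

Don't use regexp:

This is why regex should not be the first thing you think of when dealing with HTML or XML (or URLs).

If you still want to use regex,

you can find several patterns that do the job, and several ways to retrieve the strings you are looking for.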

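These patterns do the job (the greedy (.*) variant only when the string contains a single link, since it otherwise matches through to the last link):

r'\(a href="(.*?)"\)'

r'\(a href="(.*)"\)'

r'\(a href="(.+?)"\)'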

1. About findall()

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
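As a rough illustration of the behaviour described in that documentation excerpt, the sketch below runs findall on the sample string from this answer. The second pattern, which splits the link into a name and a suffix, is a made-up example added only to show the list-of-tuples case.

```python
import re

st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'

# One group in the pattern: findall returns a list of strings (the group contents).
hosts = re.findall(r'\(a href="(.*?)"\)', st)

# Two groups (hypothetical pattern): findall returns one tuple per match.
pairs = re.findall(r'\(a href="(.*?)\.(\w+)"\)', st)

print(hosts)
print(pairs)
```

Note that the literal parentheses around the link are escaped as \( and \), while the unescaped ones define the groups.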


2. search()

re.search(pattern, string, flags=0)

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance.

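Then retrieve the groups with the match object's group() method. For example, with the regex r'\(a href="(.+?(.).+?)"\)', which also works here, you have several nested groups: group 0 is the whole match, group 1 is the first parenthesized subpattern, (.+?(.).+?), and group 2 is the inner (.).

When you only need the first occurrence of the pattern, you can use search. With your example: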

>>> import re
>>> st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
>>> m = re.search(r'\(a href="(.+?(.).+?)"\)', st)
>>> m.group(1)
'example.com'
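One detail the console session does not show is that re.search returns None when nothing matches, so calling group() on the result directly can fail. A minimal sketch, assuming a helper named first_link (a hypothetical name, not part of the re module):

```python
import re

st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
pattern = re.compile(r'\(a href="(.+?(.).+?)"\)')

def first_link(text):
    """Return the first captured link target, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

# Inspect the nested groups of the first match.
m = pattern.search(st)
whole, outer, inner = m.group(0), m.group(1), m.group(2)

print(whole, outer, inner)
print(first_link('no links here'))
```

Here group 0 is the whole match including the literal parentheses, group 1 is the link target, and group 2 is the single character captured by the inner (.).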

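User
Answer 3 · edited 2024-05-17 14:11:00

There is a great module called BeautifulSoup (link: http://www.crummy.com/software/BeautifulSoup/) that is very well suited to parsing HTML. You should use it instead of regex to extract information from HTML. Here is a BeautifulSoup example: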

>>> from bs4 import BeautifulSoup
>>> html = """<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>"""
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
>>> mylist = soup.find_all('a')  # every <a> tag in the document
>>> for link in mylist:
...     print(link['href'])
...
http://link.com
http://second.com

Here is a link to the documentation, which is easy to follow: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
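If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not BeautifulSoup itself, and the LinkCollector class name is made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

html = """<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>"""

collector = LinkCollector()
collector.feed(html)
collector.close()

print(collector.links)
```

It is more verbose than BeautifulSoup, but it still relies on a real HTML parser rather than on a regex.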
