Python: matching scraped lists of unequal length

<td align=center>19/11/11 12:01:21 AM</td> <td align=center><a href=profiles.php?XID=1>player1</a> hospitalized <a href=profiles.php?XID=2>player2</a></td>

<td align="center">19/11/11 12:58:03 AM</td> <td align=center><a href=profiles.php?XID=3>player3</a> attacked <a href=profiles.php?XID=1>player1</a> and lost </td>

import mechanize
import re

htmlA1 = br.response().read()

patAttackDate = re.compile('<td align=center>(\d+/\d+/\d+) (\d+:\d+:\d+ \w+)')
patAttackName = re.compile('(\w+)</a> hospitalized ')

searchAttackDate = re.findall(patAttackDate, htmlA1)
searchAttackName = re.findall(patAttackName, htmlA1)

pairs = zip(searchAttackDate, searchAttackName)
for i in pairs:
    print (i)

(('19/11/11', '9:47:51 PM'), 'user1') <- mismatch
(('19/11/11', '8:21:18 PM'), 'user1') <- mismatch
(('19/11/11', '7:33:00 PM'), 'user1') <- As a consequence of the below, the rest upwards are mismatched
(('19/11/11', '7:32:38 PM'), 'user2') <- NOT a match, case B
(('19/11/11', '7:32:22 PM'), 'user2') <- match ok
(('19/11/11', '7:26:53 PM'), 'user2') <- match ok
(('19/11/11', '7:25:24 PM'), 'user3') <- match ok
(('19/11/11', '7:24:22 PM'), 'user3') <- match ok
(('19/11/11', '7:23:25 PM'), 'user3') <- match ok
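One plausible reading of the mismatch above is that the date pattern matches every row while the name pattern matches only the "hospitalized" rows, so the two lists differ in length and `zip` pairs them purely by position. The sketch below reproduces that effect on a hypothetical two-row sample modelled on the rows quoted earlier; the sample text and variable names are assumptions, and the code uses Python 3 syntax.

```python
import re

# Hypothetical two-row sample modelled on the rows quoted in the question:
# one "attacked" row followed by one "hospitalized" row.
html = (
    '<td align=center>19/11/11 12:58:03 AM</td> '
    '<td align=center><a href=profiles.php?XID=3>player3</a> attacked '
    '<a href=profiles.php?XID=1>player1</a> and lost </td>\n'
    '<td align=center>19/11/11 12:01:21 AM</td> '
    '<td align=center><a href=profiles.php?XID=1>player1</a> hospitalized '
    '<a href=profiles.php?XID=2>player2</a></td>\n'
)

# The date pattern matches both rows ...
dates = re.findall(r'<td align=center>(\d+/\d+/\d+) (\d+:\d+:\d+ \w+)', html)
# ... but the name pattern matches only the "hospitalized" row.
names = re.findall(r'(\w+)</a> hospitalized ', html)

print(len(dates), len(names))

# zip pairs by position and stops at the shorter list, so the date of the
# "attacked" row ends up next to the player from the "hospitalized" row.
pairs = list(zip(dates, names))
print(pairs)
```

Because the row association is lost as soon as the two `findall` calls run separately, no amount of post-processing on `pairs` can recover it reliably.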

import mechanize
import re
from BeautifulSoup import BeautifulSoup

htmlA1 = br.response().read()
stripped = htmlA1.replace(">\n<","><") #Removed all '\n' from code
soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #this is the table I need to work with

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+) (\d+:\d+:\d+ \w+)')
searchAttackDate = re.findall(patAttackDate, table3)
print searchAttackDate

Test = '''<table><tr><td>date</td></tr></table>'''
soupTest = BeautifulSoup(Test)
test2 = soupTest.find('table')
patTest = re.compile('<td>(.*)</td>')
searchTest = patTest.findall(test2.getText())
print test2 # gives: <table><tr><td>date</td></tr></table>
print type(test2) # gives: <class 'BeautifulSoup.Tag'>
print searchTest #gives: []

import re
import mechanize
from BeautifulSoup import BeautifulSoup

htmlA1 = br.response().read()
stripped = htmlA1.replace(">\n<","><") #stripped '\n' from html
soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #table I need to work with
print type(table3) # gives <class 'BeautifulSoup.Tag'>

strTable3 = str(table3) #convert table3 to string type so i can regex it

patFinal = re.compile(('(\d+/\d+/\d+) (\d+:\d+:\d+ \w+)</td><td align="center">'
                       '<a href="profiles.php\?XID=(\d+)">'
                       '(\w+)</a> hospitalized <a'), re.IGNORECASE)
searchFinal = re.findall(patFinal, strTable3)
for i in searchFinal:
    print (i)

('19/11/11', '1:08:07 AM', 'ID_user1', 'user1')
('19/11/11', '1:06:55 AM', 'ID_user1', 'user1')
('19/11/11', '1:05:46 AM', 'ID_user1', 'user1')
('19/11/11', '1:04:33 AM', 'ID_user1', 'user1')
('19/11/11', '1:03:32 AM', 'ID_user1', 'user1')
('19/11/11', '1:02:37 AM', 'ID_user1', 'user1')
('19/11/11', '1:00:43 AM', 'ID_user1', 'user1')
('19/11/11', '12:55:35 AM', 'ID_user2', 'user2')

import re

reAttack = (r'<td\s+align=center>(\d+/\d+/\d+) (\d+:\d+:\d+\s+\w+)</td>\s*'
            '<td.*?' #accounts for the '\n'
            '<font\s+color=#006633>(\w+)</a>\s+hospitalized\s+')

for m in re.finditer(reAttack, htmlA1):
    print 'date: %s; time: %s; player: %s' % (m.group(1), m.group(2), m.group(3))

date: 19/11/11; time: 1:08:07 AM; player: user1
date: 19/11/11; time: 1:06:55 AM; player: user1
date: 19/11/11; time: 1:05:46 AM; player: user1
date: 19/11/11; time: 1:04:33 AM; player: user1
date: 19/11/11; time: 1:03:32 AM; player: user1
date: 19/11/11; time: 1:02:37 AM; player: user1
date: 19/11/11; time: 1:00:43 AM; player: user1
date: 19/11/11; time: 12:55:35 AM; player: user2

3 answers

User

#1 · edited 2024-04-19 02:33:33

This works for me:

reAttack = r'<td\s+align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+\s+\w+)</td>\s*<td.*?<font\s+color=#006633>(\w+)</font></a>\s+hospitalized\s+'

for m in re.finditer(reAttack, htmlA1):
  print 'date: %s; time: %s; player: %s' % (m.group(1), m.group(2), m.group(3))

live demo

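Doing everything in one regex makes the regex messier, but it is much easier than matching each TD separately and trying to synchronize them afterwards. The .*? near the middle of the regex assumes that all elements are separated by newlines, as in your original sample: since . does not match a newline by default, the match cannot run on into the next row. If you cannot assume that, replace .*? with (?:(?!/?td>).)* so that the match stays inside the current TD element.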
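To illustrate the difference between `.*?` and `(?:(?!/?td>).)*`, here is a minimal sketch on a hypothetical sample in which both rows sit on a single line. The sample markup is simplified (no `<font>` tags), the names `head`, `tail`, `loose` and `tempered` are made up for the illustration, and the code uses Python 3 syntax.

```python
import re

# Hypothetical sample: both rows are on ONE line, so a plain .*? is free to
# run from the first row's date cell into the second row's player cell.
html = (
    '<td align=center>19/11/11 12:58:03 AM</td>'
    '<td align=center><a href=profiles.php?XID=3>player3</a> attacked '
    '<a href=profiles.php?XID=1>player1</a> and lost </td>'
    '<td align=center>19/11/11 12:01:21 AM</td>'
    '<td align=center><a href=profiles.php?XID=1>player1</a> hospitalized '
    '<a href=profiles.php?XID=2>player2</a></td>'
)

# Date cell followed by the start of the next cell.
head = r'<td\s+align=center>(\d+/\d+/\d+) (\d+:\d+:\d+\s+\w+)</td>\s*<td'
# Player name directly before "hospitalized"; the leading '>' anchors the
# name to the end of the opening <a ...> tag.
tail = r'>(\w+)</a>\s+hospitalized\s+'

loose = re.compile(head + r'.*?' + tail)
tempered = re.compile(head + r'(?:(?!/?td>).)*' + tail)

# The loose pattern pairs the "attacked" row's time with the player from
# the following "hospitalized" row.
print(loose.findall(html))
# The tempered pattern cannot cross a </td>, so only the real
# "hospitalized" row matches.
print(tempered.findall(html))
```

The tempered version refuses to consume any character that begins `td>` or `/td>`, which is what keeps the match inside one cell even when the page has no line breaks.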

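FYI, your sample data is somewhat inconsistent. Some attribute values are quoted while most are not, and you mix <br> and <br /> tags. I normalized everything for my demo, but if this is representative of your real data, you will need a more complex regex. Or you could switch to a pure DOM solution, which might have been easier in the first place. ;)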
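As a rough idea of what a more tolerant regex could look like, the sketch below accepts both `align=center` and `align="center"`, and allows whitespace, `<br>` or `<br />` between date and time. The sample rows are hypothetical, built from the variants that appear in this thread, and real pages may contain further variations that this pattern does not cover. Python 3 syntax.

```python
import re

# Tolerates optional quotes around the attribute value and three
# separators between date and time: whitespace, <br>, or <br />.
cell = re.compile(
    r'<td\s+align="?center"?>'
    r'(\d+/\d+/\d+)'
    r'(?:\s+|<br\s*/?>)'
    r'(\d+:\d+:\d+\s+\w+)</td>'
)

# Hypothetical rows, one per variant seen in the thread.
rows = (
    '<td align=center>19/11/11 12:01:21 AM</td>\n'
    '<td align="center">19/11/11<br />12:58:03 AM</td>\n'
    '<td align=center>19/11/11<br>1:08:07 AM</td>\n'
)

found = cell.findall(rows)
for date, time in found:
    print(date, time)
```

Each optional piece makes the pattern harder to read, which is the trade-off the answer alludes to when it suggests a DOM-based approach instead.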

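User

#2 · edited 2024-04-19 02:33:33

From your description, I have not quite figured out what you are trying to do. But I can tell you one thing right now: for regular expressions, Python raw strings are your friend.

Try using r'pattern' instead of just 'pattern' in your BeautifulSoup program.

Also, when working with regular expressions, it is sometimes valuable to start with simple patterns, verify that they work, and then build them up. You have gone straight to complex patterns, and I am sure they do not work because you are using backslashes without raw strings.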
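A small sketch of why raw strings matter, using `\b` as the example because Python itself interprets it in ordinary string literals. Escapes that Python does not define, such as `\d`, are passed through unchanged (newer Python 3 versions warn about them), so the visible damage is limited to sequences like `\b` that Python does interpret. The sample text is made up; Python 3 syntax.

```python
import re

text = 'a cat sat'

# In an ordinary string literal, '\b' is the backspace character (0x08),
# so the regex engine never sees a word-boundary assertion.
plain = '\bcat\b'
# In a raw string literal the backslash is kept, and the regex engine
# reads \b as "word boundary".
raw = r'\bcat\b'

print(len(plain), len(raw))
print(re.findall(plain, text))
print(re.findall(raw, text))
```

Building the pattern up from `r'cat'` to `r'\bcat\b'` and checking the result at each step is the kind of incremental approach the answer recommends.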

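User

#3 · edited 2024-04-19 02:33:33

The .findNext() method returns a BeautifulSoup.Tag object, which cannot be passed to re.findall. So you need to use .getText() (or a similar method) to get the text from the Tag object, or .contents to get the HTML inside that tag. Also, when you use re.compile, the returned object is the one you should call findall on.

This: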

soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #this is the table I need to work with

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = re.findall(patAttackDate, table3)

Should be written like this (the last line is the only thing that needs to change):

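soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #this is the table I need to work with

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = patAttackDate.findall(table3.getText())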

BeautifulSoup Documentation

From the re docs:

re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object.
This:
result = re.match(pattern, string)
is equivalent to:
prog = re.compile(pattern)
result = prog.match(string)
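The same equivalence holds for findall. The sketch below, on a made-up sample string, shows the module-level call, the method on the compiled object, and the module-level call given an already compiled pattern, which Python's re module also accepts. Python 3 syntax.

```python
import re

# Made-up sample text containing two date/time stamps.
text = '19/11/11 12:01:21 AM and 19/11/11 12:58:03 AM'
pattern = r'(\d+/\d+/\d+) (\d+:\d+:\d+ \w+)'

prog = re.compile(pattern)

via_module = re.findall(pattern, text)   # pattern given as a string
via_object = prog.findall(text)          # method on the compiled object
via_mixed = re.findall(prog, text)       # compiled object passed to re.findall

print(via_module)
print(via_module == via_object == via_mixed)
```

What all three forms require is that the second argument is a string, which is why a BeautifulSoup Tag has to be converted first.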
