Beautiful Soup line matching

Data Engineer

Asked 2025-04-17 02:59

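I am trying to create an HTML table that contains only the header row and the one row I care about. The site I want to use is http://wolk.vlan77.be/~gerben.

I want to fetch the table header and my own table entry, so that I don't have to search for my name every time.

What I want to do:

Fetch the HTML page
Parse it to get the table header
Parse it to get the row that is relevant to me (that is, the row containing lucas)
Build an HTML page that shows the header and the entry relevant to me

What I am doing now:

First get the header with BeautifulSoup
Get my entry
Add both of them to an array

Pass this array to a method that generates a string which can be printed as an HTML page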
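The last step above (turning the collected header and row into a printable HTML string) can be sketched on its own. This is only a minimal Python 3 sketch: the function name `build_page`, the page title, and the sample `<tr>` strings are hypothetical, and it assumes the header and the row are already available as serialized `<tr>...</tr>` markup.

```python
import html


def build_page(rows, title="My status"):
    """Wrap already-serialized <tr> strings in a minimal HTML page."""
    body = "\n".join(rows)
    return (
        "<html><head><title>{}</title></head>\n"
        "<body><table>\n{}\n</table></body></html>"
    ).format(html.escape(title), body)


# Hypothetical sample data; the real header cells are not shown in the question.
header = "<tr><th>host</th><th>check 1</th></tr>"
row = "<tr><td>lucas.vlan77.be</td><td>V</td></tr>"

page = build_page([header, row])
print(page)
```

Only the title is escaped here, because the rows are expected to be markup taken from the parsed page rather than plain text.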

def downloadURL(self):
    global input
    filehandle = self.urllib.urlopen('http://wolk.vlan77.be/~gerben')
    input = ''
    for line in filehandle.readlines():
        input += line
    filehandle.close()

def soupParserToTable(self,input):
    global header

    soup = self.BeautifulSoup(input)
    header = soup.first('tr')
    tableInput='0'

    table = soup.findAll('tr')
    for line in table:
        print line
        print '\n \n'
        if '''lucas''' in line:
            print 'true'
        else:
            print 'false'
        print '\n \n **************** \n \n'

I want to get the row containing lucas from the HTML file, but when I run it like this, the output is:

 **************** 


<tr><td>lucas.vlan77.be</td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> </tr>



false

I don't understand why it doesn't match; the string lucas is clearly in there :/

web scraping beautiful soup HTML data parsing table extraction data filtering string matching html generation

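2 Answers

That is because line is not a string but a BeautifulSoup.Tag instance. Printing it shows its markup, but the in operator on a Tag tests its child nodes rather than doing a substring search over that markup. You can try getting the value of the td instead: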

if '''lucas''' in line.td.string:
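The difference between testing the Tag and testing its text can be reproduced in isolation. The following is a hedged sketch, not the answerer's code: it assumes Python 3 with Beautiful Soup 4 (`bs4`) rather than the older `BeautifulSoup` module used in the question, and the `SAMPLE` markup, including its header cells, is made up to resemble the row printed above. It uses `get_text()` on the whole row, which also avoids failing on rows such as a `<th>` header that have no `td` at all.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical sample modelled on the row shown in the question.
SAMPLE = (
    "<table>"
    "<tr><th>host</th><th>a</th><th>b</th><th>c</th></tr>"
    "<tr><td>other.vlan77.be</td><td>V</td><td>V</td><td>V</td></tr>"
    "<tr><td>lucas.vlan77.be</td><td>V</td><td>V</td><td>V</td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = soup.find_all("tr")

# `in` on a Tag looks through its child nodes, so a substring never matches.
tag_matches = [r for r in rows if "lucas" in r]

# Testing the row's text content is what the question intended.
text_matches = [r for r in rows if "lucas" in r.get_text()]

print(len(tag_matches))
print(text_matches[0].td.string)
```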

Answered 2025-04-17 by Python Master

It looks like you are making this more complicated than it needs to be.

Here is a simpler version:

>>> import BeautifulSoup
>>> import urllib2
>>> html = urllib2.urlopen('http://wolk.vlan77.be/~gerben')
>>> soup = BeautifulSoup.BeautifulSoup(html)
>>> print soup.find('td', text=lambda data: data.string and 'lucas' in data.string)
lucas.vlan77.be
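The session above is Python 2 with the old `BeautifulSoup` module and returns only the matching text. A rough Python 3 equivalent, assuming Beautiful Soup 4 (`bs4`, version 4.4.0 or later for the `string=` argument), might look like the sketch below. The `SAMPLE` markup is hypothetical and stands in for the downloaded page, and `find_parent` is used to climb from the matching cell to the full row that the question asks for.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical stand-in for the page fetched from the URL in the question.
SAMPLE = (
    "<table>"
    "<tr><th>host</th><th>a</th></tr>"
    "<tr><td>other.vlan77.be</td><td><span>V</span></td></tr>"
    "<tr><td>lucas.vlan77.be</td><td><span>V</span></td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# First <td> whose text contains "lucas"; the string may be None for some cells.
cell = soup.find("td", string=lambda s: s is not None and "lucas" in s)

# Climb to the enclosing row, and take the first row as the header.
row = cell.find_parent("tr")
header = soup.find("tr")

print(header)
print(row)
```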

Answered 2025-04-17 by Python Master
