Extracting data from HTML with split in Python

5 votes

4 answers

16761 views

Asked 2025-04-17 16:50

A page fetched from a URL contains content in the following format:

<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>                    
    <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>                    
    <strong>Gender:</strong> Male <br/>
    <strong>Language Fluency:</strong> ENGLISH <br/>                    
</p>

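I want to extract the name, surname, and the other fields from it (and I need to repeat this for many pages).

To do that, I tried the following code: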

import urllib2

url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)

start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]

start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]

print(givenName)
print(surname)

When I call source.read().split only once, it works fine. But when I call it twice, I get a "list index out of range" error.

Can anyone suggest a solution?

Tags: error handling, data extraction, web scraping, HTML parsing, list index, repeated operations

4 Answers

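If you want a quick fix, regular expressions (regex) are more useful in this situation. They can be a bit hard to learn at first, but they will pay off later.

Try this code: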

import re

# read the whole document into memory once
full_source = source.read()

NAME_RE = re.compile(r'Name:.+?>(.*?)<')
SURNAME_RE = re.compile(r'Surname:.+?>(.*?)<')

# search() takes the text only; flags such as re.MULTILINE belong in re.compile()
name = NAME_RE.search(full_source).group(1).strip()
surname = SURNAME_RE.search(full_source).group(1).strip()

For more on using regular expressions in Python, see the documentation for the `re` module.
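Since the question mentions needing every field, here is a minimal sketch of the same regex idea applied to all label/value pairs in one pass. The pattern `FIELD_RE` and the `fields` dictionary are illustrative names, not part of the original answer, and the pattern assumes each value is plain text that ends at the next tag, as in the sample markup.

```python
import re

# Sample markup copied from the question.
html = """
<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>
    <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
    <strong>Gender:</strong> Male <br/>
    <strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

# Group 1: the label inside <strong>...</strong>, up to the colon.
# Group 2: the text that follows, up to the next tag.
# (Hypothetical pattern, written for this sample layout only.)
FIELD_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>\s*([^<]*?)\s*<")

# findall() returns (label, value) tuples, which dict() accepts directly.
fields = dict(FIELD_RE.findall(html))

print(fields["Name"])
print(fields["Surname"])
```

Collecting everything into a dictionary avoids writing one pattern per field, but it is only as reliable as the page layout is consistent.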

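For a more complete solution, consider parsing the HTML with a library such as BeautifulSoup, although depending on your exact needs that may be more machinery than the task requires.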
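To give a feel for what parsing instead of pattern matching looks like, here is a rough sketch that uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup, so it runs without extra installs. The `FieldParser` class is a hypothetical helper, and it assumes every label sits inside a `<strong>` tag and is followed directly by its value.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collects pairs laid out as <strong>label:</strong> value."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_label = False  # True while inside <strong>...</strong>
        self._label = None      # label waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_label = True
            self._label = ""

    def handle_endtag(self, tag):
        if tag == "strong" and self._in_label:
            self._in_label = False
            # Drop surrounding whitespace and the trailing colon.
            self._label = self._label.strip().rstrip(":").strip()

    def handle_data(self, data):
        if self._in_label:
            self._label += data
        elif self._label and data.strip():
            # The first non-empty text after a label is taken as its value.
            self.fields[self._label] = data.strip()
            self._label = None


# Sample markup copied from the question.
html = """
<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>
    <strong>Gender:</strong> Male <br/>
</p>
"""

parser = FieldParser()
parser.feed(html)
print(parser.fields)
```

A parser reacts to tags as they are encountered, so it tolerates changes in whitespace and attribute order that would break a hand-written split or regex.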

Answered 2025-04-17 by Python大师


You are calling read() twice, and that is the problem. Call read() only once, store the result in a variable, and use that variable wherever you need the data. For example:

fetched_data = source.read()

Then later...

givenName=(fetched_data.split(start))[1].split(end)[0]

And...

surname=(fetched_data.split(start))[1].split(end)[0]

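That will work. Your code failed because the first read() consumes the entire response and leaves the stream positioned at its end. The next read() has nothing left to return, so it gives back an empty string; splitting that yields a single-element list, and indexing [1] raises the "list index out of range" error.

Take a look at the urllib2 documentation and the methods of file objects.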
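The behaviour described above can be reproduced without a network connection. The sketch below is written for Python 3 and uses `io.BytesIO` as a stand-in for the object returned by `urlopen()`, on the assumption that both behave as file-like objects with a read position. The `between` helper is a hypothetical addition that returns `None` when a marker is missing instead of raising an error.

```python
import io

# io.BytesIO stands in for the urlopen() response (an assumption for this demo).
page = (b"<p><strong>Name:</strong> Pasan <br/>"
        b"<strong>Surname: </strong> Wijesingher <br/></p>")
source = io.BytesIO(page)

first = source.read()   # the whole body
second = source.read()  # the stream is already at its end: empty bytes

print(len(first), len(second))
# Splitting empty input gives a one-element list, so indexing [1] would fail.
print(second.split(b"Surname: </strong>"))


def between(text, start, end):
    """Return the text between start and end, or None if a marker is missing."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    return value.strip() if found_end else None


html = first.decode("utf-8")  # assumption: the page is UTF-8 encoded
name = between(html, "Name:</strong>", "<br/>")
surname = between(html, "Surname: </strong>", "<br/>")
print(name, surname)
```

Reading once and passing the stored text to a small helper also suits the many-pages case, because a page with a missing field produces `None` rather than stopping the whole run.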

Answered 2025-04-17 by Python大师


You can use BeautifulSoup to parse the HTML string.

Here is some code you can try. It uses BeautifulSoup to extract the text from the HTML, then parses that text to pull out the data.

from bs4 import BeautifulSoup as bs

dic = {}
data = \
"""
    <p>
        <strong>Name:</strong> Pasan <br/>
        <strong>Surname: </strong> Wijesingher <br/>                    
        <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>                    
        <strong>Gender:</strong> Male <br/>
        <strong>Language Fluency:</strong> ENGLISH <br/>                    
    </p>
"""

soup = bs(data, "html.parser")  # name the parser explicitly
# Get the text on the html through BeautifulSoup
text = soup.get_text()

# parsing the text
lines = text.splitlines()
for line in lines:
    # check if line has ':', if it doesn't, move to the next line
    if line.find(':') == -1: 
        continue    
    # split the string at ':'
    parts = line.split(':')

    # You can add more tests here like
    # if len(parts) != 2:
    #     continue

    # stripping whitespace
    for i in range(len(parts)):
        parts[i] = parts[i].strip()    
    # adding the values to a dictionary
    dic[parts[0]] = parts[1]
    # printing the data after processing
    print('%16s %20s' % (parts[0], parts[1]))

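A small tip:
If you plan to parse HTML with BeautifulSoup, it helps to have specific attributes to look for, such as class=input or id=10. In other words, tags of the same kind should carry the same id or class.

Update
Regarding your comment, take a look at the code below. It applies the tip above and makes life (and programming) much easier.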

from bs4 import BeautifulSoup as bs

c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
    <p>
       No. 4<br>
       Private Drive,<br>
       Sri Lanka&nbsp;ON&nbsp;&nbsp;K7L LK <br>
"""
soup = bs(data, "html.parser")  # name the parser explicitly

for i in soup.find_all('div'):
    # get data using "class" attribute
    addr = ""
    if "address" in i.get("class", []):  # class is a list; empty if absent
        text = i.get_text()
        for line in text.splitlines(): # line-wise
            line = line.strip() # remove whitespace
            addr += line # add to address string
        c_addr.append(addr)

    # get data using "id" attribute
    addr = ""
    if i.get("id") == "10":  # attribute values are strings; None if absent
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)

print("id_addr")
print(id_addr)
print("c_addr")
print(c_addr)

Answered 2025-04-17 by Python大师

