lxml: split by attribute?

1 vote
2 answers
1396 views
Asked 2025-04-16 19:28

I'm using lxml to scrape some HTML that looks like this:

<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a> 

I'd like to end up with the data in this format:

[ {'category': 'Football', 'title': 'Team A'},
{'category': 'Football', 'title': 'Team B'},
{'category': 'Baseball', 'title': 'Team C'},
{'category': 'Baseball', 'title': 'Team D'}]

So far, I've got:

results = []
for a in content[0].xpath('./a'):
    data = {'text': a.text}  # build a fresh dict for each link
    results.append(data)

But I don't know how to get the category name by splitting at the font-size elements while still keeping the sibling tags. Any suggestions?

Thanks!

2 Answers

1

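If you're looking for another approach (this is just one option, so don't judge me too harshly), or you can't use the lxml library, you could try this odd piece of code: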

text = """
            <a href="">Team YYY</a>
            <div align=center><a style="font-size: 1.1em">Polo</a></div>
            <div align=center><a style="font-size: 1.1em">Football</a></div>
            <a href="">Team A</a>
            <a href="">Team B</a>
            <div align=center><a style="font-size: 1.1em">Baseball</a></div>
            <a href="">Team C</a>
            <a href="">Team D</a>
            <a href="">Team X</a>
            <div align=center><a style="font-size: 1.1em">Tennis</a></div>
        """
# next variables could be modified depending on what you really need        
keyStartsWith = '<div align=center><a style="font-size: 1.1em">'
categoryStart = len(keyStartsWith)
categoryEnd = -len('</a></div>')
output = []
data = text.split('\n')    
titleStart = len('<a href="">')
titleEnd = -len('</a>')

getdict = lambda category, title: {'category': category, 'title': title}

# main loop
for i, line in enumerate(data):
    line = line.strip()
    if not line.startswith(keyStartsWith):
        continue
    category = line[categoryStart: categoryEnd]
    # collect the titles that follow, up to the next category line
    titles = []
    for following in data[i+1:]:
        following = following.strip()
        if following.startswith(keyStartsWith):
            break
        if following:
            titles.append(following[titleStart: titleEnd])
    # a category without titles still gets one entry with an empty title
    for title in titles or ['']:
        output.append(getdict(category, title))

print(output)
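
If lxml really is not an option, the same idea can probably be written with the standard library's `html.parser` instead of slicing strings by hand. This is only a sketch under the assumption that headings are links inside a `<div>` and titles are links outside one; `CategoryParser`, `parse_categories` and `sample` are names made up for this example. Unlike the code above, it does not emit an empty-title entry for a category with no teams.

```python
from html.parser import HTMLParser


class CategoryParser(HTMLParser):
    """Collect {'category', 'title'} dicts from flat heading/link markup."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.category = None   # text of the most recent heading link
        self.div_depth = 0     # greater than 0 while inside a <div>
        self.in_link = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.div_depth += 1
        elif tag == 'a':
            self.in_link = True
            self.text = []

    def handle_data(self, data):
        if self.in_link:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'div':
            self.div_depth = max(0, self.div_depth - 1)
        elif tag == 'a' and self.in_link:
            self.in_link = False
            text = ''.join(self.text).strip()
            if self.div_depth:
                # a link inside a div is treated as a category heading
                self.category = text
            elif self.category is not None:
                # links seen before any heading are skipped
                self.results.append({'category': self.category, 'title': text})


def parse_categories(markup):
    parser = CategoryParser()
    parser.feed(markup)
    parser.close()
    return parser.results


sample = """
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
"""

print(parse_categories(sample))
```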

3

I got it working with the following code:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []
current_category = None

for element in body.xpath('./*'):
    if element.tag == 'div':
        current_category = element.xpath('./a')[0].text
    elif element.tag == 'a':
        results.append({'category': current_category,
                        'title': element.text})

print(results)

It prints:

[{'category': 'Football', 'title': 'Team A'}, 
 {'category': 'Football', 'title': 'Team B'}, 
 {'category': 'Baseball', 'title': 'Team C'}, 
 {'category': 'Baseball', 'title': 'Team D'}]

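Scraping is a fragile business. Here, for example, we rely explicitly on the order and nesting of the elements. Sometimes, though, a hard-wired approach like this is good enough.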
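
To make the loop a little less brittle, one could skip links that appear before any heading and ignore divs that contain no link. The following is a minimal sketch, assuming the same flat layout as in the question; `extract_rows` and the `messy` sample are hypothetical names and data invented for illustration.

```python
import lxml.html


def extract_rows(markup):
    """Pair each top-level link with the most recent heading before it."""
    # document_fromstring always returns the <html> root, even for a fragment
    body = lxml.html.document_fromstring(markup).body
    rows = []
    category = None
    for element in body:
        if element.tag == 'div':
            links = element.xpath('./a')
            if links:  # ignore divs that do not hold a heading link
                category = links[0].text_content().strip()
        elif element.tag == 'a' and category is not None:
            # links before the first heading have no category and are skipped
            rows.append({'category': category,
                         'title': element.text_content().strip()})
    return rows


messy = """
<a href="">Orphan link</a>
<div align=center></div>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
"""

print(extract_rows(messy))
```

Whether skipping orphan links is the right call depends on the page; raising an error instead might be preferable if a missing heading signals a layout change.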


Here is another approach (leaning more on XPath), using the preceding-sibling axis:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []

for e in body.xpath('./a'):
    results.append(dict(
        category=e.xpath('preceding-sibling::div/a')[-1].text,
        title=e.text))

print(results)
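
A possible variation pushes the "nearest heading" choice into the XPath expression itself. On a reverse axis such as `preceding-sibling`, the positional predicate `[1]` selects the closest sibling, so the `[-1]` indexing in Python is no longer needed. This is a sketch on a shortened copy of the sample markup; wrapping the path in `string(...)` is my own choice here, so that a link with no preceding heading yields an empty string instead of an `IndexError`.

```python
import lxml.html

snippet = """
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
"""

body = lxml.html.document_fromstring(snippet).body

results = [
    {
        # div[1] on the preceding-sibling axis is the nearest earlier div
        'category': a.xpath('string(preceding-sibling::div[1]/a)'),
        'title': a.text,
    }
    for a in body.xpath('./a')
]

print(results)
```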
