从HTML选项卡提取数据

2024-03-29 11:13:49 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在寻找一种在linux shell环境中从HTML获取特定信息的方法。

这是我感兴趣的一点:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

我想把它们存储在shell变量中,或者把它们放在从上面的html中提取的键值对中。示例:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on..

我现在能做的是创建一个java程序,它将使用sax解析器或jsoup之类的html解析器来提取这些信息。

但是,在这里使用java似乎需要在要执行的“包装器”脚本中包含可运行jar。

我确信一定有“shell”语言可以做同样的事情,例如perl、python、bash等

我的问题是我对这些没有经验,有人能帮我解决这个“相当容易”的问题吗

快速更新:

我忘了说,我在.html文档中有更多的表和行对不起(清晨)。

更新#2:

因为我没有根访问权限,所以试图安装这样的Bsoup:

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py # (paste code from ) Tichodromas' answer, just in case this (http://pastebin.com/4Je11Y9q) is what I pasted
$ run file (python htmlParse.py)

错误:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

更新#3:

运行Tichodromas的答案获取此错误:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

有什么想法吗?


Tags: infrompytimehtmltabletarshell
3条回答

使用BeautifulSoup4编辑:并适当跳过的Python解决方案。Edit3:使用class="details"选择table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets

结果如下:

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

Edit2:若要生成所需的输出,请使用以下内容:

for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])

结果:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms

下面是最重要的答案,适合Python3的兼容性,并通过去除细胞中的空白来改进:

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(s, 'html.parser')
table = soup.find("table")

# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]

print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)

假设html代码存储在mycode.html文件中,下面是一种bash方法:

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

注意:输出没有完全对齐

相关问题 更多 >