我有一个html页面,它包含一个表,我想获取该表中td,tr中的所有值。
我试过使用beautifulsoup,但现在我想使用python来处理lxml或HML解析器。
我已附上示例。
我想获取作为元组列表的值
[
[( value of 2050 jan, value of main subject-part1-sub part1-subject1 ), ( value of 2050 feb, value of main subject-part1-sub part1-subject1 ),... ],
[( value of 2050 jan, value of main subject-part1-sub part1-subject2 ), ( value of 2050 feb, value of main subject-part1-sub part1-subject2 )... ]
]
等等。
有人能告诉我如何使用lxml或HTML python解析器以非常“最佳”的方式处理这个问题吗?
示例:test.html
<HTML>
<HEAD>
<TITLE>Title</TITLE>
</HEAD>
<BODY>
<TABLE BORDER>
<TR ALIGN=LEFT>
<TH COLSPAN=38>Main Subject</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=18>part1</TH>
<TH VALIGN=TOP COLSPAN=18>part2</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=9>sub-part1</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part2</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part3</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part4</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=1>subject1</TH>
<TH VALIGN=TOP COLSPAN=1>subject2</TH>
<TH VALIGN=TOP COLSPAN=1>subject10</TH>
<TH VALIGN=TOP COLSPAN=1>subject11</TH>
<TH VALIGN=TOP COLSPAN=1>subject12</TH>
<TH VALIGN=TOP COLSPAN=1>subject13</TH>
<TH VALIGN=TOP COLSPAN=1>subject14</TH>
<TH VALIGN=TOP COLSPAN=1>subject15</TH>
<TH VALIGN=TOP COLSPAN=1>subject16</TH>
<TH VALIGN=TOP COLSPAN=1>subject17</TH>
<TH VALIGN=TOP COLSPAN=1>subject18</TH>
<TH VALIGN=TOP COLSPAN=1>subject19</TH>
<TH VALIGN=TOP COLSPAN=1>subject20</TH>
<TH VALIGN=TOP COLSPAN=1>subject21</TH>
<TH VALIGN=TOP COLSPAN=1>subject22</TH>
<TH VALIGN=TOP COLSPAN=1>subject23</TH>
<TH VALIGN=TOP COLSPAN=1>subject24</TH>
<TH VALIGN=TOP COLSPAN=1>subject25</TH>
<TH VALIGN=TOP COLSPAN=1>subject26</TH>
<TH VALIGN=TOP COLSPAN=1>subject27</TH>
<TH VALIGN=TOP COLSPAN=1>subject28</TH>
<TH VALIGN=TOP COLSPAN=1>subject29</TH>
<TH VALIGN=TOP COLSPAN=1>subject30</TH>
<TH VALIGN=TOP COLSPAN=1>subject31</TH>
<TH VALIGN=TOP COLSPAN=1>subject32</TH>
<TH VALIGN=TOP COLSPAN=1>subject33</TH>
<TH VALIGN=TOP COLSPAN=1>subject34</TH>
<TH VALIGN=TOP COLSPAN=1>subject35</TH>
<TH VALIGN=TOP COLSPAN=1>subject36</TH>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT VALIGN=TOP ROWSPAN=12>2050</TH>
<TH ALIGN=LEFT>January</TH>
<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>4</TD>
<TD>16</TD>
<TD>0</TD>
<TD>6</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>3</TD>
<TD>2</TD>
<TD>0</TD>
<TD>26</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>5</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>February</TH>
<TD>1</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>2</TD>
<TD>4</TD>
<TD>1</TD>
<TD>6</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>25</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>4</TD>
<TD>14</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>March</TH>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>7</TD>
<TD>0</TD>
<TD>9</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>9</TD>
<TD>0</TD>
<TD>45</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>10</TD>
<TD>16</TD>
<TD>0</TD>
<TD>5</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>April</TH>
<TD>1</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>3</TD>
<TD>12</TD>
<TD>1</TD>
<TD>11</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>2</TD>
<TD>34</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>6</TD>
<TD>18</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>5</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>May</TH>
<TD>7</TD>
<TD>0</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>4</TD>
<TD>1</TD>
<TD>13</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>7</TD>
<TD>1</TD>
<TD>30</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>5</TD>
<TD>12</TD>
<TD>0</TD>
<TD>4</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>6</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>June</TH>
<TD>0</TD>
<TD>1</TD>
<TD>14</TD>
<TD>0</TD>
<TD>7</TD>
<TD>15</TD>
<TD>0</TD>
<TD>17</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>24</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>6</TD>
<TD>13</TD>
<TD>1</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>July</TH>
<TD>0</TD>
<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>17</TD>
<TD>1</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>2</TD>
<TD>15</TD>
<TD>2</TD>
<TD>53</TD>
<TD>0</TD>
<TD>3</TD>
<TD>3</TD>
<TD>6</TD>
<TD>0</TD>
<TD>7</TD>
<TD>16</TD>
<TD>0</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>August</TH>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>8</TD>
<TD>15</TD>
<TD>1</TD>
<TD>17</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>16</TD>
<TD>0</TD>
<TD>33</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>11</TD>
<TD>0</TD>
<TD>2</TD>
<TD>25</TD>
<TD>4</TD>
<TD>8</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>September</TH>
<TD>2</TD>
<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>16</TD>
<TD>22</TD>
<TD>2</TD>
<TD>19</TD>
<TD>4</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>8</TD>
<TD>0</TD>
<TD>27</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>11</TD>
<TD>31</TD>
<TD>1</TD>
<TD>9</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>October</TH>
<TD>3</TD>
<TD>1</TD>
<TD>8</TD>
<TD>0</TD>
<TD>4</TD>
<TD>28</TD>
<TD>0</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>15</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>9</TD>
<TD>26</TD>
<TD>1</TD>
<TD>8</TD>
<TD>4</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>November</TH>
<TD>0</TD>
<TD>3</TD>
<TD>3</TD>
<TD>0</TD>
<TD>6</TD>
<TD>23</TD>
<TD>1</TD>
<TD>8</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>
<TD>7</TD>
<TD>1</TD>
<TD>20</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>3</TD>
<TD>18</TD>
<TD>3</TD>
<TD>7</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>December</TH>
<TD>1</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>13</TD>
<TD>2</TD>
<TD>15</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>29</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>3</TD>
<TD>20</TD>
<TD>1</TD>
<TD>13</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
</TABLE>
</BODY>
</HTML>
我无法添加评论,但可能对其他人有帮助:
我在表格单元格中有一些粗体和斜体文本,因此
c.text
返回了None
。我用c.text_content()
来代替:这样的做法应该管用:
您可以使用
zip()
来转置表:相关问题 更多 >
编程相关推荐