pd.read_html changes the number format

Posted 2024-05-16 13:15:26


After parsing the table with pd.read_html, the values 1,2,3,4,5,6 in column CCCCCCC come back as 123456. I expect them to stay 1,2,3,4,5,6.

HTML code

html = """<html>
<body>
<div id="MMMMMMMM" class="MMMMMMMMMMM" style="">
        <table class="OOOOOOOO" style="">
            <thead>
                <tr class="PPPPPPPPPP">
                    <td colspan="3" style="font-size:14px;font-weight:bold;" class="QQQQQQQQQQ">AAAAAAA</td>
                </tr>
                <tr class="RRRRRRRRRR">
                    <td>BBBBBB</td>
                    <td>CCCCCCC</td>
                    <td>AAAAAAA</td>
                </tr>
            </thead>
            <tbody>
                    <tr class="SSSSSSSS">
                        <td rowspan="1">DDDDDD</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="3">EEEEEEEEE</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                        <tr class="">
                            <td class="L_LLLL67">1,2,3,4,5,6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
                        <tr class="">
                            <td class="L_LLLL67">1,2,3,4,5,6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
                    <tr class="">
                        <td rowspan="1">FFFFFFFFF</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTTT">
                        <td rowspan="1">GGGGGGGGG</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="1">HHHHHHHHH</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTTTT">
                        <td rowspan="1">IIIIIIIIII</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="1">JJJJJJJJ</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTT">
                        <td rowspan="2">KKKKKKKK</td>
                        <td class="L_LLLL67">1/2/3/4/5/6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                        <tr class="TTTTTT">
                            <td class="L_LLLL67">1/2/3/4/5/6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
            </tbody>
        </table>
</div>
</body>
</html>"""

Python code

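from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', attrs={'id': 'MMMMMMMM'})
# Wrap the markup in StringIO: pandas 2.1+ deprecates passing literal HTML strings.
df_list = pd.read_html(StringIO(str(table)), header=1)
df_list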

Actual result

 [        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD       123456  1234.56
 1    EEEEEEEEE       123456  1234.56
 2    EEEEEEEEE       123456  1234.56
 3    EEEEEEEEE       123456  1234.56
 4    FFFFFFFFF       123456  1234.56
 5    GGGGGGGGG       123456  1234.56
 6    HHHHHHHHH       123456  1234.56
 7   IIIIIIIIII       123456  1234.56
 8     JJJJJJJJ       123456  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]

Expected result

 [        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD  1,2,3,4,5,6  1234.56
 1    EEEEEEEEE  1,2,3,4,5,6  1234.56
 2    EEEEEEEEE  1,2,3,4,5,6  1234.56
 3    EEEEEEEEE  1,2,3,4,5,6  1234.56
 4    FFFFFFFFF  1,2,3,4,5,6  1234.56
 5    GGGGGGGGG  1,2,3,4,5,6  1234.56
 6    HHHHHHHHH  1,2,3,4,5,6  1234.56
 7   IIIIIIIIII  1,2,3,4,5,6  1234.56
 8     JJJJJJJJ  1,2,3,4,5,6  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]

1 answer
User
#1 · Posted 2024-05-16 13:15:26

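You need to pass the thousands parameter and set it to None. Its default is ',', so pd.read_html treats the commas in 1,2,3,4,5,6 as thousands separators and strips them, which yields 123456.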

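from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', attrs={'id': 'MMMMMMMM'})
# thousands=None disables the default ',' thousands-separator handling.
# StringIO avoids the literal-HTML deprecation warning in pandas 2.1+.
df_list = pd.read_html(StringIO(str(table)), header=1, thousands=None)
df_list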
Output:
[        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD  1,2,3,4,5,6  1234.56
 1    EEEEEEEEE  1,2,3,4,5,6  1234.56
 2    EEEEEEEEE  1,2,3,4,5,6  1234.56
 3    EEEEEEEEE  1,2,3,4,5,6  1234.56
 4    FFFFFFFFF  1,2,3,4,5,6  1234.56
 5    GGGGGGGGG  1,2,3,4,5,6  1234.56
 6    HHHHHHHHH  1,2,3,4,5,6  1234.56
 7   IIIIIIIIII  1,2,3,4,5,6  1234.56
 8     JJJJJJJJ  1,2,3,4,5,6  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]
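
To see the difference in isolation, here is a minimal sketch that parses the same cells with and without thousands=None. The two-row SAMPLE_HTML table and the read_first_table helper are made up for illustration and are not part of the question's code; it also assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table, cut down from the HTML in the question.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><td>BBBBBB</td><td>CCCCCCC</td><td>AAAAAAA</td></tr>
  </thead>
  <tbody>
    <tr><td>DDDDDD</td><td>1,2,3,4,5,6</td><td>1234.56</td></tr>
    <tr><td>KKKKKKKK</td><td>1/2/3/4/5/6</td><td>1234.56</td></tr>
  </tbody>
</table>
"""


def read_first_table(html_text, **kwargs):
    """Illustrative helper: parse html_text and return its first table."""
    return pd.read_html(StringIO(html_text), **kwargs)[0]


# Default behaviour: ',' is treated as a thousands separator and removed.
default_df = read_first_table(SAMPLE_HTML)

# thousands=None: the commas are left alone and the cell stays a string.
raw_df = read_first_table(SAMPLE_HTML, thousands=None)

print(default_df["CCCCCCC"].tolist())
print(raw_df["CCCCCCC"].tolist())
```

Only cells that look like numbers are affected by the separator handling, which is why the 1/2/3/4/5/6 rows come through unchanged in both cases.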
