Getting the contents of tr in tbody

<table class="table table-bordered adoption-status-table">
  <thead>
    <tr>
      <th>Extent of IFRS application</th>
      <th>Status</th>
      <th>Additional Information</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>IFRS Standards are required for domestic public companies</td>
      <td> </td>
      <td></td>
    </tr>
    <tr>
      <td>IFRS Standards are permitted but not required for domestic public companies</td>
      <td> <img src="/images/icons/tick.png" alt="tick"> </td>
      <td>Permitted, but very few companies use IFRS Standards.</td>
    </tr>
    <tr>
      <td>IFRS Standards are required or permitted for listings by foreign companies</td>
      <td> </td>
      <td></td>
    </tr>
    <tr>
      <td>The IFRS for SMEs Standard is required or permitted</td>
      <td> <img src="/images/icons/tick.png" alt="tick"> </td>
      <td>The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.</td>
    </tr>
    <tr>
      <td>The IFRS for SMEs Standard is under consideration</td>
      <td> </td>
      <td></td>
    </tr>
  </tbody>
</table>

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")

gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ", len(gdp))

table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]
body_rows = body[1:]

headings = []
for item in head.find_all("th"):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)

all_rows = []
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        row.append(aa)
    all_rows.append(row)

df = pd.DataFrame(data=all_rows, columns=headings)

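2 answers

User

Answer 1 · edited 2024-04-26 17:59:59

The answer above is good. Another option is to use pandas.read_html to extract the table into a DataFrame, then use an lxml XPath query to fill in the missing Status column (you can use BeautifulSoup for this too if you prefer, but it is more verbose):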

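import io

import pandas as pd
import requests
from lxml import html

r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay")

# read_html returns a list of DataFrames; the adoption table is the first one.
# Recent pandas versions expect a file-like object rather than a literal HTML string.
table = pd.read_html(io.StringIO(r.text))[0]

# The Status cells hold only an <img>, so read_html leaves that column empty.
# Mark a row True when its second <td> has an <img> child.
tree = html.fromstring(r.content)
table["Status"] = [bool(t.xpath("img")) for t in tree.xpath('//table/tbody/tr/td[2]')]
print(table)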
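
To see why the XPath check works without hitting the network, here is a minimal sketch that runs the same query against a trimmed copy of the table from the question. The `SNIPPET` string and its two-row selection are assumptions made for illustration; the live page may contain more markup around the table.

```python
from lxml import html

# Trimmed copy of the question's table (first two body rows only),
# embedded here so the example does not depend on the live page.
SNIPPET = """
<table class="table table-bordered adoption-status-table">
  <thead>
    <tr><th>Extent of IFRS application</th><th>Status</th><th>Additional Information</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>IFRS Standards are required for domestic public companies</td>
      <td> </td>
      <td></td>
    </tr>
    <tr>
      <td>IFRS Standards are permitted but not required for domestic public companies</td>
      <td> <img src="/images/icons/tick.png" alt="tick"> </td>
      <td>Permitted, but very few companies use IFRS Standards.</td>
    </tr>
  </tbody>
</table>
"""

tree = html.fromstring(SNIPPET)

# td[2] selects the second cell of each body row. xpath("img") returns a list
# of matching child elements, and an empty list is falsy, so bool() gives the flag.
status = [bool(td.xpath("img")) for td in tree.xpath("//table/tbody/tr/td[2]")]
print(status)
```

The first row has no tick image and the second does, so the list can be assigned directly to the Status column as long as its length matches the number of rows in the DataFrame.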


User

Answer 2 · edited 2024-04-26 17:59:59

You need to look for the img element inside the td. Here is an example:

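# body_rows is the list of <tr> elements built in the question's code.
data = []
for tr in body_rows:
    cells = tr.find_all('td')
    # The Status cell has no text; a tick image marks the row as applicable.
    img = cells[1].find('img')
    # img.get avoids a KeyError if an <img> happens to lack a src attribute.
    status = img is not None and img.get('src') == '/images/icons/tick.png'

    data.append({
        # get_text(strip=True) returns '' for empty cells instead of None.
        'Extent of IFRS application': cells[0].get_text(strip=True),
        'Status': status,
        'Additional Information': cells[2].get_text(strip=True),
    })

print(pd.DataFrame(data).head())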
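
The loop above depends on `body_rows` from the question. As a self-contained sketch of the same idea, the following parses a trimmed copy of the table directly. The helper name `parse_status_rows`, the `SNIPPET` string, and the short dictionary keys are hypothetical, chosen only for this illustration, and the built-in "html.parser" backend is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table (last two body rows only).
SNIPPET = """
<table class="table table-bordered adoption-status-table">
  <thead>
    <tr><th>Extent of IFRS application</th><th>Status</th><th>Additional Information</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>The IFRS for SMEs Standard is required or permitted</td>
      <td> <img src="/images/icons/tick.png" alt="tick"> </td>
      <td>The IFRS for SMEs Standard is permitted, but very few companies use it.</td>
    </tr>
    <tr>
      <td>The IFRS for SMEs Standard is under consideration</td>
      <td> </td>
      <td></td>
    </tr>
  </tbody>
</table>
"""


def parse_status_rows(markup):
    """Return one dict per <tbody> row, with a boolean for the tick image."""
    soup = BeautifulSoup(markup, "html.parser")
    # class_ matches when any one of the element's classes equals the value.
    table = soup.find("table", class_="adoption-status-table")
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = tr.find_all("td")
        rows.append({
            "extent": cells[0].get_text(strip=True),
            # find() returns None when the cell has no <img> descendant.
            "status": cells[1].find("img") is not None,
            "info": cells[2].get_text(strip=True),
        })
    return rows


rows = parse_status_rows(SNIPPET)
for row in rows:
    print(row)
```

Iterating over the rows inside `tbody` rather than over every `tr` in the table skips the header row without slicing.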
