How can I extract HTML tables and add a new column with a constant value taken from the preceding <strong> tag?

Posted 2024-04-23 09:48:05


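I am trying to extract a series of tables from an HTML document and add a new column with a constant value taken from the <strong> tag that serves as each table's heading. The idea is then to turn the resulting three-column table into a DataFrame. In other words, each table would get a third column in which every row equals AGO, DPK, ATK, or PMS, depending on the heading that precedes that series of tables. Below is the code I have come up with so far. Any help would be much appreciated, as I am new to both Python and HTML.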

import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser

br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")

table = br.find_all('td', class_='vc_table_cell')

for element in table:
    data = element.find('span', class_='vc_table_content')
    prod_name = br.find_all('strong')
    ago = prod_name[0].text
    dpk = prod_name[1].text
    atk = prod_name[2].text
    pms = prod_name[3].text
    if br.find('strong').text == ago:
        data.append(ago.text)
    elif br.find('strong').text == dpk:
        data.append(dpk.text)
    elif br.find('strong').text == atk:
        data.append(atk.text)
    elif br.find('strong').text == pms:
        data.append(pms.text)
    print(data.text)

df = pd.DataFrame(data)

The result I'm hoping for is to go from this

                AGO

Enterprise     Price
Coy A          $0.5/L
Coy B          $0.6/L
Coy C          $0.7/L

to the new table below as a dataframe in Pandas

Enterprise     Price            Product
Coy A          $0.5/L           AGO
Coy B          $0.6/L           AGO
Coy C          $0.7/L           AGO

and to repeat the same thing for other tables with DPK, ATK and PMS information
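The target shape described above can be illustrated with pandas alone, independent of any scraping. This is a minimal sketch, assuming each product's table has already been parsed into its own DataFrame; the `tables` dictionary and the DPK row are made up for illustration, while the AGO rows come from the example above.

```python
import pandas as pd

# Hypothetical per-product tables, keyed by the heading that preceded each one.
# The DPK values are invented for illustration only.
tables = {
    "AGO": pd.DataFrame({
        "Enterprise": ["Coy A", "Coy B", "Coy C"],
        "Price": ["$0.5/L", "$0.6/L", "$0.7/L"],
    }),
    "DPK": pd.DataFrame({
        "Enterprise": ["Coy D"],
        "Price": ["$0.4/L"],
    }),
}

# assign() broadcasts the scalar heading to every row of that table;
# concat() then stacks the tables into one three-column frame.
combined = pd.concat(
    [frame.assign(Product=name) for name, frame in tables.items()],
    ignore_index=True,
)
print(combined)
```

Assigning a scalar to a new column fills every row with that value, which is why no loop over rows is needed for the constant Product column.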

1 Answer
User
Answer 1 · Posted 2024-04-23 09:48:05

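I hope I have understood your question correctly. This script scrapes every table found on the page into a single DataFrame and saves it to a CSV file: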

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://oilpriceng.net/03-09-2019/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

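# Column lists for the final frame; `last` holds the most recent product heading
data, last = {'Enterprise':[], 'Price':[], 'Product':[]}, ''
# Walk the headings and the table rows together, in document order
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
    if tag.name == 'strong':
        last = tag.get_text(strip=True)
    else:
        a, b = tag.select('td')  # assumes exactly two cells per row
        a, b = a.get_text(strip=True), b.get_text(strip=True)
        # Skip rows with an empty first cell and the 'DEPOT PRICE' header row
        if a and b != 'DEPOT PRICE':
            data['Enterprise'].append(a)
            data['Price'].append(b)
            data['Product'].append(last)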

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')

Prints:

            Enterprise         Price Product
0            AVIDOR PH        ₦190.0     AGO
1            SHORELINK                   AGO
2    BULK STRATEGIC PH        ₦190.0     AGO
3                  TSL                   AGO
4              MASTERS                   AGO
..                 ...           ...     ...
165             CHIPET        ₦132.0     PMS
166               BOND                   PMS
167           RAIN OIL                   PMS
168               MENJ        ₦133.0     PMS
169              NIPCO  ₦ 2,9000,000     LPG

[170 rows x 3 columns]
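Because the live page may change or be unavailable, the same "remember the last heading" walk can be tried offline. The following is a sketch under stated assumptions: `SAMPLE_HTML` is invented markup that only imitates the layout the script above relies on (an `h1 strong` heading followed by rows of `td.vc_table_cell` cells), and `scrape_tables` is a hypothetical helper name. Unlike the script above, it skips rows that do not have exactly two cells instead of raising an error.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Invented markup that imitates the assumed page layout; not copied from the site.
SAMPLE_HTML = """
<h1><strong>AGO</strong></h1>
<table>
  <tr><td class="vc_table_cell">DEPOT</td><td class="vc_table_cell">DEPOT PRICE</td></tr>
  <tr><td class="vc_table_cell">Coy A</td><td class="vc_table_cell">$0.5/L</td></tr>
  <tr><td class="vc_table_cell">Coy B</td><td class="vc_table_cell">$0.6/L</td></tr>
</table>
<h1><strong>DPK</strong></h1>
<table>
  <tr><td class="vc_table_cell">DEPOT</td><td class="vc_table_cell">DEPOT PRICE</td></tr>
  <tr><td class="vc_table_cell">Coy C</td><td class="vc_table_cell">$0.7/L</td></tr>
</table>
"""


def scrape_tables(html):
    """Return one DataFrame with a Product column taken from the preceding heading."""
    soup = BeautifulSoup(html, "html.parser")
    rows, product = [], ""
    # select() returns matches in document order, so every heading
    # is visited before the table rows that follow it.
    for tag in soup.select("h1 strong, tr:has(td.vc_table_cell)"):
        if tag.name == "strong":
            product = tag.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in tag.select("td")]
        if len(cells) != 2:
            continue  # ignore rows that do not fit the two-column layout
        enterprise, price = cells
        # Drop rows without a name and the 'DEPOT PRICE' header row
        if enterprise and price != "DEPOT PRICE":
            rows.append({"Enterprise": enterprise, "Price": price, "Product": product})
    return pd.DataFrame(rows, columns=["Enterprise", "Price", "Product"])


df = scrape_tables(SAMPLE_HTML)
print(df)
```

Collecting one dictionary per row keeps the three values of a row together, so the columns cannot drift out of step the way three separately appended lists could.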

The script also writes the same rows to data.csv, which can be opened in a spreadsheet application such as LibreOffice.
