Scrape text after the second <BR>

Posted 2024-04-25 07:54:53

I have the HTML excerpt below. Note that these two td elements repeat for every row I need to capture.

<table class="ent">
<tbody class=""><tr class="tablestyle">

    <td class="hide_on_mobile">  <a href="../" class="">
        <img class="ProductImage" src="https://.."></a>
    </td>
    <td class="hide_on_mobile" align="center">
        <strong class="">
            <span style="font-size:1.4em;" class="">Scraped okay - col0</span>
                <br>
                <br>Scrape this text - col1</strong><br>
                <br><i><span style="color:indigo;" class="">Scrape this text - col2
                <br class="">
                <br>Next Event: Scrape this text -col3</span></i>
    </td>

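I need to capture 4 separate pieces of data: col0, col1, col2 and col3.

I already have col0 working. I still need to capture col1, col2 and col3.

I am trying to use the BR tags, i.e. the ones after the span:

Take the text after the 2nd BR as col1

Take the text after the 3rd BR as col2

Take the text after the 5th BR as col3

I can't get col1 to work with br > br. Is there a way to solve this?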

import sqlite3
import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http:/*"

r = requests.get(url)
source = r.text
t = datetime.datetime.now().date()
soup = BeautifulSoup(source, "lxml")

row_count=200

row_marker = 0

new_table = pd.DataFrame(columns = ["col0", "col1", "col2","col3", "DateAdded"], index = range(0,row_count)) # I don't know the number of rows

# For col0
column_marker = 0
for layout in soup.select("strong > span"):
            new_table.iat[row_marker,column_marker] = layout.text.strip()
            new_table.iat[row_marker,4] = t
            row_marker +=1

# For col 1

column_marker = 1
row_marker = 0
for layout in soup.select("strong > span > br > br"):
            new_table.iat[row_marker,column_marker] = layout.text.strip()
            row_marker +=1
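
A likely reason the col1 selector returns nothing: `<br>` is a void element, so it never has children, and `>` only matches direct children. In the excerpt the `<br>` tags are also siblings of the `span`, not nested inside it. The sketch below, which is not part of the original post, checks this against a trimmed copy of the second `td` and reads the text node that follows each `<br>` through `next_sibling`. The names `html`, `matches` and `texts_after_br` are illustrative, and `html.parser` is used here only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the second <td> from the question.
html = """
<td class="hide_on_mobile" align="center">
    <strong class="">
        <span style="font-size:1.4em;" class="">Scraped okay - col0</span>
            <br>
            <br>Scrape this text - col1</strong><br>
            <br><i><span style="color:indigo;" class="">Scrape this text - col2
            <br class="">
            <br>Next Event: Scrape this text -col3</span></i>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# <br> cannot contain other elements, so this selector matches nothing.
matches = soup.select("strong > span > br > br")
print(len(matches))

# Collect the text node that directly follows each <br>, if there is one.
texts_after_br = []
for br in soup.find_all("br"):
    sibling = br.next_sibling
    # Skip tags (e.g. <i>) and whitespace-only text nodes.
    if isinstance(sibling, NavigableString) and sibling.strip():
        texts_after_br.append(sibling.strip())

print(texts_after_br)
```

With this excerpt, only col1 and col3 directly follow a `<br>`. col2 is the leading text of the inner `span`, so counting `<br>` tags alone would not reach it.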

1 answer
User
#1 · Posted 2024-04-25 07:54:53
# since you said there are multiple <tr> rows
# `soup` is the BeautifulSoup object built in the question
trs = soup.find_all('tr')


for tr in trs:
    l = []
    td = tr.find_all('td')
    # the first td never holds data, according to the question
    for tags in td[1]:
        try:
            if tags.text:
                l.extend(tags.text.split('\n'))
        except AttributeError:
            # some child nodes may not expose .text
            pass

# once there are more trs, keep the code below inside the loop,
# then store the data in a DataFrame, since each iteration gives a new list
# collapse runs of whitespace and drop entries that end up empty
str_data = [' '.join(s.split()) for s in l if s.strip()]
print(str_data)

Output

['Scraped okay - col0',
 'Scrape this text - col1',
 'Scrape this text - col2',
 'Next Event: Scrape this text -col3']
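
As an alternative sketch under the same assumptions (the data always sits in the second `td` of each `tr`), Beautiful Soup's `stripped_strings` generator yields every non-empty text node already stripped, which removes the need to split on newlines and filter blanks by hand. The helper name `row_texts` and the inline `html` sample are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# One complete row built from the excerpt in the question.
html = """
<table class="ent">
<tbody class=""><tr class="tablestyle">
    <td class="hide_on_mobile">  <a href="../" class="">
        <img class="ProductImage" src="https://.."></a>
    </td>
    <td class="hide_on_mobile" align="center">
        <strong class="">
            <span style="font-size:1.4em;" class="">Scraped okay - col0</span>
                <br>
                <br>Scrape this text - col1</strong><br>
                <br><i><span style="color:indigo;" class="">Scrape this text - col2
                <br class="">
                <br>Next Event: Scrape this text -col3</span></i>
    </td>
</tr></tbody></table>
"""


def row_texts(tr):
    """Return the cleaned text pieces from the second <td> of a row."""
    tds = tr.find_all("td")
    if len(tds) < 2:
        # Row does not have the expected layout; nothing to extract.
        return []
    # stripped_strings skips whitespace-only nodes and strips the rest;
    # the join/split also collapses any inner runs of whitespace.
    return [" ".join(s.split()) for s in tds[1].stripped_strings]


soup = BeautifulSoup(html, "html.parser")
rows = [row_texts(tr) for tr in soup.find_all("tr")]
print(rows)
```

Each inner list lines up with col0 to col3, so rows that come back with exactly four items could be passed to `pd.DataFrame(rows, columns=["col0", "col1", "col2", "col3"])` instead of pre-allocating 200 rows.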
