Appending a list to a column

Posted 2024-03-28 19:14:16


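I need to append the list idlist to the column named EventID in my table. The list has to be added in order, because I extracted the IDs in order from the original HTML file.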

Right now my output looks like this:

     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577924  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577924  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50
     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577925  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577925  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50

I need it to look like this:

     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577924  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577925  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50

My code:

import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
import pprint
import re

with open("htmltabletest.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    dfs = pd.read_html(soup.prettify())
    df = dfs[0]
    dfz=df.rename(columns = {'Event date  Time (local)':'EventDate'}).rename(columns = {'Event name  Venue':'EventName'}).rename(columns = {'Tickets  listed':'AmntTickets'}).rename(columns = {'Price  range':'PriceRange'}).rename(columns = {'Unnamed: 0':'EventID'})
    idlist = []
    for se in soup.find_all('span', id=re.compile(r'min')):
        se = (str(se))
        seeme1 = se.replace('<span id="se-','')
        seeme, sep, tail = seeme1.partition('-')
        idlist.append(seeme)
    for p in idlist:
        dfz = dfz.assign(EventID=p)
        print(dfz)

My HTML file (htmltabletest.html):

<table class="dataTable st-alternateRows" id="eventSearchTable">
<thead>
<tr>
<th id="th-es-rb"><div class="dt-th"> </div></th>
<th id="th-es-ed"><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th id="th-es-en"><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th id="th-es-ti"><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th id="th-es-pr"><div class="dt-th es-lastCell"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr>
</thead>
<tbody class="" id="eventSearchTbody"><tr class="even" id="r-se-103577924">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577924-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577924-eventDateTime">Thu, 10/11/2018<br/>8:20 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577924&amp;sectionId=0" id="se-103577924-eventName" target="_blank">Philadelphia Eagles at New York Giants</a></div><div id="se-103577924-venue">MetLife Stadium, East Rutherford, NJ</div></td>
<td id="se-103577924-nrTickets">6655</td>
<td class="es-lastCell nowrap" id="se-103577924-priceRange"><span id="se-103577924-minPrice">$134.50</span>  to<br/><span id="se-103577924-maxPrice">$2,222.50</span></td>
</tr><tr class="odd" id="r-se-103577925">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577925-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577925-eventDateTime">Thu, 10/11/2018<br/>8:21 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577925&amp;sectionId=0" id="se-103577925-eventName" target="_blank">PARKING PASSES ONLY Philadelphia Eagles at New York Giants</a></div><div id="se-103577925-venue">MetLife Stadium Parking Lots, East Rutherford, NJ</div></td>
<td id="se-103577925-nrTickets">929</td>
<td class="es-lastCell nowrap" id="se-103577925-priceRange"><span id="se-103577925-minPrice">$20.39</span>  to<br/><span id="se-103577925-maxPrice">$3,602.50</span></td>
</tr></tbody>
</table>

1 answer
User
Reply #1 · posted 2024-03-28 19:14:16

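If the dfz DataFrame has the same number of rows as the list idlist has items, you can remove the final for loop entirely. That loop calls assign with a single ID on each pass, which overwrites the whole column every time. Instead, you can use: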

dfz["EventID"] = idlist

import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
import pprint
import re

with open("htmltabletest.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    dfs = pd.read_html(soup.prettify())
    df = dfs[0]
    dfz=df.rename(columns = {'Event date  Time (local)':'EventDate'}).rename(columns = {'Event name  Venue':'EventName'}).rename(columns = {'Tickets  listed':'AmntTickets'}).rename(columns = {'Price  range':'PriceRange'}).rename(columns = {'Unnamed: 0':'EventID'})
    idlist = []
    for se in soup.find_all('span', id=re.compile(r'min')):
        se = (str(se))
        seeme1 = se.replace('<span id="se-','')
        seeme, sep, tail = seeme1.partition('-')
        idlist.append(seeme)
    dfz["EventID"] = idlist
    print(dfz)

You will then get the DataFrame you asked for:

     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577924  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577925  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50
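
To illustrate why the direct assignment works where the loop did not, here is a minimal sketch on a toy DataFrame. The two-row frame below is only a stand-in for dfz (an assumption for illustration); the real one comes from pd.read_html.

```python
import pandas as pd

# Toy stand-in for dfz; the real frame is produced by pd.read_html.
dfz = pd.DataFrame({"EventID": [None, None], "AmntTickets": [6655, 929]})
idlist = ["103577924", "103577925"]  # IDs in document order

# assign() with a single value broadcasts it to every row, so each
# pass of the original loop overwrote the entire column.
broadcast = dfz.assign(EventID=idlist[0])

# Assigning the list itself pairs idlist[i] with row i.
dfz["EventID"] = idlist

print(broadcast["EventID"].tolist())  # ['103577924', '103577924']
print(dfz["EventID"].tolist())        # ['103577924', '103577925']
```

Note that the list assignment raises a ValueError when the list length does not match the number of rows, which is why the equal-length condition matters.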

If the DataFrame dfz and the list idlist have different lengths, you can use the code below, which fills in the IDs row by row and tolerates the mismatch.

import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
import pprint
import re

with open("htmltabletest.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    dfs = pd.read_html(soup.prettify())
    df = dfs[0]
    dfz=df.rename(columns = {'Event date  Time (local)':'EventDate'}).rename(columns = {'Event name  Venue':'EventName'}).rename(columns = {'Tickets  listed':'AmntTickets'}).rename(columns = {'Price  range':'PriceRange'}).rename(columns = {'Unnamed: 0':'EventID'})
    idlist = []
    for se in soup.find_all('span', id=re.compile(r'min')):
        se = (str(se))
        seeme1 = se.replace('<span id="se-','')
        seeme, sep, tail = seeme1.partition('-')
        idlist.append(seeme)

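    # The column parsed from the empty radio-button cells may be numeric (NaN);
    # cast it to object so that string IDs can be stored in it.
    dfz["EventID"] = dfz["EventID"].astype(object)
    for ind, row in dfz.iterrows():
        try:
            # Use .loc on the frame itself; the chained form
            # dfz.EventID.iloc[ind] = ... may write to a temporary copy.
            dfz.loc[ind, "EventID"] = idlist[ind]
        except IndexError:
            # idlist is shorter than dfz: leave the remaining rows unchanged
            pass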
    print(dfz)
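
As an alternative to the row-by-row loop, the same positional fill can be sketched without iterrows. The helper name fill_event_ids is hypothetical, not part of pandas. Unlike the loop above, it sets rows without a matching ID to NaN rather than keeping their old value, and it drops any surplus IDs.

```python
import pandas as pd

def fill_event_ids(frame, ids):
    """Return a copy of frame with EventID filled positionally from ids.

    Hypothetical helper: surplus ids are dropped, missing ones become NaN.
    """
    out = frame.copy()
    # Position the ids at 0..len(ids)-1, then stretch or trim to the row count.
    padded = pd.Series(ids, dtype=object).reindex(range(len(out)))
    # to_numpy() assigns by position, so the frame's own index does not matter.
    out["EventID"] = padded.to_numpy()
    return out

# Toy frame with three rows but only two IDs available.
toy = pd.DataFrame({"EventID": [None, None, None], "AmntTickets": [6655, 929, 10]})
print(fill_event_ids(toy, ["103577924", "103577925"]))
```

Whether a missing ID should become NaN or keep the previous value depends on how the mismatch arises. If the lengths are expected to match, a mismatch usually means the scrape missed a row and is worth checking before filling.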
