Fetching data from URLs and putting it into a DataFrame

Posted 2024-04-26 00:52:07


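Hi everyone, I am currently trying to fetch some data from URLs and then predict which category each article should belong to. This is what I have so far, but it raises an error: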

    info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
    html, category = [], []
    for i in info.index:
        response = requests.get(info.iloc[i,0])
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>','', 
                      str(soup.findAll(['p','h1','\href="/avtorji/'])))])
        category.append(info.iloc[0,i])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

The error is:

IndexError: single positional indexer is out-of-bounds.

Can anyone help me?


2 answers

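The error is most likely caused by passing an index to iloc: loc expects index values and column names, while iloc expects the numeric positions of rows and columns. In addition, you swapped the row and column positions for category in category.append(info.iloc[0,i]). So you should at least have: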

for i in range(len(info)):
    response = requests.get(info.iloc[i,0])
    ...
    category.append(info.iloc[i,0])
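
To make the row/column mix-up concrete, here is a minimal sketch. The three-row, two-column frame is a made-up stand-in for labeled_urls.tsv, and the label sitting in column 1 is an assumption rather than something stated in the question.

```python
import pandas as pd

# Hypothetical stand-in for labeled_urls.tsv: 3 rows, 2 columns (URL, label).
info = pd.DataFrame(
    [
        ["https://example.com/a", "sport"],
        ["https://example.com/b", "politics"],
        ["https://example.com/c", "culture"],
    ]
)

# iloc takes (row position, column position).
first_url = info.iloc[0, 0]
last_label = info.iloc[2, 1]

# info.iloc[0, i] walks along the columns of row 0, so it fails
# as soon as i reaches the number of columns (2 here).
try:
    info.iloc[0, 2]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(first_url, last_label)
print(error_message)
```

With more rows than columns, a loop over the row index will always hit this IndexError once the counter passes the last column position.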

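But since you are only trying to iterate over the first column of the DataFrame, the code above is not Pythonic. It is better to use the column directly: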

for url in info.iloc[:, 0]:
    response = requests.get(url)
    ...
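
If both the URL and its label are needed, the two columns can be iterated together without any counter. This is only a sketch: the sample rows are invented, and it assumes the label lives in the second column.

```python
import pandas as pd

# Hypothetical two-column frame; the label being in column 1 is an assumption.
info = pd.DataFrame(
    [
        ["https://example.com/a", "sport"],
        ["https://example.com/b", "politics"],
    ]
)

urls, category = [], []
# Iterate the columns directly instead of indexing with a counter.
for url, label in zip(info.iloc[:, 0], info.iloc[:, 1]):
    urls.append(url)        # the real loop would call requests.get(url) here
    category.append(label)

print(category)
```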

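You can avoid the iloc call and use iterrows instead. I think you would have had to use loc rather than iloc, because you were operating on the index, but using iloc and loc inside a loop is generally not very efficient. You can try the following code (with a waiting time inserted):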

import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
html, category = [], []
for i, row in info.iterrows():
    url = row.iloc[0]  # row[column_name] works here as well (iloc is used only because the column names are unknown)
    time.sleep(2.5)  # wait 2.5 seconds
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>','', 
                  str(soup.findAll(['p','h1','\href="/avtorji/'])))])
    # the following iloc was probably raising the error, because you access the ith column in the first row of your df
    # category.append(info.iloc[0,i])
    category.append(row.iloc[0])  # not sure which field you wanted to access here, you should also replace it by row['name']

data = pd.DataFrame()
data['html'] = html
data['category'] = category
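
One possible way to keep the html and category lists aligned is to collect one record per URL and build the DataFrame once at the end. The sketch below is an illustration, not the answer's method: `build_frame`, `fetch` and `fake_fetch` are hypothetical names, the download is replaced by a stand-in so the example runs offline, and the label is assumed to be in column 1.

```python
import pandas as pd


def build_frame(info, fetch):
    """Return a DataFrame with one row per URL that could be fetched.

    `fetch` is a hypothetical callable taking a URL and returning page text;
    in real use it could wrap requests.get plus the BeautifulSoup clean-up.
    The label being in column 1 is an assumption about the TSV layout.
    """
    records = []
    for url, label in zip(info.iloc[:, 0], info.iloc[:, 1]):
        try:
            text = fetch(url)
        except Exception as exc:  # broad on purpose in this sketch
            print(f"skipping {url}: {exc}")
            continue
        records.append({"html": text, "category": label})
    return pd.DataFrame(records, columns=["html", "category"])


def fake_fetch(url):
    # Stand-in for a network call so the sketch runs offline.
    if url.endswith("/broken"):
        raise ValueError("simulated download failure")
    return f"text of {url}"


info = pd.DataFrame(
    [
        ["https://example.com/a", "sport"],
        ["https://example.com/broken", "politics"],
        ["https://example.com/c", "culture"],
    ]
)
data = build_frame(info, fake_fetch)
print(data)
```

Because each record carries both fields, a URL that fails to download is skipped as a whole and cannot shift the labels of the rows after it.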

If you really only need the URL in the loop, replace:

for i, row in info.iterrows():
    url = row.iloc[0]

with something like:

for url in info[put_the_name_of_the_url_column_here]: # or info.iloc[:,0] as proposed by serge
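
Since the file is read with header=None, pandas labels the columns with the integers 0, 1, ..., so the URL column can probably be addressed as info[0]. A short sketch using an in-memory stand-in for the TSV file (the contents are invented):

```python
import io

import pandas as pd

# In-memory stand-in for labeled_urls.tsv (hypothetical contents).
tsv = "https://example.com/a\tsport\nhttps://example.com/b\tpolitics\n"
info = pd.read_csv(io.StringIO(tsv), sep="\t", header=None)

# With header=None the column labels are the integers 0, 1, ...
column_labels = list(info.columns)

urls = [url for url in info[0]]  # same values as info.iloc[:, 0]
print(column_labels, urls)
```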
