Fetching data from URLs and putting it into a DataFrame

info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
html, category = [], []
for i in info.index:
    response = requests.get(info.iloc[i,0])
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>','', str(soup.findAll(['p','h1','\href="/avtorji/'])))])
    category.append(info.iloc[0,i])
data = pd.DataFrame()
data['html'] = html
data['category'] = category

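2 Answers

User

Answer 1 · edited 2024-04-26 00:52:07

The error is most likely caused by passing index values to iloc: loc expects index labels and column names, whereas iloc expects the integer positions of rows and columns. In addition, you swapped the row and column positions for category in category.append(info.iloc[0,i]). So you should at least have: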

for i in range(len(info)):
    response = requests.get(info.iloc[i,0])
    ...
    category.append(info.iloc[i,0])
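To make the loc/iloc distinction concrete, here is a minimal sketch. The two-row table and its index labels 10 and 20 are made up for illustration and only stand in for the real labeled_urls.tsv; the last part shows the kind of failure you can get when a row counter is used as a column position.

```python
import pandas as pd

# Hypothetical stand-in for labeled_urls.tsv: column 0 holds URLs, column 1 holds labels.
info = pd.DataFrame(
    [["https://example.com/a", "news"], ["https://example.com/b", "sport"]],
    index=[10, 20],  # deliberately not 0..n-1, so labels and positions differ
)

by_position = info.iloc[1, 0]  # second row, first column, both counted from 0
by_label = info.loc[20, 0]     # row labelled 20, column labelled 0

# Swapped arguments: "row 0, column 5" does not exist in a two-column table.
error_name = None
try:
    info.iloc[0, 5]
except IndexError as exc:
    error_name = type(exc).__name__
```

Both lookups reach the same cell here, but only because label 20 happens to sit at position 1.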

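However, since you are just iterating over the first column of the DataFrame, the index-based loop above is not Pythonic. It is better to use the column directly: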

for url in info.iloc[:,0]:
    ...
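As a rough sketch of what iterating over the columns directly could look like end to end: this assumes the URLs are in column 0 and the categories in column 1, which the question does not confirm. build_frame, strip_tags and fake_fetch are hypothetical names; fake_fetch stands in for requests.get(url).text so that the example runs offline.

```python
import re

import pandas as pd


def strip_tags(markup):
    """Remove anything that looks like an HTML tag (same regex as in the question)."""
    return re.sub(r'<.*?>', '', markup)


def build_frame(info, fetch):
    """Walk the URL and category columns together, with no iloc inside the loop."""
    html, category = [], []
    for url, label in zip(info[0], info[1]):  # assumed layout: 0 = URL, 1 = category
        html.append(strip_tags(fetch(url)))
        category.append(label)
    return pd.DataFrame({'html': html, 'category': category})


def fake_fetch(url):
    """Offline stand-in for requests.get(url).text."""
    return '<p>page at ' + url + '</p>'


info = pd.DataFrame([['https://example.com/a', 'news'],
                     ['https://example.com/b', 'sport']])
data = build_frame(info, fake_fetch)
```

Passing the fetch function in as an argument is only a convenience for trying the loop without network access; with real pages you would pass something that calls requests.get.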

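User

Answer 2 · edited 2024-04-26 00:52:07

You can avoid the iloc call and use iterrows instead. I think you would have had to use loc rather than iloc, because you were operating on the index, but using iloc or loc inside a loop is generally not very efficient anyway. You can try the following code (with a waiting time inserted):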

import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
html, category = [], []
for i, row in info.iterrows():
    # row[columnname] works here as well (iloc is used only because the column names are unknown)
    url = row.iloc[0]
    time.sleep(2.5)  # wait 2.5 seconds between requests
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>','',
                  str(soup.find_all(['p','h1','\href="/avtorji/'])))])
    # the following iloc was probably raising the error, because it accesses the ith column in the first row of the df
    # category.append(info.iloc[0,i])
    category.append(row.iloc[0])  # not sure which field you wanted here; ideally replace it with row['name']

data = pd.DataFrame()
data['html'] = html
data['category'] = category
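The waiting time can also be pulled out into a small helper. This is only a sketch: fetch_all is a hypothetical name, and both the fetch function and the sleep function are passed in so that the pacing can be checked without touching the network or actually waiting.

```python
import time


def fetch_all(urls, fetch, delay=2.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)  # no pause is needed before the very first request
        pages.append(fetch(url))
    return pages


# Offline demonstration: record the pauses instead of really sleeping,
# and use str.upper as a stand-in for the download step.
pauses = []
pages = fetch_all(['u1', 'u2', 'u3'], fetch=str.upper, sleep=pauses.append)
```

With real pages, fetch could be something like lambda url: requests.get(url, timeout=10).text, and the default sleep would then produce the actual 2.5 second gap.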

If you really only need the URL inside the loop, replace:

for i, row in info.iterrows():
    url= row.iloc[0]

with something like:

for url in info[put_the_name_of_the_url_column_here]: # or info.iloc[:,0] as proposed by serge
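If the file has no header row, one way to get a column name to put there is to supply names when reading it. In this sketch the names 'url' and 'category' and the inline file contents are assumptions for illustration; the real labeled_urls.tsv may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for labeled_urls.tsv.
tsv = "https://example.com/a\tnews\nhttps://example.com/b\tsport\n"

# names=... assigns column labels to a header-less file ('url' and 'category' are assumed names).
info = pd.read_csv(io.StringIO(tsv), sep='\t', header=None,
                   names=['url', 'category'])

urls_by_name = list(info['url'])          # access by column name
urls_by_position = list(info.iloc[:, 0])  # access by position, as in the first answer
```

Both lists contain the same URLs, so either form can drive the for loop.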
