Scraping HTML with pandas: can it be used to scrape tables from a web page?

[Printed output from the question, garbled in extraction: several wide DataFrames (one reported as 8 rows x 26 columns) made up mostly of NaN. Page text such as "Browse SOURCE in AHTPDB", "This page gives statistics of SOURCE fields an...", "Following table enlists the number of entries ..." and a google.load("visualization", ...) script fragment is mixed in with the data that was actually wanted, namely the sources and their entry counts: Milk 834, Casein 723, Bovine 477, Cereals 419, Fish 384, Pork 333, Human 215, Chicken 177, Soybean 159, Egg 97.]

import urllib.request
from bs4 import BeautifulSoup

url = 'http://crdd.osdd.net/raghava/ahtpdb/display.php?details=1001'
html = urllib.request.urlopen(url).read()
bs = BeautifulSoup(html, 'lxml')
tab = bs.find("table", {"class": "tab"})
data = []
# Note: the rows are collected from the whole page, not only from `tab`.
rows = bs.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # drop empty cells
print(data)

1 answer

User

#1 · Posted 2024-04-19 01:05:45

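You should experiment with the index of the table. For example, I took the site you provided and found a page on it that contains a table (the url below). Then I ran the code you tried, with one small change: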

import pandas as pd

url = "http://crdd.osdd.net/raghava/ahtpdb/srcbr.php"
tables = pd.read_html(url)  # one DataFrame per <table> found on the page
print(tables[4])

I got the table just fine (it comes with the header row, but that is easy to remove afterwards).
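Removing the header row afterwards could look roughly like the sketch below. It assumes the header ended up as the first data row of the DataFrame; the helper name `drop_header_row` and the sample frame are made up for illustration and are not taken from the page.

```python
import pandas as pd


def drop_header_row(df):
    """Use the first row as column names and remove it from the data."""
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = df.iloc[0].tolist()
    return body


# Hypothetical stand-in for tables[4], with the header stored as row 0.
raw = pd.DataFrame([["Source", "Entries"], ["Milk", "834"], ["Casein", "723"]])
clean = drop_header_row(raw)
print(clean)
```

If the header is known in advance to be the first row, passing `header=0` to `pd.read_html` may achieve the same thing at read time.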

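The reason is that in the sample code you copied, the page contains only one table (or it contains several, and the one they wanted happens to be the first). That is why tables[0] gave them the table they wanted. In the example I show here, the site uses tables for layout, so the first table is not the one you want. Here it is the fifth table, which is why tables[4] works in this case.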
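Rather than guessing the index, one option is to search the list returned by `pd.read_html` for a value you expect in the table. The following is only a sketch: `find_table_indexes` is a hypothetical helper, and the two small frames stand in for the real list of tables.

```python
import pandas as pd


def find_table_indexes(tables, keyword):
    """Return the positions of the DataFrames with a cell containing keyword."""
    hits = []
    for i, df in enumerate(tables):
        # Compare as strings so numeric and NaN cells do not raise.
        cells = df.astype(str)
        mask = cells.apply(lambda col: col.str.contains(keyword, regex=False))
        if mask.to_numpy().any():
            hits.append(i)
    return hits


# Hypothetical stand-ins for the list returned by pd.read_html(url).
tables = [
    pd.DataFrame({0: ["Browse SOURCE in AHTPDB"]}),
    pd.DataFrame({0: ["Milk", "Casein"], 1: [834, 723]}),
]
for i, df in enumerate(tables):
    print(i, df.shape)  # the shape alone often shows which table is the data
print(find_table_indexes(tables, "Casein"))
```

Only cell values are checked, not column names. On a page that nests tables for layout, the outer tables can match as well, so the smallest matching table is usually the one to keep. `pd.read_html` also accepts a `match` argument that restricts the result to tables containing a given text.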

Note: you may want to save the table to a CSV file so that it is easier to read:

import pandas as pd

url = "http://crdd.osdd.net/raghava/ahtpdb/srcbr.php"
tables = pd.read_html(url)
tables[4].to_csv("path/to/file.csv")
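By default `to_csv` also writes the row index as an extra first column. If that is not wanted, `index=False` leaves it out. The sketch below writes to an in-memory buffer instead of a file so the result can be inspected; the frame and its column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for tables[4]; the column names are assumptions.
table = pd.DataFrame({"Source": ["Milk", "Casein"], "Entries": [834, 723]})

buffer = io.StringIO()
table.to_csv(buffer, index=False)  # index=False omits the row-number column
csv_text = buffer.getvalue()
print(csv_text)
```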

Based on the information you gave, try the following:

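from bs4 import BeautifulSoup
import urllib.request

url = 'http://crdd.osdd.net/raghava/ahtpdb/display.php?details=1001'
html = urllib.request.urlopen(url).read()
# Name the parser explicitly; 'lxml' is the one used in the question.
bs = BeautifulSoup(html, 'lxml')
# Select only the table whose class attribute is "tab".
tab = bs.find("table", {"class": "tab"})
print(tab)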

You will need to clean it up, but all of the table's data should be available there.
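One possible way to do that clean-up is to walk the rows of the selected table and keep only the non-empty cell text. This is a sketch, not a definitive recipe: `table_to_rows` is a hypothetical helper, and the sample markup is invented because the real page's structure is not shown here.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return the cell text of each non-empty row of a bs4 <table> element."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        cells = [text for text in cells if text]  # skip empty cells
        if cells:
            rows.append(cells)
    return rows


# Invented stand-in markup; the real page's structure may differ.
sample_html = """
<table class="tab">
  <tr><td>Field</td><td>Value</td></tr>
  <tr><td> ID </td><td>1001</td></tr>
  <tr><td></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
tab = soup.find("table", {"class": "tab"})
print(table_to_rows(tab))
```

Starting from `tab` rather than from the whole document keeps rows that belong to other tables on the page out of the result.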
