Converting HTML contained in a string to a DataFrame

Posted 2024-05-13 19:03:50


Suppose I have a string that looks like this:

stuff = "<table><tr><td>Tuesday, January 15, 2019</td><td>2:44 PM EST</td><td>12</td><td>$530</td></tr><tr><td>Thursday, January 3, 2019</td><td>11:55 PM EST</td><td>11.5</td><td>$821</td></tr><tr><td>Friday, December 7, 2018</td><td>2:49 AM EST</td><td>11</td><td>$800</td></tr><tr><td>Wednesday, November 28, 2018</td><td>11:49 AM EST</td><td>9.5</td><td>$487</td></tr><tr><td>Monday, November 26, 2018</td><td>10:25 AM EST</td><td>11</td><td>$650</td></tr><tr><td>Thursday, November 22, 2018</td><td>5:52 PM EST</td><td>8.5</td><td>$792</td></tr><tr><td>Thursday, November 8, 2018</td><td>3:42 PM EST</td><td>11.5</td><td>$600</td></tr><tr><td>Saturday, September 29, 2018</td><td>9:40 PM EST</td><td>10</td><td>$470</td></tr><tr><td>Tuesday, September 4, 2018</td><td>4:11 PM EST</td><td>9.5</td><td>$649</td></tr><tr><td>Friday, July 13, 2018</td><td>2:07 PM EST</td><td>8</td><td>$650</td></tr><tr><td>Friday, July 6, 2018</td><td>1:21 PM EST</td><td>12</td><td>$495</td></tr><tr><td>Wednesday, June 13, 2018</td><td>5:14 PM EST</td><td>10</td><td>$450</td></tr><tr><td>Monday, June 4, 2018</td><td>4:24 PM EST</td><td>9.5</td><td>$476</td></tr><tr><td>Friday, April 13, 2018</td><td>9:16 AM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Monday, March 5, 2018</td><td>7:23 AM EST</td><td>8.5</td><td>$560</td></tr><tr><td>Thursday, January 11, 2018</td><td>1:40 PM EST</td><td>12</td><td>$800</td></tr><tr><td>Saturday, January 6, 2018</td><td>3:13 PM EST</td><td>9</td><td>$600</td></tr><tr><td>Thursday, December 14, 2017</td><td>1:06 PM EST</td><td>7.5</td><td>$726</td></tr><tr><td>Thursday, November 9, 2017</td><td>6:10 PM EST</td><td>10.5</td><td>$601</td></tr><tr><td>Wednesday, September 20, 2017</td><td>9:40 AM EST</td><td>10.5</td><td>$850</td></tr><tr><td>Friday, July 6, 2018</td><td>1:21 PM EST</td><td>12</td><td>$495</td></tr><tr><td>Wednesday, June 13, 2018</td><td>5:14 PM EST</td><td>10</td><td>$450</td></tr><tr><td>Monday, June 4, 2018</td><td>4:24 PM 
EST</td><td>9.5</td><td>$476</td></tr><tr><td>Friday, April 13, 2018</td><td>9:16 AM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Monday, March 5, 2018</td><td>7:23 AM EST</td><td>8.5</td><td>$560</td></tr><tr><td>Thursday, January 11, 2018</td><td>1:40 PM EST</td><td>12</td><td>$800</td></tr><tr><td>Saturday, January 6, 2018</td><td>3:13 PM EST</td><td>9</td><td>$600</td></tr><tr><td>Thursday, December 14, 2017</td><td>1:06 PM EST</td><td>7.5</td><td>$726</td></tr><tr><td>Thursday, November 9, 2017</td><td>6:10 PM EST</td><td>10.5</td><td>$601</td></tr><tr><td>Wednesday, September 20, 2017</td><td>9:40 AM EST</td><td>10.5</td><td>$850</td></tr><tr><td>Monday, July 24, 2017</td><td>12:22 PM EST</td><td>10.5</td><td>$600</td></tr><tr><td>Saturday, June 17, 2017</td><td>7:54 AM EST</td><td>11</td><td>$550</td></tr><tr><td>Saturday, June 10, 2017</td><td>7:32 PM EST</td><td>7.5</td><td>$750</td></tr><tr><td>Wednesday, May 24, 2017</td><td>3:10 PM EST</td><td>11</td><td>$741</td></tr><tr><td>Sunday, May 14, 2017</td><td>4:34 AM EST</td><td>10.5</td><td>$750</td></tr><tr><td>Monday, April 17, 2017</td><td>8:45 AM EST</td><td>10.5</td><td>$750</td></tr><tr><td>Saturday, April 1, 2017</td><td>9:44 PM EST</td><td>11</td><td>$750</td></tr><tr><td>Thursday, March 2, 2017</td><td>4:05 PM EST</td><td>11</td><td>$970</td></tr><tr><td>Thursday, February 23, 2017</td><td>3:03 PM EST</td><td>11.5</td><td>$675</td></tr><tr><td>Monday, January 23, 2017</td><td>3:29 PM EST</td><td>11</td><td>$726</td></tr><tr><td>Sunday, January 22, 2017</td><td>6:47 PM EST</td><td>11</td><td>$655</td></tr><tr><td>Friday, December 9, 2016</td><td>2:38 AM EST</td><td>10</td><td>$575</td></tr><tr><td>Thursday, December 8, 2016</td><td>5:23 PM EST</td><td>11.5</td><td>$1,200</td></tr><tr><td>Thursday, December 8, 2016</td><td>8:29 AM EST</td><td>12</td><td>$946</td></tr><tr><td>Saturday, November 26, 2016</td><td>3:09 PM EST</td><td>12</td><td>$1,031</td></tr><tr><td>Wednesday, November 23, 
2016</td><td>3:45 PM EST</td><td>7.5</td><td>$650</td></tr><tr><td>Monday, November 21, 2016</td><td>7:23 AM EST</td><td>11</td><td>$1,031</td></tr><tr><td>Friday, November 18, 2016</td><td>5:12 PM EST</td><td>11</td><td>$1,031</td></tr><tr><td>Thursday, November 17, 2016</td><td>9:11 AM EST</td><td>11</td><td>$660</td></tr><tr><td>Tuesday, November 8, 2016</td><td>7:17 AM EST</td><td>6.5</td><td>$777</td></tr><tr><td>Saturday, September 24, 2016</td><td>5:57 PM EST</td><td>8</td><td>$815</td></tr><tr><td>Thursday, August 25, 2016</td><td>3:52 PM EST</td><td>6.5</td><td>$750</td></tr><tr><td>Saturday, August 20, 2016</td><td>2:20 PM EST</td><td>10.5</td><td>$721</td></tr><tr><td>Saturday, August 20, 2016</td><td>1:39 PM EST</td><td>8</td><td>$721</td></tr><tr><td>Thursday, July 21, 2016</td><td>1:21 PM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Wednesday, July 20, 2016</td><td>6:14 AM EST</td><td>7.5</td><td>$777</td></tr><tr><td>Saturday, June 25, 2016</td><td>10:00 AM EST</td><td>9.5</td><td>$950</td></tr><tr><td>Thursday, June 23, 2016</td><td>5:26 PM EST</td><td>10.5</td><td>$580</td></tr><tr><td>Tuesday, June 21, 2016</td><td>1:19 PM EST</td><td>12.5</td><td>$600</td></tr><tr><td>Tuesday, May 31, 2016</td><td>10:06 AM EST</td><td>9.5</td><td>$828</td></tr></table>"

How can I use something like .read_html() in pandas to parse this string?

I obtained it using Selenium:

stuff = html_table.get_attribute('innerHTML')

I have to do it this way because a bunch of JavaScript on the page prevents me from accessing the content directly.


1 answer
Answer 1 · Posted 2024-05-13 19:03:50

Use pd.read_html. It returns a list of DataFrames (one per table found), so select the one you need by index first:

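from io import StringIO
import pandas as pd

# Wrapping the string in StringIO avoids the deprecation warning that
# recent pandas versions emit when literal HTML is passed directly
df = pd.read_html(StringIO(stuff))[0]
print(df.head())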
                              0             1     2     3
0     Tuesday, January 15, 2019   2:44 PM EST  12.0  $530
1     Thursday, January 3, 2019  11:55 PM EST  11.5  $821
2      Friday, December 7, 2018   2:49 AM EST  11.0  $800
3  Wednesday, November 28, 2018  11:49 AM EST   9.5  $487
4     Monday, November 26, 2018  10:25 AM EST  11.0  $650
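
To make the list-of-DataFrames behaviour easy to try, here is a minimal sketch on a shortened stand-in for the string from the question. The name `sample` and its two rows are only illustrative, and the sketch assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Shortened, illustrative stand-in for the `stuff` string (two rows only).
sample = (
    "<table>"
    "<tr><td>Tuesday, January 15, 2019</td><td>2:44 PM EST</td>"
    "<td>12</td><td>$530</td></tr>"
    "<tr><td>Thursday, December 8, 2016</td><td>5:23 PM EST</td>"
    "<td>11.5</td><td>$1,200</td></tr>"
    "</table>"
)

# read_html returns one DataFrame per <table> it finds, always as a list.
tables = pd.read_html(StringIO(sample))
print(type(tables), len(tables))

# The string holds a single table, so index 0 is the one we want.
df_sample = tables[0]
print(df_sample.shape)
```

Because the table has no header row, the columns are simply numbered, which is why the answer renames them in the next step.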

If necessary, you can then do some data cleaning:

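# Name the four columns (read_html numbers them 0..3 when there is no header row)
df.columns = ['date','time','val1','val2']

# Drop the trailing ' EST' from the time, join it to the date, and parse both together
df['date'] = pd.to_datetime(df['date'] + '-' + df.pop('time').str[:-4],
                            format='%A, %B %d, %Y-%I:%M %p')

# Remove '$' and thousands separators, then convert to int
df['val2'] = df['val2'].replace([r'\$', ','], '', regex=True).astype(int)
print(df.head())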
                 date  val1  val2
0 2019-01-15 14:44:00  12.0   530
1 2019-01-03 23:55:00  11.5   821
2 2018-12-07 02:49:00  11.0   800
3 2018-11-28 11:49:00   9.5   487
4 2018-11-26 10:25:00  11.0   650
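
The parsing and cleaning steps can also be bundled into one helper, which is convenient when the same table is scraped repeatedly. The following is only a sketch under assumptions: the function name `html_table_to_frame` is hypothetical, the two-row `sample` is an illustrative stand-in for the real string, and the time zone label is simply dropped rather than converted, as in the answer.

```python
from io import StringIO

import pandas as pd


def html_table_to_frame(html: str) -> pd.DataFrame:
    """Parse the first <table> in `html` and clean it (hypothetical helper)."""
    df = pd.read_html(StringIO(html))[0]
    df.columns = ["date", "time", "val1", "val2"]

    # Drop the ' EST' suffix, then combine date and time into one timestamp.
    stamp = df["date"] + " " + df.pop("time").str.replace(" EST", "", regex=False)
    df["date"] = pd.to_datetime(stamp, format="%A, %B %d, %Y %I:%M %p")

    # Strip '$' and thousands separators before converting to int.
    df["val2"] = df["val2"].str.replace(r"[$,]", "", regex=True).astype(int)
    return df


# Illustrative two-row input; the second row exercises the '$1,200' case.
sample = (
    "<table>"
    "<tr><td>Tuesday, January 15, 2019</td><td>2:44 PM EST</td>"
    "<td>12</td><td>$530</td></tr>"
    "<tr><td>Thursday, December 8, 2016</td><td>5:23 PM EST</td>"
    "<td>11.5</td><td>$1,200</td></tr>"
    "</table>"
)

cleaned = html_table_to_frame(sample)
print(cleaned)
```

With Selenium, the helper would be called on the string returned by get_attribute('innerHTML'), as in the question.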
