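I want to fetch data from the page "https://www.crunchbase.com/organization/ani-technologies#/entity". The data I need sits inside dt and dd tags, and the site does not allow bots. So I saved the page and applied the BeautifulSoup module to the saved copy as shown here, although my code still mentions the actual URL: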
soup = BeautifulSoup(open(r"C:\Users\acer\Desktop\pythonbooks\tam.html").read())
Actual Output:
0 3 Acquisitions
1 None
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 None
5 None
6 olacab link
7 None
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 None
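In several places I do not get the actual text, because the content consists of hyperlinks. For example, on the page "https://www.crunchbase.com/organization/ani-technologies#/entity" the Categories tab has 5 categories, named E-Commerce, Internet, Transportation, Apps and Mobile, and each one is wrapped in a hyperlink, so I cannot get the text I want, which is those 5 categories.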
What I want as output:
0 3 Acquisitions
1 (all that text (though not important to me))
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 (all that text (though not important to me))
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile) (extremely important)
6 olacab link
7 (all that text (though not important to me))
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 (all that text (though not important to me))
It would be very helpful if I could get a dictionary like this:
{"Headquarters":["Bengaluru,Karnataka"],
"Description":["Ola is a mobile app for cab booking in India."],
"Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]}
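The None entries in the actual output are consistent with reading each dd through its .string attribute, which BeautifulSoup sets to None whenever a tag has more than one child node. That is an assumption, since the extraction loop itself is not shown. A minimal sketch of the difference, using hypothetical markup that imitates one dt/dd pair rather than the real saved page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one dt/dd pair; the real page may differ.
html = """
<dt>Categories:</dt>
<dd><a href="/category/e-commerce">E-Commerce</a>, <a href="/category/internet">Internet</a></dd>
"""

soup = BeautifulSoup(html, "html.parser")
dd = soup.find("dd")

# .string is None here because <dd> holds several child nodes
# (two <a> tags plus the ", " text between them).
string_value = dd.string

# get_text() concatenates the text of every descendant, links included.
text_value = dd.get_text().strip()

# Reading each link separately gives a clean list of names.
categories = [a.get_text(strip=True) for a in dd.find_all("a")]

print(string_value)
print(text_value)
print(categories)
```

A dd that wraps a single link still yields its text through .string, which would explain why entry 6 (olacab link) is present while the multi-link entries are None.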
Collect the text from every
<dd><a href=...>text</a></dd>
element and aggregate it into a dict.
Tested with Python 3.4.2 and bs4 4.6.0.
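One way such an aggregation could look is sketched below. It is not the answer's original code: the markup, the helper name dt_dd_to_dict and the trailing-colon labels are all assumptions made for illustration, and the real saved page may nest its dt and dd tags differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; labels and structure are assumed, not copied from the site.
html = """
<dl>
  <dt>Headquarters:</dt>
  <dd>Bengaluru, Karnataka</dd>
  <dt>Description:</dt>
  <dd>Ola is a mobile app for cab booking in India.</dd>
  <dt>Categories:</dt>
  <dd><a href="#">E-Commerce</a>, <a href="#">Internet</a>, <a href="#">Transportation</a>, <a href="#">Apps</a>, <a href="#">Mobile</a></dd>
</dl>
"""

def dt_dd_to_dict(soup):
    """Map each <dt> label to a list of text values from the <dd> after it."""
    result = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # label without a value
        key = dt.get_text(strip=True).rstrip(":")
        links = dd.find_all("a")
        if links:
            # One list entry per hyperlink, e.g. each category name.
            values = [a.get_text(strip=True) for a in links]
        else:
            # Plain-text value: keep it as a single-item list.
            values = [dd.get_text(strip=True)]
        result[key] = values
    return result

soup = BeautifulSoup(html, "html.parser")
info = dt_dd_to_dict(soup)
print(info)
```

For the saved page, the same helper could be called on the soup object built from tam.html. Note that a dd mixing links with plain text keeps only the link text in this sketch, so those entries may need dd.get_text() instead.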