Scraping data from dt/dd tags that contain links

Published 2024-04-26 07:30:18


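I want to scrape data from the page "https://www.crunchbase.com/organization/ani-technologies#/entity". The data I need sits in dt and dd tags, and the site does not allow bots, so I saved the page and ran BeautifulSoup on the saved copy as shown below, even though my code still mentions the actual URL: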

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r"C:\Users\acer\Desktop\pythonbooks\tam.html").read(), "html.parser")


Actual Output:

0 3 Acquisitions
1 None
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 None
5 None
6 olacab link
7 None
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 None

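In a few places I do not get the text, because it is wrapped in hyperlinks. For example, on the page "https://www.crunchbase.com/organization/ani-technologies#/entity" the Categories field has 5 categories, named E-Commerce, Internet, Transportation, Apps and Mobile, and each one is a hyperlink, so I cannot get the text I want, i.e. these 5 categories.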

What I want as output:

0 3 Acquisitions
1 (All that text (though not important to me))
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 (all that text(though not important to me))
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile)(Extremely important)
6 olacab link
7 (all that text(though not important to me))
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 (all that text(though not important to me))

It would be very helpful if I could get a dictionary like this:

{"Headquarters":["Bengaluru,Karnataka"],
 "Description":["Ola is a mobile app for cab booking in India."],
 "Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]}

1 answer
User
#1 · posted 2024-04-26 07:30:18

Question: ... I can't get the text that I want ... if I can get a dictionary ...
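The code that produced the None entries is not shown in the question, so this is only an assumption: the pattern of results (text for plain values and single links, None wherever a dd holds several links) is what reading `.string` from each dd would give. A minimal sketch of that behaviour, using made-up markup and hypothetical hrefs rather than the real Crunchbase page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the fields listed in the question.
html = """
<dl>
  <dt>Headquarters:</dt><dd>Bengaluru, Karnataka</dd>
  <dt>Website:</dt><dd><a href="http://www.olacabs.com">olacab link</a></dd>
  <dt>Categories:</dt><dd><a href="/category/internet">Internet</a>, <a href="/category/mobile">Mobile</a></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
plain, single, multi = soup.find_all("dd")

print(plain.string)    # a dd holding only text: .string returns that text
print(single.string)   # a dd holding exactly one <a>: .string reaches into it
print(multi.string)    # a dd with several children: .string is None

# get_text() joins all nested text; iterating the links keeps the items separate.
print(multi.get_text())
print([a.get_text() for a in multi.find_all("a")])
```

If that is what happened, switching to `get_text()` or to the per-link text is enough to recover the categories.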

Collect the link text from every <dd><a href=...>text</a></dd> and aggregate it into a dict, for example:

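from collections import OrderedDict

# `soup` is the BeautifulSoup object built from the saved page in the question.
os_dict = OrderedDict()

for div_class in ['definition-list-container', 'details definition-list']:
    divs = soup.find_all("div", class_=div_class)
    key = '?'  # fallback key in case a <dd> shows up before any <dt>
    for div in divs:
        # Walk every descendant tag of the div in document order.
        for child in div.find_all(True):
            if child.name == 'dt':
                # Drop the last character of the label. This removes a trailing
                # ':' but cuts a letter from labels that have none.
                key = child.text[:-1]
            elif child.name == 'dd':
                if child.select('a[href]'):
                    a_list = child.find_all("a")
                    if key == 'Social:':
                        # Social links carry no useful text, so keep the URLs.
                        os_dict[key] = [a['href'] for a in a_list]
                    elif len(a_list) == 1:
                        os_dict[key] = a_list[0].text
                    else:
                        os_dict[key] = [a.text for a in a_list]
                else:
                    # No links inside this <dd>: store its plain text.
                    os_dict[key] = child.text

for n, key in enumerate(os_dict, 1):
    print('{:>2}: {:>20}:\t{}'.format(n, key, os_dict[key]))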

Output:

 1:          Acquisition:   3 Acquisitions
 2:  Total Equity Fundin:   ['11 Rounds', '24 Investors']
 3:         Headquarters:   Bengaluru, Karnataka
 4:          Description:   Ola is a mobile app for cab booking in India.
 5:             Founders:   ['Bhavish Aggarwal', 'Ankit Bhati']
 6:           Categories:   ['E-Commerce', 'Internet', 'Transportation', 'Apps', 'Mobile']
 7:              Website:   http://www.olacabs.com
 8:              Social::   ['http://www.facebook.com/olacabs', 'http://twitter.com/olacabs', 'http://www.linkedin.com/company/olacabs-com']
 9:              Founded:   December 3, 2010
10:              Aliases:   ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
11:              Contact:   media@olacabs.com
12:            Employees:   8 in Crunchbase
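Two details of this output differ from the dictionary requested in the question: some keys lose a letter ("Acquisition", "Total Equity Fundin") because `child.text[:-1]` removes the last character whether or not it is a colon, and values are a mix of strings and lists. A minimal sketch of a variant that strips only a trailing colon and always stores a list follows. The markup, the hrefs and the helper names `clean_key` and `dt_dd_to_dict` are hypothetical, and the real page may be structured differently:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; labels with and without a trailing colon on purpose.
html = """
<div class="definition-list-container">
  <dl>
    <dt>Acquisitions</dt>
    <dd><a href="/acquisitions">3 Acquisitions</a></dd>
    <dt>Headquarters:</dt>
    <dd>Bengaluru, Karnataka</dd>
    <dt>Categories:</dt>
    <dd><a href="/category/e-commerce">E-Commerce</a>, <a href="/category/internet">Internet</a></dd>
  </dl>
</div>
"""


def clean_key(text):
    """Trim whitespace and a trailing colon, if present, without cutting letters."""
    return text.strip().rstrip(":").strip()


def dt_dd_to_dict(soup):
    """Map each <dt> label to a list of values from the next <dd> sibling."""
    result = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # label without a value
        links = dd.find_all("a")
        if links:
            # One list item per link, e.g. one per category.
            values = [a.get_text(strip=True) for a in links]
        else:
            values = [dd.get_text(strip=True)]
        result[clean_key(dt.get_text())] = values
    return result


info = dt_dd_to_dict(BeautifulSoup(html, "html.parser"))
print(info)
```

Taking only the link text drops any unlinked text that shares a dd with a link, so a field that mixes both would need `dd.get_text()` instead.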

Beautiful Soup Documentation: find-all
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
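A short sketch of how the parameters in that signature behave, again on made-up markup with hypothetical hrefs. `class_` is one of the keyword filters accepted through `**kwargs`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing one of the class strings from the code above.
html = """
<div class="details definition-list">
  <dl>
    <dt>Founders:</dt>
    <dd><a href="/person/1">Bhavish Aggarwal</a>, <a href="/person/2">Ankit Bhati</a></dd>
  </dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name + class_: a single class name matches a tag that has several classes.
by_one_class = soup.find_all("div", class_="definition-list")

# The full attribute string also matches, but only in exactly this order.
by_full_string = soup.find_all("div", class_="details definition-list")

# limit: stop after the first match.
first_link = soup.find_all("a", limit=1)

# recursive=False: look at direct children only; the links are nested deeper.
direct_links = by_one_class[0].find_all("a", recursive=False)

# string: keep only tags whose text equals the given value.
named = soup.find_all("a", string="Ankit Bhati")

print(len(by_one_class), len(by_full_string))
print([a.get_text() for a in first_link])
print(direct_links)
print(named[0]["href"])
```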


Tested with Python 3.4.2 and bs4 4.6.0
