Scraping data from dt-dd tags that contain links

0 3 Acquisitions
1 None
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 None
5 None
6 olacab link
7 None
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 None

0 3 Acquisitions
1 (all that text (though not important to me))
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 (all that text (though not important to me))
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile) (extremely important)
6 olacab link
7 (all that text (though not important to me))
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 (all that text (though not important to me))

1 Answer

User

#1 · Posted 2024-06-16 15:21:51

Question: ... I can't get the text that I want ... if I can get dictionary ...

Get the values from all <dd><a href=...>text</dd> elements and aggregate them into a dict, for example:

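from collections import OrderedDict

# `soup` is assumed to be a BeautifulSoup object already built from the page HTML.
os_dict = OrderedDict()

for div_class in ['definition-list-container', 'details definition-list']:
    divs = soup.find_all("div", class_=div_class)
    key = '?'  # placeholder in case a <dd> shows up before any <dt>
    for div in divs:
        for child in div.findChildren():
            if child.name == 'dt':
                # Drop the last character (the trailing colon, where present).
                # Note that this also trims labels that do not end in a colon.
                key = child.text[:-1]
            elif child.name == 'dd':
                # Only consider anchors that actually carry an href attribute.
                a_list = child.find_all("a", href=True)
                if a_list:
                    if key == 'Social:':
                        # Social links: keep the URLs rather than the link text.
                        os_dict[key] = [a['href'] for a in a_list]
                    elif len(a_list) == 1:
                        os_dict[key] = a_list[0].text
                    else:
                        os_dict[key] = [a.text for a in a_list]
                else:
                    # Plain <dd> without links: keep its text as-is.
                    os_dict[key] = child.text

# Print the collected pairs, numbered from 1.
for n, key in enumerate(os_dict, 1):
    print('{:>2}: {:>20}:\t{}'.format(n, key, os_dict[key]))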

Output:

 1:          Acquisition:   3 Acquisitions
 2:  Total Equity Fundin:   ['11 Rounds', '24 Investors']
 3:         Headquarters:   Bengaluru, Karnataka
 4:          Description:   Ola is a mobile app for cab booking in India.
 5:             Founders:   ['Bhavish Aggarwal', 'Ankit Bhati']
 6:           Categories:   ['E-Commerce', 'Internet', 'Transportation', 'Apps', 'Mobile']
 7:              Website:   http://www.olacabs.com
 8:              Social::   ['http://www.facebook.com/olacabs', 'http://twitter.com/olacabs', 'http://www.linkedin.com/company/olacabs-com']
 9:              Founded:   December 3, 2010
10:              Aliases:   ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
11:              Contact:   media@olacabs.com
12:            Employees:   8 in Crunchbase
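
The keys "Acquisition" and "Total Equity Fundin" in the output lost their final letter because child.text[:-1] always removes one character, colon or not. A minimal sketch of a gentler normalization follows; clean_label is a hypothetical helper, and the sample labels are assumptions about the raw <dt> text rather than values taken from the page.

```python
def clean_label(raw):
    """Normalize a <dt> label: trim whitespace, then drop any trailing colons."""
    return raw.strip().rstrip(":").strip()


# Hypothetical raw labels; the real <dt> text on the page may differ.
samples = ["Headquarters:", "Acquisitions", "Total Equity Funding", "Social: "]

for raw in samples:
    # Compare blind slicing with colon-aware cleaning.
    print("{!r:>25} | [:-1] -> {!r:<24} | clean_label -> {!r}".format(
        raw, raw[:-1], clean_label(raw)))
```

If keys were normalized this way, the comparison against 'Social:' in the loop would have to become 'Social'.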

Beautiful Soup Documentation: find-all
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
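
As a rough illustration of those parameters, the sketch below runs find_all against a small, made-up HTML snippet. The markup is an assumption loosely modelled on the dt/dd layout discussed here, not the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the dt/dd layout discussed above.
html = """
<div class="definition-list-container">
  <dl>
    <dt>Categories:</dt>
    <dd><a href="/category/apps">Apps</a>, <a href="/category/mobile">Mobile</a></dd>
    <dt>Website:</dt>
    <dd><a href="http://www.olacabs.com">http://www.olacabs.com</a></dd>
    <dt>Founded:</dt>
    <dd>December 3, 2010</dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# name + class_ keyword: every <div> carrying the given CSS class.
containers = soup.find_all("div", class_="definition-list-container")

# Keyword filter: href=True keeps only <a> tags that have an href attribute.
links = soup.find_all("a", href=True)
link_texts = [a.get_text() for a in links]

# limit stops the search after the first N matches.
first_dd = soup.find_all("dd", limit=1)

# recursive=False looks only at direct children of the tag it is called on.
dl = containers[0].find("dl")
dt_labels = [dt.get_text() for dt in dl.find_all("dt", recursive=False)]

# string matches on a tag's text content in addition to its name.
founded = soup.find_all("dt", string="Founded:")

print(len(containers), link_texts, dt_labels, len(first_dd), len(founded))
```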


Tested with Python: 3.4.2 - bs4: 4.6.0
