BeautifulSoup's .text returns text without separators (\n, \r, etc.)

    import urllib
    from BeautifulSoup import BeautifulSoup
    import vkontakte

    vk = vkontakte.API(token=<SECRET_TOKEN>)
    audios = vk.getAudios(count='2')
    # Each item looks like:
    # {u'artist': u'The Beatles', u'url': u'http://cs4519.vkontakte.ru/u4665445/audio/4241af71a888.mp3', u'title': u'Yesterday', u'lyrics_id': u'2365986', u'duration': 130, u'aid': 166194990, u'owner_id': 173505924}
    base_url = 'http://amalgama.mobi/songs/'
    for i in audios:
        print i['artist']
        # Build the URL per song so it does not accumulate across iterations
        if i['artist'].startswith('The '):
            song_url = base_url + i['artist'][4:5] + '/' + i['artist'][4:].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
        else:
            song_url = base_url + i['artist'][:1] + '/' + i['artist'].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
        song_url = song_url.lower()
        page = urllib.urlopen(song_url)
        soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
        texts = soup.findAll('ol')
        if len(texts) != 0:
            en = texts[0].text  # this!
            ru = texts[1].text  # this!
            vk.get('audio.edit', aid=i['aid'], oid=i['owner_id'], artist=i['artist'], title=i['title'], text=ru, no_search=0)

3 answers

User
Answer 1 · edited 2024-05-13 15:05:45

Try the separator parameter of the get_text() method:
    from bs4 import BeautifulSoup

    html = '''<p> Hi. This is a simple example.<br>Yet powerful one. <p>'''
    soup = BeautifulSoup(html, 'html.parser')

    soup.get_text()
    # Output: ' Hi. This is a simple example.Yet powerful one. '

    soup.get_text(separator=' ')
    # Output: ' Hi. This is a simple example. Yet powerful one. '
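
For lyrics you probably want the line breaks back rather than spaces. A minimal sketch, assuming Beautiful Soup 4 on Python 3; the markup below is made up and only stands in for one <ol> of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one <ol> of lyrics (not the real page).
html = "<ol><li>Yesterday<br>all my troubles<br>seemed so far away</li></ol>"

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

# Default: text nodes are concatenated with nothing in between.
joined = ol.get_text()

# separator puts "\n" between text nodes; strip=True trims each node
# and drops nodes that are empty after trimming.
lines = ol.get_text(separator="\n", strip=True)

print(repr(joined))
print(lines)
```

Here `joined` runs the three lines together, while `lines` keeps one lyric line per text node.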

User
Answer 2 · edited 2024-05-13 15:05:45

You could do it like this:
    soup = BeautifulSoup(html)
    ols = soup.findAll('ol')  # for the two languages
    for ol in ols:
        ps = ol.findAll('p')
        for p in ps:
            for item in p.contents:
                if str(item) != '<br />':
                    print str(item)
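
That snippet is written for BeautifulSoup 3 on Python 2, where a <br> serializes as '<br />'. A rough adaptation of the same idea to Beautiful Soup 4 on Python 3 might look like the sketch below. The markup and the helper name `lines_of` are made up for illustration, and unlike the original it skips every tag, not only <br>:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: two <ol> blocks, one per language.
html = (
    "<ol><p>first line<br/>second line</p></ol>"
    "<ol><p>translated first<br/>translated second</p></ol>"
)


def lines_of(ol):
    """Collect the direct text nodes of every <p> in `ol`, skipping tags."""
    lines = []
    for p in ol.find_all("p"):
        for item in p.contents:
            # Text nodes are NavigableString; <br> and other tags are Tag objects.
            if isinstance(item, NavigableString):
                lines.append(str(item).strip())
    return lines


soup = BeautifulSoup(html, "html.parser")
texts = ["\n".join(lines_of(ol)) for ol in soup.find_all("ol")]

for text in texts:
    print(text)
    print("---")
```

Checking the node type avoids comparing against a serialized '<br />' string, which differs between library versions.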

User
Answer 3 · edited 2024-05-13 15:05:45

I suggest looking into the .strings generator found in Beautiful Soup 4.
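
A short sketch of what that could look like, assuming Beautiful Soup 4 on Python 3 and made-up markup. `.strings` yields every text node as-is, and its sibling `.stripped_strings` trims whitespace and skips nodes that are empty after trimming:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the extra spaces show the difference between the two generators.
html = "<ol><li> Yesterday <br> all my troubles </li></ol>"

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

raw = list(ol.strings)             # text nodes with their whitespace intact
clean = list(ol.stripped_strings)  # trimmed, whitespace-only nodes dropped

# Join with whatever separator you need, here one lyric line per text node.
text = "\n".join(clean)
print(text)
```

Because both are generators, you choose the separator yourself when joining instead of relying on what .text does.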
