How can I best isolate two different unlabeled HTML fragments with Beautiful Soup and print them to CSV?

import re
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.indiainfoline.com/Markets/Company/A.aspx').read()
soup = BeautifulSoup(page)
soup.prettify()

pattern = re.compile(r'^\/Markets\/Company\/\D\.aspx$')

all_links = []
navigation_links = []
root = "http://www.indiainfoline.com/"

# Finding all links
for anchor in soup.findAll('a', href=True):
    all_links.append(anchor['href'])

# Isolate links matching regex
for link in all_links:
    if re.match(pattern, link):
        navigation_links.append(root + re.match(pattern, link).group(0))

navigation_links = list(set(navigation_links))

company_pages = []
for page in navigation_links:
    for anchor in soup.findAll('table', id='AlphaQuotes1_Rep_quote')[0].findAll('a', href=True):
        company_pages.append(root + anchor['href'])

1 Answer

User

#1 · Posted on 2024-04-19 10:22:19

One piece at a time. Getting the link to each company is easy:

from bs4 import BeautifulSoup
import requests

html = requests.get('http://www.indiainfoline.com/Markets/Company/A.aspx').text
bs = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

# find the links to companies
company_menu = bs.find("div",{'style':'padding-left:5px'})
# print all company links
companies = company_menu.find_all('a')
for company in companies:
    print(company['href'])
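
The hrefs printed above may be relative paths (the question's code prepends a root URL, which suggests they are). A minimal sketch of turning them into absolute URLs, assuming Python 3; `BASE_URL`, `absolutize`, and the sample hrefs are illustrative names and values, not something taken from the site:

```python
from urllib.parse import urljoin

# Assumption: the page the links were scraped from.
BASE_URL = "http://www.indiainfoline.com/Markets/Company/A.aspx"


def absolutize(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical hrefs, shaped like the ones the loop above prints.
sample_hrefs = [
    "/Markets/Company/Adani-Power-Ltd/533096",
    "/Markets/Company/Adani-Power-Ltd/533096",
    "http://www.indiainfoline.com/Markets/Company/B.aspx",
]
print(absolutize(sample_hrefs))
```

`urljoin` leaves hrefs that are already absolute untouched, so the same helper should work whichever form the page uses.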

Second, get the company names:

for company in companies:
    print(company.getText().strip())
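
Since the question asks for CSV output, here is a rough sketch of writing each name next to its link with the standard `csv` module, assuming Python 3. The `write_companies_csv` helper, the header names, and the sample rows are all made up for illustration; in practice the rows would come from `company.getText()` and `company['href']`:

```python
import csv
import io


def write_companies_csv(rows, out):
    """Write (name, link) pairs to the file-like object `out` as CSV."""
    writer = csv.writer(out)
    writer.writerow(["name", "link"])  # header names are an assumption
    for name, link in rows:
        writer.writerow([name.strip(), link])


# Hypothetical rows standing in for (company.getText(), company['href']).
sample_rows = [
    ("  Adani Power Ltd ", "/Markets/Company/Adani-Power-Ltd/533096"),
    ("Example, Inc.", "/Markets/Company/Example-Inc/000000"),
]

buffer = io.StringIO()
write_companies_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with `open("companies.csv", "w", newline="")` and pass that in. The `csv` writer quotes names that contain commas, which is the main reason to prefer it over joining strings by hand.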

Third, the email addresses are a bit trickier, but you can use a regex here. On each individual company page, do the following:

import re
# example company page
html = requests.get('http://www.indiainfoline.com/Markets/Company/Adani-Power-Ltd/533096').text
# raw string, so the backslashes reach the regex engine unchanged
EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")
emails = EMAIL_REGEX.findall(html)
# emails is now a list of the addresses found on the page
...
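
To try the regex without hitting the site, something like the following could be run against a small HTML string. This is only a sketch: `extract_emails` and `sample_html` are invented for the example, and the real company page may be marked up differently:

```python
import re

# Same pattern as above, written as a raw string.
EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")


def extract_emails(html):
    """Return the unique mailto: addresses in html, in order of first appearance."""
    found = []
    for email in EMAIL_REGEX.findall(html):
        if email not in found:
            found.append(email)
    return found


# Hypothetical markup; the real company page may look different.
sample_html = """
<a href="mailto:investor.relations@example.com">Investor relations</a>
<a href="mailto:investor.relations@example.com">Contact</a>
<a href="mailto:info@example.org">Info</a>
"""
print(extract_emails(sample_html))
```

One caveat: the pattern allows only a single dot after the `@`, so an address on a domain such as `example.co.in` would be captured only up to `example.co`. If the pages use such domains, the domain part of the pattern would need to be loosened.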

Cheers
