Slow HTML parser. How can I increase the speed?
I want to estimate the impact of news on the Dow Jones index. For that, I wrote an HTML parser in Python using the beautifulsoup library. I extract an article and store it in an XML file for later analysis with the NLTK library. How can I speed up the parsing? The code below does the job, but very slowly.
Here is the code of the HTML parser:
import urllib2
import re
import xml.etree.cElementTree as ET
import nltk
from bs4 import BeautifulSoup
from datetime import date
from dateutil.rrule import rrule, DAILY
from nltk.corpus import stopwords
from collections import defaultdict
def main_parser():
    #starting date
    a = date(2014, 3, 27)
    #ending date
    b = date(2014, 3, 27)
    articles = ET.Element("articles")
    f = open('~/Documents/test.xml', 'w')

    #loop through the links and per each link extract the text of the article, store the latter at xml file
    for dt in rrule(DAILY, dtstart=a, until=b):
        url = "http://www.reuters.com/resources/archive/us/" + dt.strftime("%Y") + dt.strftime("%m") + dt.strftime("%d") + ".html"
        page = urllib2.urlopen(url)
        #use html5lib ??? possibility to use another parser
        soup = BeautifulSoup(page.read(), "html5lib")
        article_date = ET.SubElement(articles, "article_date")
        article_date.text = str(dt)
        for links in soup.find_all("div", "headlineMed"):
            anchor_tag = links.a
            if not 'video' in anchor_tag['href']:
                try:
                    article_time = ET.SubElement(article_date, "article_time")
                    article_time.text = str(links.text[-11:])

                    article_header = ET.SubElement(article_time, "article_name")
                    article_header.text = str(anchor_tag.text)

                    article_link = ET.SubElement(article_time, "article_link")
                    article_link.text = str(anchor_tag['href']).encode('utf-8')

                    try:
                        article_text = ET.SubElement(article_time, "article_text")
                        #get text and remove all stop words
                        article_text.text = str(remove_stop_words(extract_article(anchor_tag['href']))).encode('ascii','ignore')
                    except Exception:
                        pass
                except Exception:
                    pass

    tree = ET.ElementTree(articles)
    tree.write("~/Documents/test.xml", "utf-8")
#getting the article text from the specific url
def extract_article(url):
    plain_text = ""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "html5lib")
    tag = soup.find_all("p")
    #replace all html tags
    plain_text = re.sub(r'<p>|</p>|[|]|<span class=.*</span>|<a href=.*</a>', "", str(tag))
    plain_text = plain_text.replace(", ,", "")
    return str(plain_text)

def remove_stop_words(text):
    text = nltk.word_tokenize(text)
    filtered_words = [w for w in text if not w in stopwords.words('english')]
    return ' '.join(filtered_words)
2 Answers
You want to pick the best parser.
We benchmarked most of the parsers / platforms while building http://serpapi.com
Here is a full write-up on Medium: https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
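As a rough illustration of how much the parser choice alone matters, here is a minimal timing sketch (not part of the original answer). It assumes the same Python 2 / urllib2 setup as the question, with both lxml and html5lib installed, and simply times BeautifulSoup on one archive page with each backend:

import time
import urllib2
from bs4 import BeautifulSoup

# download one archive page once, then time each parser backend on the same HTML
html = urllib2.urlopen("http://www.reuters.com/resources/archive/us/20140327.html").read()

for parser in ("html.parser", "lxml", "html5lib"):
    start = time.time()
    BeautifulSoup(html, parser)
    print "%-12s %.2f s" % (parser, time.time() - start)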
There are several things you can do to speed this up (without switching away from the modules you currently use):
- use the lxml parser instead of html5lib - it is much, much faster
- parse only the relevant part of the document with a SoupStrainer (note that html5lib does not support SoupStrainer - it always parses the whole document, slowly)
Here is how the modified code looks; a quick performance test shows at least a 3x improvement:
import urllib2
import xml.etree.cElementTree as ET
from datetime import date
from bs4 import SoupStrainer, BeautifulSoup
import nltk
from dateutil.rrule import rrule, DAILY
from nltk.corpus import stopwords
def main_parser():
    a = b = date(2014, 3, 27)
    articles = ET.Element("articles")
    for dt in rrule(DAILY, dtstart=a, until=b):
        url = "http://www.reuters.com/resources/archive/us/" + dt.strftime("%Y") + dt.strftime("%m") + dt.strftime("%d") + ".html"

        links = SoupStrainer("div", "headlineMed")
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=links)

        article_date = ET.SubElement(articles, "article_date")
        article_date.text = str(dt)

        for link in soup.find_all('a'):
            if not 'video' in link['href']:
                try:
                    article_time = ET.SubElement(article_date, "article_time")
                    article_time.text = str(link.text[-11:])

                    article_header = ET.SubElement(article_time, "article_name")
                    article_header.text = str(link.text)

                    article_link = ET.SubElement(article_time, "article_link")
                    article_link.text = str(link['href']).encode('utf-8')

                    try:
                        article_text = ET.SubElement(article_time, "article_text")
                        article_text.text = str(remove_stop_words(extract_article(link['href']))).encode('ascii', 'ignore')
                    except Exception:
                        pass
                except Exception:
                    pass

    tree = ET.ElementTree(articles)
    tree.write("~/Documents/test.xml", "utf-8")

def extract_article(url):
    paragraphs = SoupStrainer('p')
    soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=paragraphs)
    return soup.text

def remove_stop_words(text):
    text = nltk.word_tokenize(text)
    filtered_words = [w for w in text if not w in stopwords.words('english')]
    return ' '.join(filtered_words)
Note that I've removed the regular-expression processing from extract_article() - it looks like you can get the full text straight from the p tags.
I may have introduced some problems - please check that everything is correct.
Another solution would be to use lxml for everything: from parsing (replacing BeautifulSoup) to creating the XML (replacing xml.etree.ElementTree).
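A minimal sketch of that idea (my own illustration, not from the answer): lxml.html for fetching and parsing the pages, lxml.etree for building the XML. The helper names build_xml_for_day and extract_article_lxml are made up, the element names mirror the ones used above, and stop-word removal is left out - treat it as a starting point rather than a drop-in replacement:

import urllib2
from lxml import etree, html

def extract_article_lxml(url):
    # text_content() drops all markup, similar to soup.text above
    doc = html.fromstring(urllib2.urlopen(url).read())
    return " ".join(p.text_content() for p in doc.xpath("//p"))

def build_xml_for_day(day_url, day_label):
    articles = etree.Element("articles")
    article_date = etree.SubElement(articles, "article_date")
    article_date.text = day_label

    listing = html.fromstring(urllib2.urlopen(day_url).read())
    for a in listing.xpath('//div[@class="headlineMed"]/a'):
        href = a.get("href")
        if href is None or "video" in href:
            continue
        article_time = etree.SubElement(article_date, "article_time")
        article_name = etree.SubElement(article_time, "article_name")
        article_name.text = a.text_content()
        article_link = etree.SubElement(article_time, "article_link")
        article_link.text = href
        article_text = etree.SubElement(article_time, "article_text")
        article_text.text = extract_article_lxml(href)

    return etree.ElementTree(articles)

tree = build_xml_for_day("http://www.reuters.com/resources/archive/us/20140327.html", "2014-03-27")
tree.write("test.xml", encoding="utf-8")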
Yet another solution (definitely the fastest) would be to switch to the Scrapy web-scraping framework. It is simple and very fast, and it comes with every battery you could imagine: link extractors, XML exporters, database pipelines and so on. It is worth a look.
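For what it is worth, a bare-bones Scrapy spider for the same crawl could look roughly like this (my sketch, not from the answer; the spider name and the exported fields are made up, and stop-word removal would live in a pipeline). Running it with "scrapy runspider reuters_spider.py -o articles.xml" would use Scrapy's built-in XML exporter:

import scrapy

class ReutersArchiveSpider(scrapy.Spider):
    # hypothetical spider name, not part of the original answer
    name = "reuters_archive"
    start_urls = ["http://www.reuters.com/resources/archive/us/20140327.html"]

    def parse(self, response):
        # follow every headline link that is not a video page
        for link in response.css("div.headlineMed a"):
            href = link.xpath("@href").extract_first()
            if href and "video" not in href:
                yield scrapy.Request(href, callback=self.parse_article,
                                     meta={"headline": link.xpath("text()").extract_first()})

    def parse_article(self, response):
        # one item per article; the -o exporter serializes these to XML
        yield {
            "headline": response.meta["headline"],
            "link": response.url,
            "text": " ".join(response.css("p::text").extract()),
        }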
Hope that helps.