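Saving BeautifulSoup output to Mongo and reloading it
I have a crawler that fetches web pages for my application.
I want to keep concerns separate: the crawler should be "dumb". It only fetches the page, takes the JSON produced from BeautifulSoup, and saves it to MongoDB.
Other workers then read the documents from MongoDB, extract the relevant information, and convert it into a relational model.
The question is how to safely convert a BeautifulSoup object to JSON (a MongoDB document) and then safely convert it back without errors.
Edit: example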
import urllib2
import json
from bs4 import BeautifulSoup
req = urllib2.Request('http://www.google.com')
res = urllib2.urlopen(req)
soup = BeautifulSoup(res.read())
content = soup.findAll(text=True)
soup_json = json.dumps(content)
soup_json
Output:
'["doctype html", "Google", "(function(){\\nwindow.google={kEI:\\"LGktU9bfHqHk4wT1poGoAg\\",getEI:function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute(\\"eid\\")));)a=a.parentNode;return b||google.kEI},https:function(){return\\"https:\\"==window.location.protocol},kEXPI:\\"4006,17259,4000116,4007661,4007830,4008067,4008133,4008142,4009033,4009352,4009565,4009641,4010297,4010806,4010858,4010899,4011228,4011258,4011679,4011959,4012373,4012504,4012507,4013338,4013374,4013414,4013416,4013591,4013723,4013747,4013787,4013823,4013967,4013979,4014016,4014431,4014515,4014636,4014649,4014671,4014792,4014804,4014813,4014991,4015119,4015155,4015195,4015234,4015260,4015320,4015444,4015497,4015514,4015582,4015589,4015637,4015638,4015640,4015690,4015772,4015853,4015904,4015991,4015995,4016007,4016047,4016062,4016139,4016167,4016193,4016304,4016311,4016407,8300007,8300015,8300018,8500149,8500157,10200002,10200012,10200029,10200030,10200040,10200045,10200048,10200053,10200055,10200066,10200083,10200103,10200120,10200134,10200157\\",kCSI:{e:\\"4006,17259,4000116,4007661,4007830,4008067,4008133,4008142,4009033,4009352,4009565,4009641,4010297,4010806,4010858,4010899,4011228,4011258,4011679,4011959,4012373,4012504,4012507,4013338,4013374,4013414,4013416,4013591,4013723,4013747,4013787,4013823,4013967,4013979,4014016,4014431,4014515,4014636,4014649,4014671,4014792,4014804,4014813,4014991,4015119,4015155,4015195,4015234,4015260,4015320,4015444,4015497,4015514,4015582,4015589,4015637,4015638,4015640,4015690,4015772,4015853,4015904,4015991,4015995,4016007,4016047,4016062,4016139,4016167,4016193,4016304,4016311,4016407,8300007,8300015,8300018,8500149,8500157,10200002,10200012,10200029,10200030,10200040,10200045,10200048,10200053,10200055,10200066,10200083,10200103,10200120,10200134,10200157\\",ei:\\"LGktU9bfHqHk4wT1poGoAg\\"},authuser:0,ml:function(){},kHL:\\"iw\\",time:function(){return(new Date).getTime()},log:function(a,b,c,h,k){var d=\\nnew 
Image,f=google.lc,e=google.li,g=\\"\\";d.onerror=d.onload=d.onabort=function(){delete f[e]};f[e]=d;c||-1!=b.search(\\"&ei=\\")||(g=\\"&ei=\\"+google.getEI(h));c=c||\\"/\\"+(k||\\"gen_204\\")+\\"?atyp=i&ct=\\"+a+\\"&cad=\\"+b+g+\\"&zx=\\"+google.time();a=/^http:/i;a.test(c)&&google.https()?(google.ml(Error(\\"GLMM\\"),!1,{src:c}),delete f[e]):(d.src=c,google.li=e+1)},lc:[],li:0,y:{},x:function(a,b){google.y[a.id]=[a,b];return!1},load:function(a,b,c){google.x({id:a+l++},function(){google.load(a,b,c)})}};var l=0;})();\\n(function(){google.sn=\\"webhp\\";google.timers={};google.startTick=function(a,b){google.timers[a]={t:{start:google.time()},bfr:!!b}};google.tick=function(a,b,g){google.timers[a]||google.startTick(a);google.timers[a].t[b]=g||google.time()};google.startTick(\\"load\\",!0);\\ntry{}catch(d){}})();\\nvar _gjwl=location;function _gjuc(){var a=_gjwl.href.indexOf(\\"#\\");if(0<=a&&(a=_gjwl.href.substring(a),0<a.indexOf(\\"&q=\\")||0<=a.indexOf(\\"#q=\\"))&&(a=a.substring(1),-1==a.indexOf(\\"#\\"))){for(var d=0;d<a.length;){var b=d;\\"&\\"==a.charAt(b)&&++b;var c=a.indexOf(\\"&\\",b);-1==c&&(c=a.length);b=a.substring(b,c);if(0==b.indexOf(\\"fp=\\"))a=a.substring(0,d)+a.substring(c,a.length),c=d;else if(\\"cad=h\\"==b)return 0;d=c}_gjwl.href=\\"/search?\\"+a+\\"&cad=h\\";return 1}return 0}\\nfunction _gjh(){!_gjuc()&&window.google&&google.x&&google.x({id:\\"GJH\\"},function(){google.nav&&google.nav.gjh&&google.nav.gjh()})};\\nwindow._gjh&&_gjh();", "#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:left}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-left:.5em;vertical-align:top}#gbar{float:right}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}", 
"body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;overflow-y:scroll}#gog{padding:3px 8px 0}td{line-height:.8em}.gac_m td{line-height:17px}form{margin-bottom:20px}.h{color:#36c}.q{color:#00c}.ts td{padding:0}.ts{border-collapse:collapse}em{font-weight:bold;font-style:normal}.lst{height:25px;width:496px}.gsfi,.lst{font:18px arial,sans-serif}.gsfs{font:17px arial,sans-serif}.ds{display:inline-box;display:inline-block;margin:3px 0 4px;margin-right:4px}input{font-family:inherit}a.gb1,a.gb2,a.gb3,a.gb4{color:#11c !important}body{background:#fff;color:black}a{color:#11c;text-decoration:none}a:hover,a:active{text-decoration:underline}.fl a{color:#36c}a:visited{color:#551a8b}a.gb1,a.gb4{text-decoration:underline}a.gb3:hover{text-decoration:none}#ghead a.gb2:hover{color:#fff !important}.sblc{padding-top:5px}.sblc a{display:block;margin:2px 0;margin-right:13px;font-size:11px}.lsbb{background:#eee;border:solid 1px;border-color:#ccc #ccc #999 #999;height:30px}.lsbb{display:block}.ftl,#fll a{display:inline-block;margin:0 12px}.lsb{background:url(/images/srpr/nav_logo80.png) 0 -258px repeat-x;border:none;color:#000;cursor:pointer;height:30px;margin:0;outline:0;font:15px arial,sans-serif;vertical-align:top}.lsb:active{background:#ccc}.lst:focus{outline:none}#addlang a{padding:0 3px}.tiah{width:458px}", "(function(){var src=\'/images/nav_logo176.png\';var iesg=false;document.body.onload = function(){window.n && window.n();if (document.images){new Image().src=src;}\\nif (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}\\n}\\n})();", " ", "\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9", " ", "\\u00fa\\u00ee\\u00e5\\u00f0\\u00e5\\u00fa", " ", "\\u00ee\\u00f4\\u00e5\\u00fa", " ", "YouTube", " ", "\\u00e7\\u00e3\\u00f9\\u00e5\\u00fa", " ", "Gmail", " ", "Drive", " ", "\\u00e9\\u00e5\\u00ee\\u00ef", " ", "\\u00f2\\u00e5\\u00e3", " \\u00bb", "\\u00e4\\u00e9\\u00f1\\u00e8\\u00e5\\u00f8\\u00e9\\u00e9\\u00fa \\u00e0\\u00fa\\u00f8\\u00e9\\u00ed", " | ", 
"\\u00e4\\u00e2\\u00e3\\u00f8\\u00e5\\u00fa", " | ", "\\u00e4\\u00e9\\u00eb\\u00f0\\u00f1", " ", "\\u00e9\\u00f9\\u00f8\\u00e0\\u00ec", "\\u00a0", "\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9 \\u00ee\\u00fa\\u00f7\\u00e3\\u00ed", "\\u00eb\\u00ec\\u00e9 \\u00f9\\u00f4\\u00e4", "Google.co.il \\u00e2\\u00ed \\u00e1: ", "\\u0627\\u0644\\u0639\\u0631\\u0628\\u064a\\u0629", " ", "English", " \\u00f4\\u00f8\\u00f1\\u00e5\\u00ed \\u00e1-Google", "\\u00f4\\u00fa\\u00f8\\u00e5\\u00f0\\u00e5\\u00fa \\u00f2\\u00f1\\u00f7\\u00e9\\u00e9\\u00ed", "\\u00e4\\u00eb\\u00ec \\u00e0\\u00e5\\u00e3\\u00e5\\u00fa Google", "Google.com", "\\u00a9 2013 - ", "\\u00f4\\u00f8\\u00e8\\u00e9\\u00e5\\u00fa \\u00e5\\u00fa\\u00f0\\u00e0\\u00e9\\u00ed", "if(google.y)google.y.first=[];(function(){function b(a){window.setTimeout(function(){var c=document.createElement(\\"script\\");c.src=a;document.getElementById(\\"xjsd\\").appendChild(c)},0)}google.dljp=function(a){google.xjsu=a;b(a)};google.dlj=b;})();\\nif(!google.xjs){window._=window._||{};window._._DumpException=function(e){throw e};if(google.timers&&google.timers.load.t){google.timers.load.t.xjsls=new Date().getTime();}google.dljp(\'/xjs/_/js/k\\\\x3dxjs.hp.en_US.X67G-1Nbjpc.O/m\\\\x3dsb_he,pcc/rt\\\\x3dj/d\\\\x3d1/sv\\\\x3d1/rs\\\\x3dAItRSTO_vkVhEK6twEUdYclvmSrFcRL-Zw\');google.xjs=1;}google.pmc={\\"sb_he\\":{\\"agen\\":true,\\"cgen\\":true,\\"client\\":\\"heirloom-hp\\",\\"dh\\":true,\\"ds\\":\\"\\",\\"eqch\\":true,\\"fl\\":true,\\"host\\":\\"google.co.il\\",\\"jsonp\\":true,\\"msgs\\":{\\"dym\\":\\"\\u00e4\\u00e0\\u00ed \\u00e4\\u00fa\\u00eb\\u00e5\\u00e5\\u00f0\\u00fa \\u00ec:\\",\\"lcky\\":\\"\\u00e9\\u00e5\\u00fa\\u00f8 \\u00ee\\u00e6\\u00ec \\u00ee\\u00f9\\u00eb\\u00ec\\",\\"lml\\":\\"\\u00ec\\u00ee\\u00e9\\u00e3\\u00f2 \\u00f0\\u00e5\\u00f1\\u00f3\\",\\"oskt\\":\\"\\u00eb\\u00ec\\u00e9 \\u00e4\\u00e6\\u00f0\\u00e4\\",\\"psrc\\":\\"\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9 \\u00e6\\u00e4 \\u00e4\\u00e5\\u00f1\\u00f8 \\u00ee\\\\u003Ca 
href=\\\\\\"/history\\\\\\"\\\\u003E\\u00e4\\u00e9\\u00f1\\u00e8\\u00e5\\u00f8\\u00e9\\u00e9\\u00fa \\u00e4\\u00e0\\u00e9\\u00f0\\u00e8\\u00f8\\u00f0\\u00e8\\\\u003C/a\\\\u003E \\u00f9\\u00ec\\u00ea\\",\\"psrl\\":\\"\\u00e4\\u00f1\\u00f8\\",\\"sbit\\":\\"\\u00e7\\u00f4\\u00f9 \\u00ec\\u00f4\\u00e9 \\u00fa\\u00ee\\u00e5\\u00f0\\u00e4\\",\\"srch\\":\\"\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9 \\u00e1-Google\\"},\\"ovr\\":{},\\"pq\\":\\"\\",\\"qcpw\\":false,\\"scd\\":10,\\"sce\\":5,\\"stok\\":\\"AVgtYJUWkObPx6V5QqvD7hitdNE\\"},\\"pcc\\":{}};google.y.first.push(function(){if(google.med){google.med(\'init\');google.initHistory();google.med(\'history\');}});if(google.j&&google.j.en&&google.j.xi){window.setTimeout(google.j.xi,0);}", "(function(){if(google.timers&&google.timers.load.t){var b,c,d,e,g=function(a,f){a.removeEventListener?(a.removeEventListener(\\"load\\",f,!1),a.removeEventListener(\\"error\\",f,!1)):(a.detachEvent(\\"onload\\",f),a.detachEvent(\\"onerror\\",f))},h=function(a){e=(new Date).getTime();++c;a=a||window.event;a=a.target||a.srcElement;g(a,h)},k=document.getElementsByTagName(\\"img\\");b=k.length;for(var l=c=0,m;l<b;++l)m=k[l],m.complete||\\"string\\"!=typeof m.src||!m.src?++c:m.addEventListener?(m.addEventListener(\\"load\\",h,!1),m.addEventListener(\\"error\\",\\nh,!1)):(m.attachEvent(\\"onload\\",h),m.attachEvent(\\"onerror\\",h));d=b-c;var n=function(){if(google.timers.load.t){google.timers.load.t.ol=(new Date).getTime();google.timers.load.t.iml=e;google.kCSI.imc=c;google.kCSI.imn=b;google.kCSI.imp=d;void 0!==google.stt&&(google.kCSI.stt=google.stt);google.csiReport&&google.csiReport()}};window.addEventListener?window.addEventListener(\\"load\\",n,!1):window.attachEvent&&\\nwindow.attachEvent(\\"onload\\",n);google.timers.load.t.prt=e=(new Date).getTime()};})();\\n"]'
This JSON should be saved to MongoDB so that I can later restore a BeautifulSoup object from it.
1 Answer
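By the way, you don't actually need to turn the data into "soup" before storing it in MongoDB (or any other database).
Here are my reasons:
(1) When you turn the data into "soup", i.e. an instance of the class 'bs4.BeautifulSoup', it still has to become text (JSON or some other format) when it is stored in MongoDB. The next time you fetch it from the database, you have to call BeautifulSoup again to turn the string/JSON back into "soup", so you effectively process it twice.
(2) The "soup" is essentially a parse tree built from the HTML page. While building that tree, BeautifulSoup sometimes repairs broken or missing tags, and does other "smart" things that you may not want or that slightly modify the HTML page. For example, depending on the parser you use ("lxml" or "html5lib"), you can get different results. So running BeautifulSoup before storing the data may get you into trouble.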
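To illustrate both points, here is a minimal sketch. It assumes the built-in "html.parser" backend and a made-up HTML fragment; other parsers such as lxml or html5lib would typically produce different markup again (for example by adding html and body wrappers).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with unclosed <b> and <p> tags.
raw_html = "<p>Hello <b>world"

# Parsing builds a tree; serialising that tree yields repaired markup,
# not the text the server actually sent.
soup = BeautifulSoup(raw_html, "html.parser")
serialized = str(soup)
print(serialized)

# Point (2): the stored string no longer matches the original page.
print(serialized == raw_html)

# Point (1): loading the stored string means parsing a second time.
reparsed = BeautifulSoup(serialized, "html.parser")
print(str(reparsed) == serialized)
```

If you store `serialized`, the original markup is gone for good; if you store `raw_html`, you can still choose a parser later.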
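In short, I recommend storing the raw HTML content directly, without any processing. The easiest way to store it is to build documents in the following format:
{"url":"www.xxx.com/..", "html":"<!DOCTYPE html>...."}
This way you are basically mirroring/indexing the site's content on your local machine, without losing any information.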
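If you want to be able to check later that the stored page really is unchanged, one option is to record a checksum next to it. The sketch below is only an illustration: the `encoding`, `sha256` and `fetched_at` fields and both function names are my own assumptions, not part of the format above.

```python
import hashlib
from datetime import datetime, timezone


def make_page_document(url, raw_bytes, encoding="utf-8"):
    """Build a dict ready to insert into MongoDB without altering the markup."""
    return {
        "url": url,
        # Strict decoding: raises instead of silently replacing bad bytes.
        "html": raw_bytes.decode(encoding),
        # Hypothetical extra fields, added here for integrity checks.
        "encoding": encoding,
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "fetched_at": datetime.now(timezone.utc),
    }


def is_intact(document):
    """Return True if the stored HTML still matches its recorded checksum."""
    data = document["html"].encode(document["encoding"])
    return hashlib.sha256(data).hexdigest() == document["sha256"]


doc = make_page_document("http://example.com/", b"<!DOCTYPE html><p>hi</p>")
print(is_intact(doc))
```

A worker that reads the document back can call `is_intact` before parsing, which gives you a cheap answer to the "no errors on the round trip" part of the question.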
Here is some code to help you store and retrieve HTML with MongoDB:
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> db = client.oleg
>>>
>>> # get the raw html
... url = "http://www.crummy.com/software/BeautifulSoup/bs4/doc/#"
>>> import urllib2
>>> html = urllib2.urlopen(url).read()
>>> html[:100]
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n "http://www.w3.org/TR/xhtml1/DTD/xh'
>>>
>>>
>>> # store the <key:value> -> <url:html> into mongo for later use
... db.tikhonov.insert({"url":url, "html":html})
ObjectId('532e6904866cd3431a90c618')
>>>
>>> # retrieve the stored html by searching on the url
... record = db.tikhonov.find_one({"url":url})
>>> record['url']
u'http://www.crummy.com/software/BeautifulSoup/bs4/doc/#'
>>>
>>> # turn the html text into soup and start parsing
... from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(record['html'])
>>> soup.find("h1").text
u'Beautiful Soup Documentation\xb6'
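The session above is Python 2 (`urllib2`) and uses PyMongo's old `insert` call, which was removed in PyMongo 4. A rough Python 3 equivalent might look like the sketch below. The helper names (`fetch_html`, `save_page`, `load_page`, `demo`) are made up for illustration, and `replace_one` with `upsert=True` is my choice so that re-crawling a URL overwrites the old copy instead of adding a duplicate.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its markup as text, untouched by any parser."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_page(collection, url, html):
    """Insert the page, or overwrite the copy already stored for this url."""
    collection.replace_one({"url": url}, {"url": url, "html": html}, upsert=True)


def load_page(collection, url):
    """Return the stored html for url, or None if it was never stored."""
    record = collection.find_one({"url": url})
    return None if record is None else record["html"]


def demo():
    """Needs a MongoDB server on localhost plus pymongo and bs4 installed."""
    from bs4 import BeautifulSoup
    from pymongo import MongoClient

    pages = MongoClient("localhost", 27017).oleg.tikhonov
    url = "http://www.crummy.com/software/BeautifulSoup/bs4/doc/#"
    save_page(pages, url, fetch_html(url))
    soup = BeautifulSoup(load_page(pages, url), "html.parser")
    print(soup.find("h1").get_text())
```

Call `demo()` once a local MongoDB instance is running. `save_page` and `load_page` only need an object with `replace_one` and `find_one`, so they can be tested against a stand-in without a database.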
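P.S. Separating the "fetch HTML" step from the "parse" step is a very good idea. You can start collecting HTML pages without parsing them, since the HTTP requests are usually the most time-consuming part. You can collect the raw HTML pages first while you write and test your parser.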
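As a sketch of what the separate parse stage could look like, the worker below reads stored documents and writes one row per page into a relational table. Everything here is an assumption for illustration: SQLite stands in for your relational model, the `pages` table and the function names are made up, and the title extractor uses the standard library's `html.parser` instead of BeautifulSoup only to keep the example dependency-free. In practice `extract` would be your BeautifulSoup code.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def parse_stored_pages(documents, conn, extract=extract_title):
    """Run extract on each stored {"url", "html"} document and upsert a row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    count = 0
    for doc in documents:
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (doc["url"], extract(doc["html"])),
        )
        count += 1
    conn.commit()
    return count
```

Here `documents` can be any iterable of stored records, for example the cursor returned by `db.tikhonov.find()`. Because the raw HTML stays in MongoDB, you can change `extract` and rerun the worker without crawling again.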
Before scraping or locally storing intellectual property, always check the terms of service.