A Django application for scraping web pages
Detailed description of the Python project django-easy-scraper
Django Easy Scraper
A standalone Django app that can easily be used/initialized with both Django and non-Django applications. The scraping mechanism is based on regular expressions and XPath.
It requires the Python requests module to be installed.
Installation
pip install django-easy-scraper
Basic usage
If using regular expressions:
from django_easy_scraper import scraper

class ScrapeExampleDotCom(scraper.Scraper):
    regex_fields = {
        'price': "Write your regex pattern for price here",
        'title': "Write your regex pattern for title here",
        # You can add as many fields/keys as you want in the same way
    }
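To make the idea concrete, here is a minimal, self-contained sketch of what regex-based field extraction looks like using only Python's re module; the HTML and patterns are illustrative and not taken from django-easy-scraper:

```python
import re

# Illustrative HTML and patterns -- not part of django-easy-scraper itself
html = '<h1 class="title">A scraped title</h1><span class="price">4</span>'

regex_fields = {
    'price': r'<span class="price">(\d+)</span>',
    'title': r'<h1 class="title">(.*?)</h1>',
}

# Apply each pattern and keep the first captured group, one result
# per key, mirroring how a regex_fields mapping would be consumed
data = {}
for key, pattern in regex_fields.items():
    match = re.search(pattern, html)
    data[key] = match.group(1) if match else None

print(data)  # {'price': '4', 'title': 'A scraped title'}
```

Each key that has no match simply comes back as None in this sketch.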
If using XPath:
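The XPath code sample did not survive in this copy of the page. As a stand-in, here is a sketch of XPath-style extraction using only the standard library's ElementTree, which supports a limited XPath subset; the xpath_fields name mirrors the regex example above and is an assumption, not the library's confirmed API:

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed markup (ElementTree requires valid XML)
html = ('<html><body>'
        '<h1 class="title">A scraped title</h1>'
        '<span class="price">4</span>'
        '</body></html>')
root = ET.fromstring(html)

# xpath_fields is a hypothetical name; the expressions use the limited
# XPath subset that ElementTree understands
xpath_fields = {
    'price': './/span[@class="price"]',
    'title': './/h1[@class="title"]',
}

data = {key: root.findtext(path) for key, path in xpath_fields.items()}
print(data)  # {'price': '4', 'title': 'A scraped title'}
```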
Scrape instantly
url = 'www.example.com/bla-bla-details-page/'
data = ScrapeExampleDotCom.regex_url_scraper(url)
print(data)
If your regex patterns are correct, the response should look like this:
{
'price': 4,
'title': 'a scraped title',
}
The regex_url_scraper method always returns a JSON response, so if you add several regex patterns to regex_fields, the response dictionary will contain one key/value pair for each pattern you added.
Scraping multiple sites
You don't need to call a different method for every site! Make just one call and have some fun, right?
Say you are going to scrape three sites.
Worried about how products from those sites will get scraped automatically?
- Write regex patterns for all the sites above, covering the fields you want to scrape:
from django_easy_scraper import scraper

class ScrapeExampleDotCom(scraper.Scraper):
    regex_fields = {
        'price': "Write your regex pattern for price here",
        'title': "Write your regex pattern for title here",
        # You can add as many fields/keys as you want in the same way
    }

class ScrapeExampleTwo(scraper.Scraper):
    regex_fields = {
        'price': "Write your regex pattern for price here",
        'title': "Write your regex pattern for title here",
        # You can add as many fields/keys as you want in the same way
    }

class ScrapeExampleThree(scraper.Scraper):
    regex_fields = {
        'price': "Write your regex pattern for price here",
        'title': "Write your regex pattern for title here",
        # You can add as many fields/keys as you want in the same way
    }
You have now written regex patterns for all the sites you want to scrape.
Now it's time to use our Switch class, which will route to the right script/class for the site you want to scrape. Cool, right?!
This is where the magic really begins: put all of your classes into the switcher dictionary.
Important Note:
Each key name should be the bare domain name: no www, no http, no slashes, and no added prefix/suffix.
Each value should be the regex_url_scraper method of the class you wrote for that domain.
from django_easy_scraper import switch

class Switch(switch.BaseSwitch):
    switcher = {
        'example.com': ScrapeExampleDotCom.regex_url_scraper,
        'exampletwo.com': ScrapeExampleTwo.regex_url_scraper,
        'examplethree.com': ScrapeExampleThree.regex_url_scraper,
    }
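The routing that Switch performs can be sketched in plain Python: map bare domains to callables and dispatch on the domain parsed out of the URL. The functions and get_data below are stand-ins for the real scraper classes, not the library's implementation:

```python
from urllib.parse import urlparse

# Stand-in scraper callables -- placeholders for the real class methods
def scrape_example(url):
    return {'site': 'example.com', 'url': url}

def scrape_example_two(url):
    return {'site': 'exampletwo.com', 'url': url}

# Bare domains as keys, callables as values, like the switcher dict
switcher = {
    'example.com': scrape_example,
    'exampletwo.com': scrape_example_two,
}

def get_data(url, raise_exception=False):
    """Route the URL to the scraper registered for its bare domain."""
    domain = urlparse(url).netloc.removeprefix('www.')
    scraper = switcher.get(domain)
    if scraper is None:
        if raise_exception:
            raise KeyError(f'no scraper registered for {domain!r}')
        return None
    return scraper(url)

print(get_data('https://www.example.com/item/1'))
# {'site': 'example.com', 'url': 'https://www.example.com/item/1'}
```

With raise_exception=True, an unregistered domain raises instead of returning None.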
If you use XPath, pass xpath_scraper instead of regex_url_scraper.
With that, routing based on the URL your script/class receives is done.
Get the response data as a Python dictionary, as with the sites above:
url = 'Any URL of a site you have written a class for and added to the Switch class'
response = Switch.get_data(url=url, raise_exception=False)
print(response)  # Gives you an object with the data you are trying to scrape
The Switch class automatically routes to the right scraping class based on the site link passed to its get_data method.
The get_data method's raise_exception argument controls whether an exception is raised when an expected field cannot be found.
Questions?
Please open an issue on our GitHub repo: https://github.com/dearopen/django-easy-scraper
And if you like the project, don't forget to contribute.
Happy scraping!!