rentswatch-scraper
A basic framework for building rental-ad scrapers
This package provides a simple and maintainable way to build rental-ad scrapers. Rentswatch is a cross-border investigation that collects data on housing rents across Europe. Its search engine focuses mainly on classified ads.
How to install
Using pip:

```shell
pip install rentswatch-scraper
```
How to use
Let's walk through a quick example that uses rentswatch-scraper to build a simple, model-backed scraper that collects data from a website.
First, import the package components needed to build a scraper:
```python
#!/usr/bin/env python
from rentswatch_scraper.scraper import Scraper
from rentswatch_scraper.browser import geocode, convert
from rentswatch_scraper.fields import RegexField, ComputedField
from rentswatch_scraper import reporting
```
To factor out as much code as possible, we created an abstract class that every scraper extends. To keep things simple, we will target a dummy website:
```python
class DummyScraper(Scraper):
    # Those are the basic meta-properties that define the scraper behavior
    class Meta:
        country = 'FR'
        site = "dummy"
        baseUrl = 'http://dummy.io'
        listUrl = baseUrl + '/rent/city/paris/list.php'
        adBlockSelector = '.ad-page-link'
```
Without any further configuration, this scraper will start collecting ads from the dummy.io list page. To find the links to individual ads, it uses the CSS selector `.ad-page-link` to pick up `<a>` tags and follows their `href` attributes.
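Under the hood, this kind of link extraction boils down to selecting the matching tags and reading their `href` attributes. Below is a minimal stand-alone sketch using only the standard library; the package itself presumably relies on an HTML-parsing library, and the HTML snippet here is invented for illustration:

```python
from html.parser import HTMLParser

class AdLinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.css_class in (attrs.get("class") or "").split():
            self.hrefs.append(attrs.get("href"))

# Invented list-page markup, mimicking what dummy.io might serve.
page = """
<ul>
  <li><a class="ad-page-link" href="/rent/ad/1.php">Flat in Paris</a></li>
  <li><a class="other-link" href="/about.php">About</a></li>
  <li><a class="ad-page-link" href="/rent/ad/2.php">Studio</a></li>
</ul>
"""

extractor = AdLinkExtractor("ad-page-link")
extractor.feed(page)
print(extractor.hrefs)  # ['/rent/ad/1.php', '/rent/ad/2.php']
```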
We now teach the scraper how to extract the key figures from an ad page.
```python
class DummyScraper(Scraper):
    # HEADS UP: Meta declarations are hidden here
    # ...
    # ...

    # Extract data using a CSS selector.
    realtorName = RegexField('.realtor-title')

    # Extract data using a CSS selector and a regex.
    serviceCharge = RegexField('.description-list', r'charges : (.*)\s€')

    # Extract data using a CSS selector and a regex.
    # This will throw a custom exception if the field is missing.
    livingSpace = RegexField('.description-list', r'surface :(\d*)',
                             required=True,
                             exception=reporting.SpaceMissingError)

    # Extract the value directly, without using a regex.
    totalRent = RegexField('.description-price',
                           required=True,
                           exception=reporting.RentMissingError)

    # Store this value as a private property (beginning with an underscore).
    # It won't be saved in the database, but it can be helpful, as you'll see.
    _address = RegexField('.description-address')
```
Each of these properties will be extracted from the ad and saved as an attribute of the Ad model.
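The regex half of a RegexField amounts to a plain `re.search` over the text that the CSS selector returned. A quick illustration of that mechanic, with an invented description string and patterns adapted from the ones above:

```python
import re

# Text that a selector like '.description-list' might return (invented example).
text = "surface : 42 m2 - charges : 150 €"

# The first capture group is the value the field keeps.
living_space = re.search(r"surface : (\d+)", text).group(1)
service_charge = re.search(r"charges : (.*)\s€", text).group(1)

print(living_space, service_charge)  # prints: 42 150
```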
Some properties cannot be extracted directly from the HTML; you may need a custom function that receives the already-extracted properties. For this reason we created a second field type named ComputedField. Since the order of property declarations is recorded, previously declared (and extracted) values can be used to compute new ones.
```python
class DummyScraper(Scraper):
    # ...
    # ...

    # Use the existing properties `totalRent` and `livingSpace`, as they were
    # extracted before this one.
    pricePerSqm = ComputedField(
        fn=lambda s, values: values["totalRent"] / values["livingSpace"]
    )

    # This full example uses private properties to find latitude and longitude.
    # To do so we use a built-in function named `geocode` that transforms an
    # address into a dictionary of coordinates.
    _latLng = ComputedField(
        fn=lambda s, values: geocode(values['_address'], 'FRA')
    )

    # Get the dictionary fields we want.
    latitude = ComputedField(fn=lambda s, values: values['_latLng']['lat'])
    longitude = ComputedField(fn=lambda s, values: values['_latLng']['lng'])
```
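The "declaration order is recorded" behaviour can be implemented with a simple counter on the field class. The sketch below shows the idea in isolation; it is not the package's actual implementation, and all names (`Field`, `Model`, `resolve`) are illustrative:

```python
import itertools

class Field:
    """Base field that remembers the order in which it was declared."""
    _counter = itertools.count()

    def __init__(self, fn):
        self.order = next(Field._counter)  # declaration order
        self.fn = fn

class Model:
    # Later fields can use values computed by earlier ones.
    totalRent = Field(fn=lambda values: 840.0)
    livingSpace = Field(fn=lambda values: 42.0)
    pricePerSqm = Field(fn=lambda values: values["totalRent"] / values["livingSpace"])

def resolve(model_cls):
    """Evaluate fields in declaration order, feeding earlier results to later ones."""
    fields = sorted(
        ((name, f) for name, f in vars(model_cls).items() if isinstance(f, Field)),
        key=lambda pair: pair[1].order,
    )
    values = {}
    for name, field in fields:
        values[name] = field.fn(values)
    return values

print(resolve(Model))  # {'totalRent': 840.0, 'livingSpace': 42.0, 'pricePerSqm': 20.0}
```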
All that's left is to create an instance of the class and run the scraper.
```python
# When your script is executed directly
if __name__ == "__main__":
    dummyScraper = DummyScraper()
    dummyScraper.run()
```
API documentation
class Ad
Properties
As shown above, each Ad property can be declared as a scraper attribute to specify what to extract.
| Name | Type | Description |
|---|---|---|
| status | String | "listed" if it needs more scraping, "scraped" if it's done |
| site | String | Name of the website |
| createdAt | DateTime | Date the ad was first scraped |
| siteId | String | The unique ID from the site the ad is scraped from |
| serviceCharge | Float | Extra costs (mostly heating) |
| baseRent | Float | Base costs (without heating) |
| totalRent | Float | Total cost |
| livingSpace | Float | Surface in square meters |
| pricePerSqm | Float | Price per square meter |
| furnished | Bool | True if the flat or house is furnished |
| realtor | Bool | True if offered by a realtor, False if rented out by a private person |
| realtorName | Unicode | The name of the realtor or person offering the flat |
| latitude | Float | Latitude |
| longitude | Float | Longitude |
| balcony | Bool | True if there is a balcony/terrace |
| yearConstructed | String | The year the building was built |
| cellar | Bool | True if the flat comes with a cellar |
| parking | Bool | True if the flat comes with a parking spot or a garage |
| houseNumber | String | House number in the street |
| street | String | Street name (incl. "street") |
| zipCode | String | ZIP code |
| city | Unicode | City |
| lift | Bool | True if a lift is present |
| typeOfFlat | String | Type of flat (no typology) |
| noRooms | String | Number of rooms |
| floor | String | Floor the flat is on |
| garden | Bool | True if there is a garden |
| barrierFree | Bool | True if the flat is wheelchair accessible |
| country | String | Country as a 2-letter code |
| sourceUrl | String | URL of the page |
class Scraper
Methods
The Scraper class defines a number of methods that you are encouraged to override in order to take full control of the scraper's behaviour.
| Name | Description |
|---|---|
| ^{tt39}$ | Extract the ads list from a page's soup. |
| ^{tt40}$ | Print out an error message. |
| ^{tt41}$ | Fetch a single ad page from the target website, then create Ad instances by calling ^{tt42}$. |
| ^{tt43}$ | Fetch a single list page from the target website, then fetch each ad by calling ^{tt41}$. |
| ^{tt45}$ | Extract the ad blocks from a list page. Called within ^{tt43}$. |
| ^{tt47}$ | Extract the href attribute from an ad block. Called within ^{tt43}$. |
| ^{tt49}$ | Extract a siteId from an ad block. Called within ^{tt43}$. |
| ^{tt51}$ | Used internally to generate the list of properties to extract from the ad. |
| ^{tt52}$ | Fetch a list page from the target website. |
| ^{tt53}$ | True if we met issues with this ad before. |
| ^{tt54}$ | True if we already scraped this ad before. |
| ^{tt55}$ | Print out a success message. |
| ^{tt56}$ | Called just before saving the values. |
| run | Run the scraper. |
| ^{tt58}$ | Transform the HTML content of the page before parsing it. |
Running migrations
Create a new migration with Yoyo:

```shell
yoyo new ./migrations -m "Your migration's description"
```
Then apply it:

```shell
yoyo apply --database mysql://user:password@host/db ./migrations
```
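A Yoyo migration is a plain Python file placed in ./migrations. Here is a sketch of what one might contain for an ads table, using Yoyo's `step` API; the file name, table name, and columns are illustrative assumptions, not the package's actual schema:

```python
# migrations/0001.create-ads-table.py  (hypothetical file name)
from yoyo import step

steps = [
    step(
        # Applied by `yoyo apply`
        """
        CREATE TABLE ads (
            id INT AUTO_INCREMENT PRIMARY KEY,
            siteId VARCHAR(255),
            totalRent FLOAT,
            livingSpace FLOAT,
            pricePerSqm FLOAT
        )
        """,
        # Reverted by `yoyo rollback`
        "DROP TABLE ads",
    )
]
```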