A basic framework for scraping rent ads

Detailed description of the rentswatch-scraper Python project


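This package provides a simple and maintainable way to build a Rentswatch scraper. Rentswatch is a cross-border investigation that collects data about housing rents in Europe. Its scrapers mainly focus on classified ads.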

How to install

Install using pip:

    pip install rentswatch-scraper

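How to use

Let's look at a quick example that uses rentswatch-scraper to build a simple model-backed scraper that collects data from a website.

First, import the package components needed to build the scraper: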

    #!/usr/bin/env python
    from rentswatch_scraper.scraper import Scraper
    from rentswatch_scraper.browser import geocode, convert
    from rentswatch_scraper.fields import RegexField, ComputedField
    from rentswatch_scraper import reporting

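To factor out as much code as possible, we created an abstract class that every scraper implements. For the sake of simplicity, we will use a dummy website as follows: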

    class DummyScraper(Scraper):
        # These are the basic meta-properties that define the scraper behavior
        class Meta:
            country = 'FR'
            site = "dummy"
            baseUrl = 'http://dummy.io'
            listUrl = baseUrl + '/rent/city/paris/list.php'
            adBlockSelector = '.ad-page-link'

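Without any further configuration, this scraper will start collecting ads from the list page of dummy.io. To find the links to the ads, it uses the CSS selector .ad-page-link to get <a> tags and follows their href attributes.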
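The link discovery step described above can be sketched with the standard library alone. This is only an illustration of the idea, not the package's own code: the AdLinkParser class and the sample HTML are made up for the example, and only the list URL and the ad-page-link class come from the Meta declaration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AdLinkParser(HTMLParser):
    """Collect href values of <a> tags carrying the `ad-page-link` class."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "ad-page-link" in classes and attrs.get("href"):
            # Resolve relative links against the list page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))


list_url = "http://dummy.io/rent/city/paris/list.php"

# Hypothetical list page content.
html = """
<ul>
  <li><a class="ad-page-link" href="/rent/ad/1.php">Flat 1</a></li>
  <li><a class="ad-page-link featured" href="ad/2.php">Flat 2</a></li>
  <li><a class="nav" href="/about.php">About</a></li>
</ul>
"""

parser = AdLinkParser(list_url)
parser.feed(html)
print(parser.links)
```

Only the first two links are kept, because the third <a> tag does not carry the ad-page-link class.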

We now have to teach the scraper how to extract key figures from the ad page.

    class DummyScraper(Scraper):
        # HEADS UP: Meta declarations are hidden here
        # ...
        # ...
        # Extract data using a CSS Selector.
        realtorName = RegexField('.realtor-title')
        # Extract data using a CSS Selector and a Regex.
        serviceCharge = RegexField('.description-list', r'charges : (.*)\s€')
        # Extract data using a CSS Selector and a Regex.
        # This will throw a custom exception if the field is missing.
        livingSpace = RegexField('.description-list', r'surface :(\d*)', required=True, exception=reporting.SpaceMissingError)
        # Extract the value directly, without using a Regex.
        totalRent = RegexField('.description-price', required=True, exception=reporting.RentMissingError)
        # Store this value as a private property (beginning with an underscore).
        # It won't be saved in the database but it can be helpful, as you will see.
        _address = RegexField('.description-address')

Every attribute will be saved as a property of the ad, according to the Ad model.
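The two regular expressions used in the field declarations can be tried out with Python's re module alone. The snippet below is a sketch under stated assumptions, not the package's extraction code: the extract helper and the sample text are hypothetical, and only the two patterns are taken from the example.

```python
import re


def extract(pattern, text):
    """Return the first capture group of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical text content of the `.description-list` element.
description = "surface :42 m2 / charges : 150 €"

service_charge = extract(r'charges : (.*)\s€', description)
living_space = extract(r'surface :(\d*)', description)
print(service_charge, living_space)

# `\d*` also matches zero digits, so a space after the colon yields an
# empty string instead of a number.
empty = extract(r'surface :(\d*)', "surface : 42 m2")

# When the pattern is absent there is no match at all.
missing = extract(r'surface :(\d*)', "no surface given")
```

Captured groups are strings, so they need to be converted to numbers before doing any arithmetic on them.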

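Some properties may not be extractable from the HTML. You may need to use a custom function that receives the existing properties. For this reason we created a second field type named ComputedField. Since the declaration order of the attributes is recorded, we can use previously declared (and extracted) values to compute new ones.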

    class DummyScraper(Scraper):
        # ...
        # ...
        # Use existing properties `totalRent` and `livingSpace` as they were
        # extracted before this one.
        pricePerSqm = ComputedField(fn=lambda s, values: values["totalRent"] / values["livingSpace"])
        # This full example uses private properties to find latitude and longitude.
        # To do so we use a built-in function named `geocode` that transforms an
        # address into a dictionary of coordinates.
        _latLng = ComputedField(fn=lambda s, values: geocode(values['_address'], 'FRA'))
        # Get the dictionary fields we want.
        latitude = ComputedField(fn=lambda s, values: values['_latLng']['lat'])
        longitude = ComputedField(fn=lambda s, values: values['_latLng']['lng'])
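To see why the declaration order matters, here is a minimal standalone sketch. It does not use rentswatch-scraper and is not how the package is implemented: fake_geocode, the sample values and the computed_fields list are assumptions made up for the illustration.

```python
# Hypothetical stand-in for the package's `geocode` helper, so that the
# sketch runs offline. The coordinates are made up for illustration.
def fake_geocode(address, country):
    return {"lat": 48.85, "lng": 2.35}


# Values assumed to be already extracted and converted to numbers.
values = {
    "totalRent": 900.0,
    "livingSpace": 30.0,
    "_address": "1 rue Dummy, Paris",
}

# Computed fields in declaration order. Each function receives the scraper
# instance (unused here, so None is passed) and the values gathered so far.
computed_fields = [
    ("pricePerSqm", lambda s, values: values["totalRent"] / values["livingSpace"]),
    ("_latLng", lambda s, values: fake_geocode(values["_address"], "FRA")),
    ("latitude", lambda s, values: values["_latLng"]["lat"]),
    ("longitude", lambda s, values: values["_latLng"]["lng"]),
]

for name, fn in computed_fields:
    values[name] = fn(None, values)

# Private properties (leading underscore) are dropped before saving.
public = {k: v for k, v in values.items() if not k.startswith("_")}
print(public)
```

If latitude were evaluated before _latLng, the lookup would raise a KeyError, which is why each computed field can only rely on values declared before it.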

All you need to do now is create an instance of your class and run the scraper.

    # When your script is executed directly
    if __name__ == "__main__":
        dummyScraper = DummyScraper()
        dummyScraper.run()

API Doc

class Ad

Attributes

As shown above, every Ad attribute can be used as a Scraper attribute to declare which property to extract.

Name       Type       Description
^{tt8}$    String     “listed” if it needs more scraping, “scraped” if it’s done
^{tt9}$    String     Name of the website
^{tt10}$   DateTime   Date the ad was first scraped
^{tt11}$   String     The unique ID from the site it was scraped from
^{tt12}$   Float      Extra costs (heating mostly)
^{tt13}$   Float      Base costs (without heating)
^{tt14}$   Float      Total cost
^{tt15}$   Float      Surface in square meters
^{tt16}$   Float      Price per square meter
^{tt17}$   Bool       True if the flat or house is furnished
^{tt18}$   Bool       True if offered by a realtor, False if rented by a private individual
^{tt19}$   Unicode    The name of the realtor or person offering the flat
^{tt20}$   Float      Latitude
^{tt21}$   Float      Longitude
^{tt22}$   Bool       True if there is a balcony or terrace
^{tt23}$   String     The year the building was built
^{tt24}$   Bool       True if the flat comes with a cellar
^{tt25}$   Bool       True if the flat comes with a parking space or a garage
^{tt26}$   String     House number in the street
^{tt27}$   String     Street name (incl. “street”)
^{tt28}$   String     ZIP code
^{tt29}$   Unicode    City
^{tt30}$   Bool       True if a lift is present
^{tt31}$   String     Type of flat (no typology)
^{tt32}$   String     Number of rooms
^{tt33}$   String     Floor the flat is on
^{tt34}$   Bool       True if there is a garden
^{tt35}$   Bool       True if the flat is wheelchair accessible
^{tt36}$   String     Country, 2-letter code
^{tt37}$   String     URL of the page

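class Scraper

Methods

The Scraper class defines a lot of methods that we encourage you to redefine in order to have full control over the scraper's behavior.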

Name       Description
^{tt39}$   Extract the ads list from a page’s soup.
^{tt40}$   Print out an error message.
^{tt41}$   Fetch a single ad page from the target website, then create Ad instances by calling ^{tt42}$.
^{tt43}$   Fetch a single list page from the target website, then fetch an ad by calling ^{tt41}$.
^{tt45}$   Extract an ad block from a page list. Called within ^{tt43}$.
^{tt47}$   Extract an href attribute from an ad block. Called within ^{tt43}$.
^{tt49}$   Extract a siteId from an ad block. Called within ^{tt43}$.
^{tt51}$   Used internally to generate the list of properties to extract from the ad.
^{tt52}$   Fetch a list page from the target website.
^{tt53}$   True if we met issues with this ad before.
^{tt54}$   True if we already scraped this ad before.
^{tt55}$   Print out a success message.
^{tt56}$   Called just before saving the values.
^{tt57}$   Run the scraper.
^{tt58}$   Transform the HTML content of the series page before parsing it.

Running migrations

Create a new migration using Yoyo:

    yoyo new ./migrations -m "Your migration's description"

And apply it:

    yoyo apply --database mysql://user:password@host/db ./migrations
