rentswatch-scraper
A basic framework for building rental-ad scrapers
This package provides a simple and maintainable way to build rental-ad scrapers. Rentswatch is a cross-border investigation that collects data on housing rents across Europe. Its search engine focuses mainly on classified ads.
How to install
Using pip:

```shell
pip install rentswatch-scraper
```
How to use
Let's walk through a quick example that uses rentswatch-scraper to build a simple, model-backed scraper that collects data from a website.
First, import the package components needed to build a scraper:
```python
#!/usr/bin/env python
from rentswatch_scraper.scraper import Scraper
from rentswatch_scraper.browser import geocode, convert
from rentswatch_scraper.fields import RegexField, ComputedField
from rentswatch_scraper import reporting
```
To factor out as much code as possible, we created an abstract class that every scraper extends. To keep things simple, we will target a dummy website:
```python
class DummyScraper(Scraper):
    # Those are the basic meta-properties that define the scraper behavior
    class Meta:
        country = 'FR'
        site = "dummy"
        baseUrl = 'http://dummy.io'
        listUrl = baseUrl + '/rent/city/paris/list.php'
        adBlockSelector = '.ad-page-link'
```
Without any further configuration, this scraper will start collecting ads from the dummy.io list page. To find the links to individual ads, it uses the CSS selector `.ad-page-link` to pick up `<a>` tags and follows their `href` attributes.
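Under the hood, this kind of link extraction boils down to selecting the matching tags and reading their `href` attributes. Below is a minimal stand-alone sketch using only the standard library; the package itself presumably relies on an HTML-parsing library, and the HTML snippet here is invented for illustration:

```python
from html.parser import HTMLParser

class AdLinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.css_class in (attrs.get("class") or "").split():
            self.hrefs.append(attrs.get("href"))

# Invented list-page markup, mimicking what dummy.io might serve.
page = """
<ul>
  <li><a class="ad-page-link" href="/rent/ad/1.php">Flat in Paris</a></li>
  <li><a class="other-link" href="/about.php">About</a></li>
  <li><a class="ad-page-link" href="/rent/ad/2.php">Studio</a></li>
</ul>
"""

extractor = AdLinkExtractor("ad-page-link")
extractor.feed(page)
print(extractor.hrefs)  # ['/rent/ad/1.php', '/rent/ad/2.php']
```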
We now teach the scraper how to extract the key figures from an ad page.
```python
class DummyScraper(Scraper):
    # HEADS UP: Meta declarations are hidden here
    # ...
    # ...

    # Extract data using a CSS selector.
    realtorName = RegexField('.realtor-title')

    # Extract data using a CSS selector and a regex.
    serviceCharge = RegexField('.description-list', r'charges : (.*)\s€')

    # Extract data using a CSS selector and a regex.
    # This will throw a custom exception if the field is missing.
    livingSpace = RegexField('.description-list', r'surface :(\d*)',
                             required=True,
                             exception=reporting.SpaceMissingError)

    # Extract the value directly, without using a regex.
    totalRent = RegexField('.description-price',
                           required=True,
                           exception=reporting.RentMissingError)

    # Store this value as a private property (beginning with an underscore).
    # It won't be saved in the database, but it can be helpful, as you'll see.
    _address = RegexField('.description-address')
```
Each of these properties will be extracted from the ad and saved as an attribute of the Ad model.
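The regex half of a RegexField amounts to a plain `re.search` over the text that the CSS selector returned. A quick illustration of that mechanic, with an invented description string and patterns adapted from the ones above:

```python
import re

# Text that a selector like '.description-list' might return (invented example).
text = "surface : 42 m2 - charges : 150 €"

# The first capture group is the value the field keeps.
living_space = re.search(r"surface : (\d+)", text).group(1)
service_charge = re.search(r"charges : (.*)\s€", text).group(1)

print(living_space, service_charge)  # prints: 42 150
```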
Some properties cannot be extracted directly from the HTML; you may need a custom function that receives the already-extracted properties. For this reason we created a second field type named ComputedField. Since the order of property declarations is recorded, previously declared (and extracted) values can be used to compute new ones.
```python
class DummyScraper(Scraper):
    # ...
    # ...

    # Use the existing properties `totalRent` and `livingSpace`, as they were
    # extracted before this one.
    pricePerSqm = ComputedField(
        fn=lambda s, values: values["totalRent"] / values["livingSpace"]
    )

    # This full example uses private properties to find latitude and longitude.
    # To do so we use a built-in function named `geocode` that transforms an
    # address into a dictionary of coordinates.
    _latLng = ComputedField(
        fn=lambda s, values: geocode(values['_address'], 'FRA')
    )

    # Get the dictionary fields we want.
    latitude = ComputedField(fn=lambda s, values: values['_latLng']['lat'])
    longitude = ComputedField(fn=lambda s, values: values['_latLng']['lng'])
```
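The "declaration order is recorded" behaviour can be implemented with a simple counter on the field class. The sketch below shows the idea in isolation; it is not the package's actual implementation, and all names (`Field`, `Model`, `resolve`) are illustrative:

```python
import itertools

class Field:
    """Base field that remembers the order in which it was declared."""
    _counter = itertools.count()

    def __init__(self, fn):
        self.order = next(Field._counter)  # declaration order
        self.fn = fn

class Model:
    # Later fields can use values computed by earlier ones.
    totalRent = Field(fn=lambda values: 840.0)
    livingSpace = Field(fn=lambda values: 42.0)
    pricePerSqm = Field(fn=lambda values: values["totalRent"] / values["livingSpace"])

def resolve(model_cls):
    """Evaluate fields in declaration order, feeding earlier results to later ones."""
    fields = sorted(
        ((name, f) for name, f in vars(model_cls).items() if isinstance(f, Field)),
        key=lambda pair: pair[1].order,
    )
    values = {}
    for name, field in fields:
        values[name] = field.fn(values)
    return values

print(resolve(Model))  # {'totalRent': 840.0, 'livingSpace': 42.0, 'pricePerSqm': 20.0}
```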
All that's left is to create an instance of the class and run the scraper.
```python
# When your script is executed directly
if __name__ == "__main__":
    dummyScraper = DummyScraper()
    dummyScraper.run()
```
API documentation
class Ad
Properties
As shown above, each Ad property can be declared as a scraper attribute to specify what to extract.
| Name | Type | Description |
|---|---|---|
| status | String | "listed" if it needs more scraping, "scraped" if it's done |
| site | String | Name of the website |
| createdAt | DateTime | Date the ad was first scraped |
| siteId | String | The unique ID from the site the ad is scraped from |
| serviceCharge | Float | Extra costs (mostly heating) |
| baseRent | Float | Base costs (without heating) |
| totalRent | Float | Total cost |
| livingSpace | Float | Surface in square meters |
| pricePerSqm | Float | Price per square meter |
| furnished | Bool | True if the flat or house is furnished |
| realtor | Bool | True if offered by a realtor, False if rented out by a private person |
| realtorName | Unicode | The name of the realtor or person offering the flat |
| latitude | Float | Latitude |
| longitude | Float | Longitude |
| balcony | Bool | True if there is a balcony/terrace |
| yearConstructed | String | The year the building was built |
| cellar | Bool | True if the flat comes with a cellar |
| parking | Bool | True if the flat comes with a parking spot or a garage |
| houseNumber | String | House number in the street |
| street | String | Street name (incl. "street") |
| zipCode | String | ZIP code |
| city | Unicode | City |
| lift | Bool | True if a lift is present |
| typeOfFlat | String | Type of flat (no typology) |
| noRooms | String | Number of rooms |
| floor | String | Floor the flat is on |
| garden | Bool | True if there is a garden |
| barrierFree | Bool | True if the flat is wheelchair accessible |
| country | String | Country as a 2-letter code |
| sourceUrl | String | URL of the page |
class Scraper
Methods
The Scraper class defines a number of methods that you are encouraged to override in order to take full control of the scraper's behaviour.
| Name | Description |
|---|---|
| ^{tt39}$ | Extract the ads list from a page's soup. |
| ^{tt40}$ | Print out an error message. |
| ^{tt41}$ | Fetch a single ad page from the target website, then create Ad instances by calling ^{tt42}$. |
| ^{tt43}$ | Fetch a single list page from the target website, then fetch each ad by calling ^{tt41}$. |
| ^{tt45}$ | Extract the ad blocks from a list page. Called within ^{tt43}$. |
| ^{tt47}$ | Extract the href attribute from an ad block. Called within ^{tt43}$. |
| ^{tt49}$ | Extract a siteId from an ad block. Called within ^{tt43}$. |
| ^{tt51}$ | Used internally to generate the list of properties to extract from the ad. |
| ^{tt52}$ | Fetch a list page from the target website. |
| ^{tt53}$ | True if we met issues with this ad before. |
| ^{tt54}$ | True if we already scraped this ad before. |
| ^{tt55}$ | Print out a success message. |
| ^{tt56}$ | Called just before saving the values. |
| run | Run the scraper. |
| ^{tt58}$ | Transform the HTML content of the page before parsing it. |
Running migrations
Create a new migration with Yoyo:

```shell
yoyo new ./migrations -m "Your migration's description"
```
Then apply it:

```shell
yoyo apply --database mysql://user:password@host/db ./migrations
```
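A Yoyo migration is a plain Python file placed in ./migrations. Here is a sketch of what one might contain for an ads table, using Yoyo's `step` API; the file name, table name, and columns are illustrative assumptions, not the package's actual schema:

```python
# migrations/0001.create-ads-table.py  (hypothetical file name)
from yoyo import step

steps = [
    step(
        # Applied by `yoyo apply`
        """
        CREATE TABLE ads (
            id INT AUTO_INCREMENT PRIMARY KEY,
            siteId VARCHAR(255),
            totalRent FLOAT,
            livingSpace FLOAT,
            pricePerSqm FLOAT
        )
        """,
        # Reverted by `yoyo rollback`
        "DROP TABLE ads",
    )
]
```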