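Extracts cooking recipes from HTML structured data in the https://schema.org/Recipe format.
scrape-schema-recipe: detailed Python project description
scrape-schema-recipe
Scrapes recipes from HTML https://schema.org/Recipe (Microdata/JSON-LD) into Python dictionaries.
Install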
pip install scrape-schema-recipe
Requirements
Python version 3.5+
This library relies heavily upon extruct.
Other requirements:
- isodate (>=0.5.1)
- requests
- validators (>=12.4)
Online Example
>>> import scrape_schema_recipe
>>> url = 'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'
>>> recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
>>> len(recipe_list)
1
>>> recipe = recipe_list[0]

# Name of the recipe
>>> recipe['name']
'Honey Mustard Dressing'

# List of the Ingredients
>>> recipe['recipeIngredient']
['5 tablespoons medium body honey (sourwood is nice)',
 '3 tablespoons smooth Dijon mustard',
 '2 tablespoons rice wine vinegar']

# List of the Instructions
>>> recipe['recipeInstructions']
['Combine all ingredients in a bowl and whisk until smooth. Serve as a dressing or a dip.']

# Author
>>> recipe['author']
[{'@type': 'Person',
  'name': 'Alton Brown',
  'url': 'https://www.foodnetwork.com/profiles/talent/alton-brown'}]
'@type': 'Person' is a https://schema.org/Person object.
# Preparation Time
>>> recipe['prepTime']
datetime.timedelta(0, 300)

# The library pendulum can give you something a little easier to read.
>>> import pendulum

# for pendulum version 1.0
>>> pendulum.Interval.instance(recipe['prepTime'])
<Interval [5 minutes]>

# for version 2.0 of pendulum
>>> pendulum.Duration(seconds=recipe['prepTime'].total_seconds())
<Duration [5 minutes]>
If python_objects is set to False, the duration is returned as an ISO 8601 string instead: 'PT5M'.
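Because prepTime can arrive either as a datetime.timedelta or as an ISO 8601 string depending on python_objects, a small helper that accepts both can be convenient. The following is a minimal sketch using only the standard library; to_timedelta and describe are hypothetical helpers, not part of scrape-schema-recipe, and the regular expression covers only the day/hour/minute/second subset of ISO 8601 durations. For the full format, a dedicated parser such as isodate is the safer choice.

```python
import re
from datetime import timedelta

# Hypothetical helpers, not part of scrape-schema-recipe.
# Only durations like 'P1DT2H', 'PT5M' or 'PT1H30M15S' are supported.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def to_timedelta(value):
    """Accept a timedelta or a simple ISO 8601 duration string."""
    if isinstance(value, timedelta):
        return value
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError("unsupported duration: {!r}".format(value))
    parts = {name: int(num) for name, num in match.groupdict().items() if num}
    if not parts:
        raise ValueError("empty duration: {!r}".format(value))
    return timedelta(**parts)


def describe(value):
    """Format a duration as hours and minutes, e.g. '1 h 5 min'."""
    total_minutes = int(to_timedelta(value).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    if hours:
        return "{} h {} min".format(hours, minutes)
    return "{} min".format(minutes)


print(describe("PT5M"))              # python_objects=False form
print(describe(timedelta(0, 300)))   # python_objects=True form
```

Both calls describe the same five-minute preparation time, so downstream code does not need to know which setting produced the dictionary.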
# Publication date
>>> recipe['datePublished']
datetime.datetime(2016, 11, 13, 21, 5, 50, 518000, tzinfo=<FixedOffset '-05:00'>)
>>> str(recipe['datePublished'])
'2016-11-13 21:05:50.518000-05:00'

# Identifying this is http://schema.org/Recipe data (in LD-JSON format)
>>> recipe['@context'], recipe['@type']
('http://schema.org', 'Recipe')

# Content's URL
>>> recipe['url']
'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'

# all the keys in this dictionary
>>> recipe.keys()
dict_keys(['recipeYield', 'totalTime', 'dateModified', 'url', '@context', 'name', 'publisher', 'prepTime', 'datePublished', 'recipeIngredient', '@type', 'recipeInstructions', 'author', 'mainEntityOfPage', 'aggregateRating', 'recipeCategory', 'image', 'headline', 'review'])
Example from a File (alternative representations)
Also works with a locally saved HTML example file.
>>> filelocation = 'test_data/google-recipe-example.html'
>>> recipe_list = scrape_schema_recipe.scrape(filelocation, python_objects=True)
>>> recipe = recipe_list[0]
>>> recipe['name']
'Party Coffee Cake'
>>> recipe['datePublished']
datetime.date(2018, 3, 10)

# Recipe Instructions using the HowToStep
>>> recipe['recipeInstructions']
[{'@type': 'HowToStep',
  'text': 'Preheat the oven to 350 degrees F. Grease and flour a 9x9 inch pan.'},
 {'@type': 'HowToStep',
  'text': 'In a large bowl, combine flour, sugar, baking powder, and salt.'},
 {'@type': 'HowToStep', 'text': 'Mix in the butter, eggs, and milk.'},
 {'@type': 'HowToStep', 'text': 'Spread into the prepared pan.'},
 {'@type': 'HowToStep', 'text': 'Bake for 30 to 35 minutes, or until firm.'},
 {'@type': 'HowToStep', 'text': 'Allow to cool.'}]
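The two examples show recipeInstructions in two shapes: a list of plain strings and a list of HowToStep dictionaries. Code that consumes the output may want a single shape. The sketch below flattens both into a list of strings; instruction_texts is a hypothetical helper, not part of scrape-schema-recipe, and the handling of a bare string value is an assumption about what other sites might publish.

```python
def instruction_texts(recipe):
    """Return recipeInstructions as a flat list of strings.

    Hypothetical helper, not part of scrape-schema-recipe.
    """
    steps = recipe.get("recipeInstructions", [])
    if isinstance(steps, str):
        # Assumption: some pages may publish one block of text.
        return [steps]
    texts = []
    for step in steps:
        if isinstance(step, dict):
            # HowToStep-style entry: keep its 'text' value when present.
            text = step.get("text")
            if text:
                texts.append(text)
        else:
            texts.append(str(step))
    return texts


# Hand-written dictionaries mirroring the two shapes shown in this README.
plain = {"recipeInstructions": ["Combine all ingredients in a bowl and whisk until smooth."]}
how_to = {
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Spread into the prepared pan."},
        {"@type": "HowToStep", "text": "Allow to cool."},
    ]
}

for text in instruction_texts(plain) + instruction_texts(how_to):
    print("-", text)
```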
What Happens When Things Go Wrong
If there aren't any http://schema.org/Recipe formatted recipes on the site, an empty list is returned.
>>> url = 'https://www.google.com'
>>> recipe_list = scrape_schema_recipe.scrape(url, python_objects=True)
>>> len(recipe_list)
0
Some websites will cause an HTTPError.
You may be able to get around a 403 Forbidden error by supplying an alternative user agent through the user_agent_str parameter.
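One way to apply this is to retry with a list of user agent strings until one succeeds. The sketch below is an illustration only: scrape_with_user_agents is a hypothetical wrapper, not part of scrape-schema-recipe, and fake_scrape is an offline stand-in so the example runs without network access. The only library detail it relies on is the documented user_agent_str keyword.

```python
def scrape_with_user_agents(url, scrape_func, user_agents, error_type=Exception):
    """Call scrape_func(url, user_agent_str=...) with each agent in turn.

    Hypothetical wrapper. Returns the first successful result and
    re-raises the last error if every user agent fails.
    """
    last_error = None
    for agent in user_agents:
        try:
            return scrape_func(url, user_agent_str=agent)
        except error_type as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("user_agents must not be empty")
    raise last_error


def fake_scrape(url, user_agent_str=None):
    """Offline stand-in that rejects the default user agent, as a site
    answering 403 Forbidden might."""
    if user_agent_str is None:
        raise RuntimeError("403 Forbidden")
    return [{"name": "Example Recipe", "url": url}]


recipes = scrape_with_user_agents(
    "https://example.com/recipe",
    fake_scrape,
    [None, "Mozilla/5.0 (compatible; ExampleBot/1.0)"],
    error_type=RuntimeError,
)
print(len(recipes))
```

With the real library, scrape_schema_recipe.scrape_url would take the place of fake_scrape, and error_type would be whichever HTTPError class your installation raises. This README does not say which module that class comes from, so check before narrowing the except clause.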
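Functions
- load() - load HTML schema.org/Recipe structured data from a file or file-like object
- loads() - load HTML schema.org/Recipe structured data from a string
- scrape_url() - scrape a URL for HTML schema.org/Recipe structured data
- scrape() - load HTML schema.org/Recipe structured data from a file, file-like object, string, or URL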
Parameters
----------
location : string or file-like object
    A url, filename, or text_string of HTML, or a file-like object.
python_objects : bool, list, or tuple (optional)
    when True it translates certain data types into python objects:
        dates into datetime.date, datetimes into datetime.datetime,
        durations into datetime.timedelta.
    when set to a list or tuple it only converts the types specified to
        python objects:
        * when set to either [datetime.date] or [datetime.datetime] either will
          convert dates.
        * when set to [datetime.timedelta] durations will be converted
    when False no conversion is performed
    (defaults to False)
nonstandard_attrs : bool, optional
    when True it adds nonstandard (for schema.org/Recipe) attributes to the
    resulting dictionaries, that are outside the specification such as:
        '_format' is either 'json-ld' or 'microdata' (how schema.org/Recipe was encoded into HTML)
        '_source_url' is the source url, when 'url' has already been defined as another value
    (defaults to False)
migrate_old_schema : bool, optional
    when True it migrates the schema from an older version to the current version
    (defaults to True)
user_agent_str : string, optional  ***only for scrape_url() and scrape()***
    override the user agent string with this value.
    (defaults to None)

Returns
-------
list
    a list of dictionaries in the style of schema.org/Recipe JSON-LD
    no results - an empty list will be returned
These are also available with help() in the Python console.
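When nonstandard_attrs is True, the extra keys can help trace where a recipe came from. The sketch below shows one way to read them defensively; describe_source is a hypothetical helper, not part of scrape-schema-recipe, and the sample dictionaries are hand-written to resemble the documented keys rather than real scraper output.

```python
def describe_source(recipe):
    """Summarise where a recipe dictionary came from.

    Hypothetical helper. Prefers '_source_url' (present when 'url' was
    already defined as another value) and falls back to 'url'.
    """
    source = recipe.get("_source_url") or recipe.get("url") or "unknown source"
    encoding = recipe.get("_format", "unknown format")
    return "{} ({})".format(source, encoding)


# Hand-written samples; real output depends on the page being scraped.
with_attrs = {
    "name": "Example Recipe",
    "url": "https://example.com/canonical",
    "_source_url": "https://example.com/recipe?id=1",
    "_format": "json-ld",
}
without_attrs = {"name": "Example Recipe", "url": "https://example.com/canonical"}

print(describe_source(with_attrs))
print(describe_source(without_attrs))
```

Reading the keys with dict.get keeps the helper usable on dictionaries produced with the default nonstandard_attrs=False, where neither key exists.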
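Example function
The example_output() function gives quick access to data for prototyping and debugging.
It accepts the same parameters as load(), except that its first parameter is name.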
>>> from scrape_schema_recipe import example_names, example_output
>>> example_names
('irish-coffee', 'google', 'tart', 'tea-cake', 'truffles')
>>> recipes = example_output('truffles')
>>> recipes[0]['name']
'Rum & Tonka Bean Dark Chocolate Truffles'
Files
License: Apache 2.0, see LICENSE
Test data attribution and licensing: ATTRIBUTION.md
Development
The unit tests can be run with:
schema-recipe-scraper$ python3 test_scrape.py
mypy is used for static type checking.
From the project directory:
schema-recipe-scraper$ mypy schema_recipe_scraper/scrape.py
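If you run mypy from another directory, the --ignore-missing-imports flag needs to be added,
thus $ mypy --ignore-missing-imports scrape.py
The --ignore-missing-imports flag is used because most libraries don't include static typing information in their own code or in typeshed.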
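Reference Documentation
Here are some references for how schema.org/Recipe should be structured:
- https://schema.org/Recipe - official specification
- Recipe Google Search Guide - material teaching developers how to use the schema (with emphasis on how structured data impacts search results)
Other Similar Python Libraries
- recipe_scrapers - a library that scrapes recipes from HTML tags using BeautifulSoup. It has a driver for each of the supported websites. This is a great fallback for when schema-recipe-scraper cannot scrape a site.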