Extract cooking recipes from HTML structured data in the https://schema.org/Recipe format.

Detailed description of the scrape-schema-recipe Python project


scrape-schema-recipe


Scrapes recipes from HTML https://schema.org/Recipe data (Microdata/JSON-LD) into Python dictionaries.
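To make the idea concrete, here is a minimal standard-library sketch of pulling a JSON-LD Recipe out of HTML. It is not how this library works internally (the library relies on extruct and also handles Microdata); the names JsonLdCollector and find_recipes, and the sample HTML, are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipes(html):
    """Return every top-level JSON-LD object whose @type is 'Recipe'."""
    collector = JsonLdCollector()
    collector.feed(html)
    recipes = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # skip malformed JSON-LD blocks
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                recipes.append(item)
    return recipes


# Hypothetical sample page, not taken from any real site.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Recipe",
 "name": "Party Coffee Cake", "prepTime": "PT20M"}
</script>
</head><body></body></html>
"""

recipes = find_recipes(SAMPLE_HTML)
print(recipes[0]["name"])
```

The sketch ignores Microdata, nested "@graph" containers, and list-valued "@type" fields, which is part of why a dedicated extractor is the better choice in practice.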

Install

pip install scrape-schema-recipe

Requirements

Python version 3.5+

This library relies heavily on extruct.

Other requirements:

  • isodate (>=0.5.1)
  • requests
  • validators (>=12.4)

Online Example

>>> import scrape_schema_recipe

>>> url = 'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'
>>> recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
>>> len(recipe_list)
1
>>> recipe = recipe_list[0]

# Name of the recipe
>>> recipe['name']
'Honey Mustard Dressing'

# List of the Ingredients
>>> recipe['recipeIngredient']
['5 tablespoons medium body honey (sourwood is nice)',
 '3 tablespoons smooth Dijon mustard',
 '2 tablespoons rice wine vinegar']

# List of the Instructions
>>> recipe['recipeInstructions']
['Combine all ingredients in a bowl and whisk until smooth. Serve as a dressing or a dip.']

# Author
>>> recipe['author']
[{'@type': 'Person',
  'name': 'Alton Brown',
  'url': 'https://www.foodnetwork.com/profiles/talent/alton-brown'}]

'@type': 'Person' is a https://schema.org/Person object.

# Preparation Time
>>> recipe['prepTime']
datetime.timedelta(0, 300)

# The library pendulum can give you something a little easier to read.
>>> import pendulum

# for pendulum version 1.0
>>> pendulum.Interval.instance(recipe['prepTime'])
<Interval [5 minutes]>

# for version 2.0 of pendulum
>>> pendulum.Duration(seconds=recipe['prepTime'].total_seconds())
<Duration [5 minutes]>

If python_objects is set to False, the duration is returned as an ISO 8601 string instead, here 'PT5M'.
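If you keep the raw strings, a duration such as 'PT5M' can still be converted later. The following is a small sketch using only the standard library; parse_duration is a hypothetical helper, not part of scrape-schema-recipe (which depends on isodate for this), and it only covers days, hours, minutes, and whole seconds.

```python
import datetime
import re

# Matches durations such as 'PT5M', 'PT1H30M' or 'P1DT2H'. Years and months
# are deliberately unsupported because their length in days is ambiguous.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Convert a simple ISO 8601 duration string into a datetime.timedelta."""
    match = _DURATION_RE.match(text)
    if match is None or text in ("P", "PT"):
        raise ValueError("unsupported ISO 8601 duration: %r" % text)
    # Keep only the components that were present in the string.
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    return datetime.timedelta(**parts)


print(parse_duration("PT5M"))
```

For anything beyond these simple cases (fractional seconds, weeks, years, months), a full ISO 8601 parser is the safer option.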

pendulum's library website

# Publication date
>>> recipe['datePublished']
datetime.datetime(2016, 11, 13, 21, 5, 50, 518000, tzinfo=<FixedOffset '-05:00'>)
>>> str(recipe['datePublished'])
'2016-11-13 21:05:50.518000-05:00'

# Identifying this is http://schema.org/Recipe data (in LD-JSON format)
>>> recipe['@context'], recipe['@type']
('http://schema.org', 'Recipe')

# Content's URL
>>> recipe['url']
'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'

# all the keys in this dictionary
>>> recipe.keys()
dict_keys(['recipeYield', 'totalTime', 'dateModified', 'url', '@context', 'name',
           'publisher', 'prepTime', 'datePublished', 'recipeIngredient', '@type',
           'recipeInstructions', 'author', 'mainEntityOfPage', 'aggregateRating',
           'recipeCategory', 'image', 'headline', 'review'])

Example from a File (alternative representations)

Also works with locally saved HTML (example file).

>>> filelocation = 'test_data/google-recipe-example.html'
>>> recipe_list = scrape_schema_recipe.scrape(filelocation, python_objects=True)
>>> recipe = recipe_list[0]
>>> recipe['name']
'Party Coffee Cake'
>>> recipe['datePublished']
datetime.date(2018, 3, 10)

# Recipe Instructions using the HowToStep
>>> recipe['recipeInstructions']
[{'@type': 'HowToStep',
  'text': 'Preheat the oven to 350 degrees F. Grease and flour a 9x9 inch pan.'},
 {'@type': 'HowToStep',
  'text': 'In a large bowl, combine flour, sugar, baking powder, and salt.'},
 {'@type': 'HowToStep', 'text': 'Mix in the butter, eggs, and milk.'},
 {'@type': 'HowToStep', 'text': 'Spread into the prepared pan.'},
 {'@type': 'HowToStep', 'text': 'Bake for 30 to 35 minutes, or until firm.'},
 {'@type': 'HowToStep', 'text': 'Allow to cool.'}]
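Because recipeInstructions can arrive either as plain strings (the online example) or as HowToStep dictionaries (the file example), calling code may want one uniform shape. A minimal sketch follows; instruction_texts is a hypothetical helper, not a function provided by this library.

```python
def instruction_texts(recipe_instructions):
    """Flatten recipeInstructions into a plain list of step strings."""
    # A single string is treated as one step.
    if isinstance(recipe_instructions, str):
        return [recipe_instructions]
    texts = []
    for step in recipe_instructions:
        if isinstance(step, dict):
            # HowToStep objects carry their wording under 'text'.
            text = step.get("text")
            if text:
                texts.append(text)
        else:
            texts.append(str(step))
    return texts


how_to_steps = [
    {"@type": "HowToStep", "text": "Spread into the prepared pan."},
    {"@type": "HowToStep", "text": "Allow to cool."},
]
for number, text in enumerate(instruction_texts(how_to_steps), start=1):
    print(number, text)
```

Nested structures such as HowToSection are not handled here; they would need a recursive walk.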

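What Happens When Things Go Wrong

If there aren't any http://schema.org/Recipe formatted recipes on the site, an empty list is returned.

>>> url = 'https://www.google.com'
>>> recipe_list = scrape_schema_recipe.scrape(url, python_objects=True)
>>> len(recipe_list)
0

Some websites will cause an HTTPError.

You may be able to get around a 403 Forbidden error by supplying an alternative user agent through the user_agent_str parameter.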
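One way to apply this is to retry with a list of alternative user agents. The sketch below is an assumption-laden illustration: scrape_with_user_agents is a hypothetical wrapper, the scraping function and the error class are passed in by the caller, and which HTTPError class the library actually raises should be checked against your installed version.

```python
def scrape_with_user_agents(url, scrape_fn, http_error, user_agents):
    """Call scrape_fn(url, user_agent_str=...) until one attempt succeeds.

    scrape_fn   -- a callable accepting user_agent_str, e.g. scrape_url
    http_error  -- the exception class to treat as a retryable HTTP failure
    user_agents -- alternative user-agent strings to try, in order
    """
    last_error = None
    # None is the documented default for user_agent_str, so try it first.
    for agent in [None] + list(user_agents):
        try:
            return scrape_fn(url, user_agent_str=agent)
        except http_error as error:
            last_error = error
    raise last_error


# Intended usage (assumption: the library raises requests' HTTPError):
#   import requests, scrape_schema_recipe
#   recipes = scrape_with_user_agents(
#       url, scrape_schema_recipe.scrape_url,
#       requests.exceptions.HTTPError, ["Mozilla/5.0"])
```

Passing the error class explicitly keeps the wrapper free of any guess about the library's internals.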

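Functions

  • load() - load HTML schema.org/Recipe structured data from a file or file-like object
  • loads() - load HTML schema.org/Recipe structured data from a string
  • scrape_url() - scrape a URL for HTML schema.org/Recipe structured data
  • scrape() - load HTML schema.org/Recipe structured data from a file, file-like object, string, or URL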
    Parameters
    ----------
    location : string or file-like object
        A url, filename, or text_string of HTML, or a file-like object.

    python_objects : bool, list, or tuple  (optional)
        when True it translates certain data types into python objects
          dates into datetime.date, datetimes into datetime.datetime,
          durations as datetime.timedelta.
        when set to a list or tuple only converts types specified to
          python objects:
            * when set to either [datetime.date] or [datetime.datetime] either will
              convert dates.
            * when set to [datetime.timedelta] durations will be converted
        when False no conversion is performed
        (defaults to False)

    nonstandard_attrs : bool, optional
        when True it adds nonstandard (for schema.org/Recipe) attributes to the
        resulting dictionaries, that are outside the specification such as:
            '_format' is either 'json-ld' or 'microdata' (how schema.org/Recipe was encoded into HTML)
            '_source_url' is the source url, when 'url' has already been defined as another value
        (defaults to False)

    migrate_old_schema : bool, optional
        when True it migrates the schema from older version to current version
        (defaults to True)

    user_agent_str : string, optional  ***only for scrape_url() and scrape()***
        override the user_agent_string with this value.
        (defaults to None)

    Returns
    -------
    list
        a list of dictionaries in the style of schema.org/Recipe JSON-LD
        no results - an empty list will be returned

These are also available via help() in the Python console.

Example function

The example_output() function gives quick access to data for prototyping and debugging. It accepts the same parameters as load(), except that its first parameter is name.

>>> from scrape_schema_recipe import example_names, example_output

>>> example_names
('irish-coffee', 'google', 'tart', 'tea-cake', 'truffles')

>>> recipes = example_output('truffles')
>>> recipes[0]['name']
'Rum & Tonka Bean Dark Chocolate Truffles'

Files

License: Apache 2.0, see LICENSE

Test data attribution and licensing: ATTRIBUTION.md

Development

The unit tests can be run with:

schema-recipe-scraper$ python3 test_scrape.py

mypy is used for static type checking.

From the project directory:

 schema-recipe-scraper$ mypy schema_recipe_scraper/scrape.py

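If you run mypy from another directory, you need to add the --ignore-missing-imports flag, thus $ mypy --ignore-missing-imports scrape.py

The --ignore-missing-imports flag is used because most libraries do not include static typing information in their own code or in typeshed.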

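Reference Documentation

Here are some references for how schema.org/Recipe should be structured:

Other Similar Python Libraries

  • recipe_scrapers - a library that scrapes recipes using HTML tags with BeautifulSoup. It has an individual driver for each supported website. This is a good fallback when schema-recipe-scraper cannot scrape a site.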
