提供无法通过标准amazon api访问的内容

amazon_scraper的Python项目详细描述


一个混合的web scraper/api客户端。用web补充标准Amazon API 抓取功能以获取额外数据具体来说,就是产品评论。

使用Amazon Simple Product API 提供api可访问的数据。API搜索函数直接导入到 亚马逊刮板模块。

参数的样式与amazon简单产品api相同,后者 TURN使用宽吻型参数。因此非pythonic参数名(itemid)。

amazonscraper构造函数将把“args”和“kwargs”传递给Bottlenose(通过amazon简单产品api)。 bottlenose支持aws区域、每秒查询限制、查询缓存和其他不错的特性。请查看瓶鼻喷雾剂的API以获取更多信息

最新版本的python amazon simple product api(编写时为1.5.0)不支持这些arguemnt,只支持region。 如果需要,请使用其存储库中的最新代码和以下命令:

pip install git+https://github.com/yoavaviram/python-amazon-simple-product-api.git#egg=python-amazon-simple-product-api

警告

亚马逊不断地试图阻止刮板工作,他们通过以下方式来做到这一点:

  • A/B测试(随机接收不同的HTML)
  • 同一产品类别的大量html布局。
  • 更改HTML布局。
  • 在iframe中移动内容

亚马逊已经开始将越来越多的内容转移到iframe中,而iframe是这个scraper无法处理的。 我设想,如果没有更复杂的逻辑,大多数数据将无法访问。

我花了很长时间试着让这些铲运机工作,这是一场永无休止的战斗。 我没有时间继续跟上亚马逊的步伐。 如果你有兴趣改善亚马逊刮板,请让我知道(创建一个问题是好的)。 如有任何帮助,我们将不胜感激

安装

pip install amazon_scraper

示例

所有产品始终

创建API实例:

>>> from amazon_scraper import AmazonScraper
>>> amzn = AmazonScraper("put your access key", "secret key", "and associate tag here")

创建函数接受“kwargs”,并将其传递给“bottlenose.amazon”构造函数:

>>> from amazon_scraper import AmazonScraper
>>> amzn = AmazonScraper("put your access key", "secret key", "and associate tag here", Region='UK', MaxQPS=0.9, Timeout=5.0)

搜索:

>>> from __future__ import print_function
>>> import itertools
>>> for p in itertools.islice(amzn.search(Keywords='python', SearchIndex='Books'), 5):
>>>     print(p.title)
Learning Python, 5th Edition
Python Programming: An Introduction to Computer Science 2nd Edition
Python In A Day: Learn The Basics, Learn It Quick, Start Coding Fast (In A Day Books) (Volume 1)
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
Python Cookbook

按ASin查找/itemID:

>>> p = amzn.lookup(ItemId='B00FLIJJSA')
>>> p.title
Kindle, Wi-Fi, 6" E Ink Display - for international shipment
>>> p.url
http://www.amazon.com/Kindle-Wi-Fi-Ink-Display-international/dp/B0051QVF7A/ref=cm_cr_pr_product_top

批量查找:

>>> for p in amzn.lookup(ItemId='B0051QVF7A,B007HCCNJU,B00BTI6HBS'):
>>>     print(p.title)
Kindle, Wi-Fi, 6" E Ink Display - for international shipment
Kindle, 6" E Ink Display, Wi-Fi - Includes Special Offers (Black)
Kindle Paperwhite 3G, 6" High Resolution Display with Next-Gen Built-in Light, Free 3G + Wi-Fi - Includes Special Offers

按URL:

>>> p = amzn.lookup(URL='http://www.amazon.com/Kindle-Wi-Fi-Ink-Display-international/dp/B0051QVF7A/ref=cm_cr_pr_product_top')
>>> p.title
Kindle, Wi-Fi, 6" E Ink Display - for international shipment
>>> p.asin
B0051QVF7A

产品评级:

>>> p = amzn.lookup(ItemId='B00FLIJJSA')
>>> p.ratings
[8, 4, 6, 4, 13]

可选绑定:

>>> p = amzn.lookup(ItemId='B000GRFTPS')
>>> p.alternatives
['B00IVM5X7E', '9163192993', '0899669433', 'B00IPXPQ9O', '1482998742', '0441444814', '1497344824']
>>> for asin in p.alternatives:
>>>     alt = amzn.lookup(ItemId=asin)
>>>     print(alt.title, alt.binding)
The King in Yellow Kindle Edition
The King in Yellow Unknown Binding
King in Yellow Hardcover
The Yellow Sign Audible Audio Edition
The King in Yellow MP3 CD
THE KING IN YELLOW Mass Market Paperback
The King in Yellow Paperback

无法通过API获得补充文本:

>>> p = amzn.lookup(ItemId='0441016685')
>>> p.supplemental_text
[u"Bob Howard is a computer-hacker desk jockey ... ", u"Lovecraft\'s Cthulhu meets Len Deighton\'s spies ... ", u"This dark, funny blend of SF and ... "]

审查API

查看评论列表:

>>> p = amzn.lookup(ItemId='B0051QVF7A')
>>> rs = p.reviews()
>>> rs.asin
B0051QVF7A
>>> # print the reviews on this first page
>>> rs.ids
['R3MF0NIRI3BT1E', 'R3N2XPJT4I1XTI', 'RWG7OQ5NMGUMW', 'R1FKKJWTJC4EAP', 'RR8NWZ0IXWX7K', 'R32AU655LW6HPU', 'R33XK7OO7TO68E', 'R3NJRC6XH88RBR', 'R21JS32BNNQ82O', 'R2C9KPSEH78IF7']
>>> rs.url
http://www.amazon.com/product-reviews/B0051QVF7A/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending
>>> # iterate over reviews on this page only
>>> for r in rs.brief_reviews:
>>>     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...
>>> # iterate over all brief reviews on all pages
>>> for r in rs:
>>>     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...

查看详细评论:

>>> rs = amzn.reviews(ItemId='B0051QVF7A')
>>> # this will iterate over all reviews on all pages
>>> # each review will require a download as it is on a seperate page
>>> for r in rs.full_reviews():
>>>     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...

将简短回顾转换为完整回顾:

>>> rs = amzn.reviews(ItemId='B0051QVF7A')
>>> # this will iterate over all reviews on all pages
>>> # each review will require a download as it is on a seperate page
>>> for r in rs:
>>>     print(r.id)
>>>     fr = r.full_review()
>>>     print(fr.id)

使用all_reviews属性快速获取审阅页上所有审阅的列表。 这将使用Review页面上提供的简短评论来避免单独下载每个评论。因此,一些信息 可能无法访问:

>>> p = amzn.lookup(ItemId='B0051QVF7A')
>>> rs = p.reviews()
>>> all_reviews_on_page = list(rs)
>>> len(all_reviews_on_page)
10
>>> r = all_reviews_on_page[0]
>>> r.title
'Fantastic device - pick your Kindle!'
>>> fr = r.full_review()
>>> fr.title
'Fantastic device - pick your Kindle!'

按ASin/itemID:

>>> rs = amzn.reviews(ItemId='B0051QVF7A')
>>> rs.asin
B0051QVF7A
>>> rs.ids
['R3MF0NIRI3BT1E', 'R3N2XPJT4I1XTI', 'RWG7OQ5NMGUMW', 'R1FKKJWTJC4EAP', 'RR8NWZ0IXWX7K', 'R32AU655LW6HPU', 'R33XK7OO7TO68E', 'R3NJRC6XH88RBR', 'R21JS32BNNQ82O', 'R2C9KPSEH78IF7']

对于个人评论,请使用review方法:

>>> review_id = 'R3MF0NIRI3BT1E'
>>> r = amzn.review(Id=review_id)
>>> r.id
R3MF0NIRI3BT1E
>>> r.asin
B00492CIC8
>>> r.url
http://www.amazon.com/review/R3MF0NIRI3BT1E
>>> r.date
2011-09-29 18:27:14+00:00
>>> r.author
FreeSpirit
>>> r.text
Having been a little overwhelmed by the choices between all the new Kindles ... <snip>

按URL:

>>> r = amzn.review(URL='http://www.amazon.com/review/R3MF0NIRI3BT1E')
>>> r.id
R3MF0NIRI3BT1E

用户评论API

此软件包还支持获取特定用户编写的评论。

获取单个作者创建的评论:

>>> ur = amzn.user_reviews(Id="A2W0GY64CJSV5D")
>>> ur.brief_reviews
>>> ur.name
>>> fr = list(ur.brief_reviews)[0].full_review()

从review对象获取用户的评论

>>> r = amzn.review(Id="R3MF0NIRI3BT1E")
>>> # we can get the reviews directly, or via the API with a URL or ID
>>> ur = r.user_reviews()
>>> ur = amzn.user_reviews(URL=r.author_reviews_url)
>>> ur = amzn.user_reviews(Id=r.author_id)
>>> ur.brief_reviews
>>> ur.name

重复当前页面的评论:

>>> ur = amzn.user_reviews(Id="A2W0GY64CJSV5D")
>>> for r in ur.brief_reviews:
>>>     print(r.id)

遍历所有作者评论:

>>> ur = amzn.user_reviews(Id="A2W0GY64CJSV5D")
>>> for r in ur:
>>>     print(r.id)

欢迎加入QQ群-->: 979659372 Python中文网_新手群

推荐PyPI第三方库


热门话题
在Java中从本地文件系统导入文件   spring boot如何在Java SpringBoot项目中集成Olingo(Odata)   java查找连续数组中缺少的第k个元素(超过时间限制)   java为什么在mySql中插入1/2行时会得到2/4行   java不能在静态上下文中使用它   File Observer方法的java My onEvent()部分不起作用   java Netty NioSocketChannel在多线程写入时收到中断消息   java将文件夹与父文件夹一起复制   java我的TictaToe代码出了什么问题?如何检查已采取的措施?   java Swing JTable更新   java如何将cordinates查找为int   如何使用selenium和java在firefox中打开新的空选项卡   java Gradle构建输出Jar未运行   java没有GET/WEBINF/jsp/login的映射。jsp