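Provides content not accessible through the standard Amazon API
Detailed description of the amazon_scraper Python project
A hybrid web scraper / API client. It supplements the standard Amazon API with web scraping to get extra data, specifically product reviews.
It uses the Amazon Simple Product API to provide API-accessible data. The API search functions are imported directly into the amazon_scraper module.
Parameters follow the same style as the Amazon Simple Product API, which in turn uses Bottlenose-style parameters. Hence the non-Pythonic parameter names (ItemId).
The AmazonScraper constructor passes 'args' and 'kwargs' to Bottlenose (via the Amazon Simple Product API). Bottlenose supports AWS regions, queries-per-second limiting, query caching, and other nice features. See the Bottlenose API for more information.
The latest version of python-amazon-simple-product-api (1.5.0 at the time of writing) does not support these arguments, only Region. If you need them, install the latest code from its repository with the following command: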
pip install git+https://github.com/yoavaviram/python-amazon-simple-product-api.git#egg=python-amazon-simple-product-api
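Caveat
Amazon continually tries to keep scrapers from working. They do this by:
- A/B testing (you randomly receive different HTML).
- Using huge numbers of HTML layouts for the same product categories.
- Changing HTML layouts.
- Moving content inside iframes.
Amazon has been moving more and more content into iframes, which this scraper cannot handle. I envisage a time when most data will be inaccessible without more complex logic.
I have spent a long time trying to get these scrapers working, and it is a never-ending battle. I do not have the time to keep up with Amazon. If you are interested in improving Amazon Scraper, please let me know (creating an issue is fine). Any help is appreciated.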
Installation
pip install amazon_scraper
Example
All Products All The Time
Create an API instance:
>>> from amazon_scraper import AmazonScraper
>>> amzn = AmazonScraper("put your access key", "secret key", "and associate tag here")
The constructor accepts 'kwargs' and passes them to the 'bottlenose.Amazon' constructor:
>>> from amazon_scraper import AmazonScraper
>>> amzn = AmazonScraper("put your access key", "secret key", "and associate tag here", Region='UK', MaxQPS=0.9, Timeout=5.0)
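To make the MaxQPS argument more concrete, here is a minimal sketch of what a queries-per-second cap means in practice: MaxQPS=0.9 implies roughly 1.11 seconds between requests. This is not Bottlenose's implementation. The Throttle class and its clock and sleep parameters are hypothetical names used only for illustration.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls, given a queries-per-second cap.

    Hypothetical helper for illustration; not part of Bottlenose or amazon_scraper.
    """

    def __init__(self, max_qps, clock=time.monotonic, sleep=time.sleep):
        if max_qps <= 0:
            raise ValueError("max_qps must be positive")
        # A cap of N queries per second means at least 1/N seconds between queries.
        self.min_interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling wait() before each request keeps the request rate under the cap. The clock and sleep functions are injectable so the behaviour can be checked without real delays.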
Search:
>>> from __future__ import print_function
>>> import itertools
>>> for p in itertools.islice(amzn.search(Keywords='python', SearchIndex='Books'), 5):
...     print(p.title)
Learning Python, 5th Edition
Python Programming: An Introduction to Computer Science 2nd Edition
Python In A Day: Learn The Basics, Learn It Quick, Start Coding Fast (In A Day Books) (Volume 1)
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
Python Cookbook
Look up by ASIN/ItemId:
>>> p = amzn.lookup(ItemId='B00FLIJJSA')
>>> p.title
Kindle, Wi-Fi, 6" E Ink Display - for international shipment
>>> p.url
http://www.amazon.com/Kindle-Wi-Fi-Ink-Display-international/dp/B0051QVF7A/ref=cm_cr_pr_product_top
Batch lookups:
>>> for p in amzn.lookup(ItemId='B0051QVF7A,B007HCCNJU,B00BTI6HBS'):
...     print(p.title)
Kindle, Wi-Fi, 6" E Ink Display - for international shipment
Kindle, 6" E Ink Display, Wi-Fi - Includes Special Offers (Black)
Kindle Paperwhite 3G, 6" High Resolution Display with Next-Gen Built-in Light, Free 3G + Wi-Fi - Includes Special Offers
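Batch lookups take a single comma-separated ItemId string. When starting from a Python list of ASINs, a small helper can build those strings. The sketch below is not part of amazon_scraper; batch_item_ids is a hypothetical name, and the default batch size of 10 is an assumption about the underlying API's per-request limit that should be checked against the Amazon documentation.

```python
def batch_item_ids(asins, batch_size=10):
    """Yield comma-separated ItemId strings holding at most batch_size ASINs each.

    Hypothetical helper; batch_size=10 is an assumed per-request limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    asins = list(asins)
    for start in range(0, len(asins), batch_size):
        yield ",".join(asins[start:start + batch_size])
```

Each yielded string can then be passed as the ItemId argument. Note that the single-ID lookup shown earlier returns one product rather than an iterable, so a batch that happens to contain a single ASIN may need separate handling.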
By URL:
>>> p = amzn.lookup(URL='http://www.amazon.com/Kindle-Wi-Fi-Ink-Display-international/dp/B0051QVF7A/ref=cm_cr_pr_product_top')
>>> p.title
Kindle, Wi-Fi, 6" E Ink Display - for international shipment
>>> p.asin
B0051QVF7A
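For readers who only need the ASIN and not a full lookup, it can often be read straight out of a product URL. The following is a rough sketch and not how amazon_scraper resolves URLs. The asin_from_url name and the set of URL shapes it recognises are assumptions based on the URLs shown in this document.

```python
import re

# Assumed URL shapes: .../dp/<ASIN>, .../gp/product/<ASIN>, .../product-reviews/<ASIN>
_ASIN_IN_URL = re.compile(
    r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})(?:[/?#]|$)"
)


def asin_from_url(url):
    """Return the 10-character ASIN embedded in a product URL, or None if absent."""
    match = _ASIN_IN_URL.search(url)
    return match.group(1) if match else None
```

Because Amazon changes its URL and page layouts frequently, a pattern like this should be treated as best effort and is expected to miss some URLs.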
Product ratings:
>>> p = amzn.lookup(ItemId='B00FLIJJSA')
>>> p.ratings
[8, 4, 6, 4, 13]
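A list of per-star counts like this can be reduced to a review total and a mean star rating. The sketch below assumes the list holds five counts ordered from one star to five stars. That ordering is not stated in this document, so it is exposed as a parameter; summarize_ratings is a hypothetical helper, not part of amazon_scraper.

```python
def summarize_ratings(counts, lowest_first=True):
    """Return (total_reviews, mean_stars) for a list of five per-star counts.

    lowest_first=True assumes counts[0] is the one-star count (an assumption);
    pass False if the list runs from five stars down to one.
    """
    if len(counts) != 5:
        raise ValueError("expected exactly five per-star counts")
    stars = range(1, 6) if lowest_first else range(5, 0, -1)
    total = sum(counts)
    if total == 0:
        return 0, None  # no reviews, so no meaningful mean
    weighted = sum(star * count for star, count in zip(stars, counts))
    return total, float(weighted) / total
```

With the counts shown above, the total is 35 reviews under either ordering, while the mean depends on which ordering is correct.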
Alternative bindings:
>>> p = amzn.lookup(ItemId='B000GRFTPS')
>>> p.alternatives
['B00IVM5X7E', '9163192993', '0899669433', 'B00IPXPQ9O', '1482998742', '0441444814', '1497344824']
>>> for asin in p.alternatives:
...     alt = amzn.lookup(ItemId=asin)
...     print(alt.title, alt.binding)
The King in Yellow Kindle Edition
The King in Yellow Unknown Binding
King in Yellow Hardcover
The Yellow Sign Audible Audio Edition
The King in Yellow MP3 CD
THE KING IN YELLOW Mass Market Paperback
The King in Yellow Paperback
Supplemental text not available via the API:
>>> p = amzn.lookup(ItemId='0441016685')
>>> p.supplemental_text
[u"Bob Howard is a computer-hacker desk jockey ... ", u"Lovecraft\'s Cthulhu meets Len Deighton\'s spies ... ", u"This dark, funny blend of SF and ... "]
Review API
View lists of reviews:
>>> p = amzn.lookup(ItemId='B0051QVF7A')
>>> rs = p.reviews()
>>> rs.asin
B0051QVF7A
>>> # print the reviews on this first page
>>> rs.ids
['R3MF0NIRI3BT1E', 'R3N2XPJT4I1XTI', 'RWG7OQ5NMGUMW', 'R1FKKJWTJC4EAP', 'RR8NWZ0IXWX7K', 'R32AU655LW6HPU', 'R33XK7OO7TO68E', 'R3NJRC6XH88RBR', 'R21JS32BNNQ82O', 'R2C9KPSEH78IF7']
>>> rs.url
http://www.amazon.com/product-reviews/B0051QVF7A/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending
>>> # iterate over reviews on this page only
>>> for r in rs.brief_reviews:
...     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...
>>> # iterate over all brief reviews on all pages
>>> for r in rs:
...     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...
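The difference between iterating one page and iterating all pages can be pictured as a lazy generator that only downloads the next page once the current one is exhausted. The sketch below is a generic illustration, not amazon_scraper's actual implementation. The page dictionaries with "items" and "next_url" keys and the fetch_page callable are all hypothetical.

```python
def iter_all_pages(first_page, fetch_page):
    """Yield items from first_page and every following page, fetching lazily.

    Pages are modelled as dicts with hypothetical keys:
      "items"    - the entries on that page
      "next_url" - URL of the next page, or None on the last page
    fetch_page is a caller-supplied function mapping a URL to a page dict.
    """
    page = first_page
    while page is not None:
        for item in page["items"]:
            yield item
        next_url = page.get("next_url")
        # The next download happens only when the caller asks for more items.
        page = fetch_page(next_url) if next_url else None
```

Combined with itertools.islice, as in the search example, a generator like this lets a caller stop early without downloading the remaining pages.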
View detailed reviews:
>>> rs = amzn.reviews(ItemId='B0051QVF7A')
>>> # this will iterate over all reviews on all pages
>>> # each review will require a download as it is on a separate page
>>> for r in rs.full_reviews():
...     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...
Convert a brief review to a full review:
>>> rs = amzn.reviews(ItemId='B0051QVF7A')
>>> # this will iterate over all reviews on all pages
>>> # each review will require a download as it is on a separate page
>>> for r in rs:
...     print(r.id)
...     fr = r.full_review()
...     print(fr.id)
Use the all_reviews property to quickly get a list of all reviews on a review page. This uses the brief reviews provided on the review page to avoid downloading each review separately. As a result, some information may not be accessible:
>>> p = amzn.lookup(ItemId='B0051QVF7A')
>>> rs = p.reviews()
>>> all_reviews_on_page = list(rs)
>>> len(all_reviews_on_page)
10
>>> r = all_reviews_on_page[0]
>>> r.title
'Fantastic device - pick your Kindle!'
>>> fr = r.full_review()
>>> fr.title
'Fantastic device - pick your Kindle!'
By ASIN/ItemId:
>>> rs = amzn.reviews(ItemId='B0051QVF7A')
>>> rs.asin
B0051QVF7A
>>> rs.ids
['R3MF0NIRI3BT1E', 'R3N2XPJT4I1XTI', 'RWG7OQ5NMGUMW', 'R1FKKJWTJC4EAP', 'RR8NWZ0IXWX7K', 'R32AU655LW6HPU', 'R33XK7OO7TO68E', 'R3NJRC6XH88RBR', 'R21JS32BNNQ82O', 'R2C9KPSEH78IF7']
For individual reviews, use the review method:
>>> review_id = 'R3MF0NIRI3BT1E'
>>> r = amzn.review(Id=review_id)
>>> r.id
R3MF0NIRI3BT1E
>>> r.asin
B00492CIC8
>>> r.url
http://www.amazon.com/review/R3MF0NIRI3BT1E
>>> r.date
2011-09-29 18:27:14+00:00
>>> r.author
FreeSpirit
>>> r.text
Having been a little overwhelmed by the choices between all the new Kindles ... <snip>
By URL:
>>> r = amzn.review(URL='http://www.amazon.com/review/R3MF0NIRI3BT1E')
>>> r.id
R3MF0NIRI3BT1E
User Reviews API
This package also supports getting the reviews written by a specific user.
Get the reviews that a single author has created:
>>> ur = amzn.user_reviews(Id="A2W0GY64CJSV5D")
>>> ur.brief_reviews
>>> ur.name
>>> fr = list(ur.brief_reviews)[0].full_review()
Get the reviews for a user from a review object:
>>> r = amzn.review(Id="R3MF0NIRI3BT1E")
>>> # we can get the reviews directly, or via the API with a URL or ID
>>> ur = r.user_reviews()
>>> ur = amzn.user_reviews(URL=r.author_reviews_url)
>>> ur = amzn.user_reviews(Id=r.author_id)
>>> ur.brief_reviews
>>> ur.name
Iterate over the current page's reviews:
>>> ur = amzn.user_reviews(Id="A2W0GY64CJSV5D")
>>> for r in ur.brief_reviews:
...     print(r.id)
Iterate over all of the author's reviews:
>>> ur = amzn.user_reviews(Id="A2W0GY64CJSV5D")
>>> for r in ur:
...     print(r.id)