python - Amazon web scraping

August 15, 2011

I'm trying to scrape Amazon prices with PhantomJS and Python. I want to parse the page with Beautiful Soup to get the new and used prices for books. The problem is that when I pass the page source from the PhantomJS request to Beautiful Soup, all the prices are 0,00. The code below is a simple test.

I'm new to web scraping, so I can't tell whether Amazon has measures in place to prevent price scraping or whether I'm doing something wrong. When I try the same approach on other, simpler pages, I can get the data I want.

PS: I'm in a country that is not supported by the Amazon API, which is why a scraper is necessary.

    from selenium import webdriver
    from bs4 import BeautifulSoup

    link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=utf8&condition=new'#'http://www.amazon.com/gp/product/1119998956'

    class AmazonScraper(object):
        def __init__(self):
            # Headless browser driven through Selenium
            self.driver = webdriver.PhantomJS()
            self.driver.set_window_size(1120, 550)

        def scrape_prices(self):
            # Load the offer listing and hand the rendered HTML to Beautiful Soup
            self.driver.get(link)
            s = BeautifulSoup(self.driver.page_source, "html.parser")
            return s

        def scrape(self):
            source = self.scrape_prices()
            print(source)
            self.driver.quit()

    if __name__ == '__main__':
        scraper = AmazonScraper()
        scraper.scrape()

First of all, follow @Nick Bailey's comment: study the Terms of Use and make sure there are no violations on your side.

To solve it, you need to tweak the PhantomJS desired capabilities:

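    # Copy the defaults so the shared DesiredCapabilities dict is not modified
    caps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
    # Present a regular desktop browser user agent instead of the PhantomJS default
    caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87"

    self.driver = webdriver.PhantomJS(desired_capabilities=caps)
    self.driver.maximize_window()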

And, to make it bullet-proof, you can write a custom Expected Condition and wait for the price to become non-zero:

    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    class wait_for_price(object):
        """Expected Condition: true once the located element no longer shows 0,00."""

        def __init__(self, locator):
            self.locator = locator  # a (By, value) tuple

        def __call__(self, driver):
            try:
                element_text = driver.find_element(*self.locator).text.strip()
                return element_text != "0,00"
            except StaleElementReferenceException:
                # The element was re-rendered between lookup and read; retry on the next poll
                return False
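The comparison against the literal "0,00" depends on how the page formats the placeholder. If the price might also show up as "$0.00" or with a currency prefix, a more tolerant check could parse the number first. The following is only a sketch using the standard library; parse_price and is_nonzero_price are hypothetical helper names, and the rule "a separator followed by exactly two digits is the decimal separator" is an assumption that will not hold for every currency.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first number in `text` as a Decimal, or None if there is none."""
    match = re.search(r"\d[\d.,]*\d|\d", text)
    if match is None:
        return None
    number = match.group(0)
    # Assumption: the last '.' or ',' is a decimal separator only when
    # exactly two digits follow it; any other separator groups thousands.
    last_sep = max(number.rfind("."), number.rfind(","))
    if last_sep != -1 and len(number) - last_sep - 1 == 2:
        whole, frac = number[:last_sep], number[last_sep + 1:]
    else:
        whole, frac = number, ""
    whole = re.sub(r"[.,]", "", whole)
    return Decimal(whole + "." + frac) if frac else Decimal(whole)


def is_nonzero_price(text):
    """True if `text` contains a price greater than zero."""
    price = parse_price(text)
    return price is not None and price > 0
```

Inside wait_for_price.__call__, the string comparison could then be swapped for is_nonzero_price(element_text), if that behaviour suits the page you are scraping.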

Usage of wait_for_price:

    def scrape_prices(self):
        self.driver.get(link)

        # Block (up to 200 seconds) until the first offer price is no longer 0,00
        WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice")))
        s = BeautifulSoup(self.driver.page_source, "html.parser")

        return s
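If it helps to see why a callable class works as a condition: WebDriverWait(...).until(...) repeatedly calls the condition with the driver until it returns something truthy or the timeout expires. The loop below is a rough stand-alone illustration of that polling idea, not Selenium's actual code. wait_until, WaitTimeout, FakePage and price_is_loaded are made-up names for the sketch, and FakePage just stands in for a page whose price fills in after a couple of polls.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy in time."""


def wait_until(condition, target, timeout=5.0, poll_interval=0.5, ignored=()):
    """Call condition(target) until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition(target)
            if result:
                return result
        except ignored:
            pass  # treat the listed exceptions as "not ready yet"
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)


class FakePage(object):
    """Stand-in for a page: returns each text in turn, then repeats the last."""

    def __init__(self, texts):
        self._texts = list(texts)

    def price_text(self):
        if len(self._texts) > 1:
            return self._texts.pop(0)
        return self._texts[0]


def price_is_loaded(page):
    return page.price_text().strip() != "0,00"


if __name__ == "__main__":
    page = FakePage(["0,00", "0,00", "$12.99"])
    print(wait_until(price_is_loaded, page, timeout=2, poll_interval=0.01))
```

The real WebDriverWait also lets you pass a poll frequency and a list of exceptions to ignore, which is roughly what poll_interval and ignored mimic here.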
