python - Amazon web scraping
I'm trying to scrape Amazon prices with PhantomJS and Python. I want to parse the new and used prices for books with Beautiful Soup. The problem is that when I pass the page source from PhantomJS to Beautiful Soup, the prices are all 0,00. The code below is a simple test.
I'm new to web scraping, so I don't understand whether Amazon has measures in place to prevent scraping of prices or whether I'm doing something wrong. I have tried other, simpler pages and I can get the data I want from them.
PS: I'm in a country that isn't supported by the Amazon API, which is why the scraper is necessary.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=utf8&condition=new'  # 'http://www.amazon.com/gp/product/1119998956'


    class AmazonScraper(object):
        def __init__(self):
            # Needs a Selenium release that still ships the PhantomJS driver.
            self.driver = webdriver.PhantomJS()
            self.driver.set_window_size(1120, 550)

        def scrape_prices(self):
            # Load the offer listing and hand the rendered HTML to Beautiful Soup.
            self.driver.get(link)
            s = BeautifulSoup(self.driver.page_source, "html.parser")
            return s

        def scrape(self):
            source = self.scrape_prices()
            print(source)
            self.driver.quit()


    if __name__ == '__main__':
        scraper = AmazonScraper()
        scraper.scrape()
First of all, follow @Nick Bailey's comment: study the Terms of Use and make sure there are no violations on your side.
To solve it, you need to tweak the PhantomJS desired capabilities:
    # Present a regular browser user agent instead of the PhantomJS default.
    caps = webdriver.DesiredCapabilities.PHANTOMJS
    caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87"
    self.driver = webdriver.PhantomJS(desired_capabilities=caps)
    self.driver.maximize_window()

And, to make it bullet-proof, you can write a custom Expected Condition and wait for the price to become non-zero:
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait


    class wait_for_price(object):
        """Expected condition: the located element shows a non-zero price."""

        def __init__(self, locator):
            self.locator = locator

        def __call__(self, driver):
            try:
                # locator is a (By, value) tuple, e.g. (By.CLASS_NAME, "olpOfferPrice")
                element_text = driver.find_element(*self.locator).text.strip()
                return element_text != "0,00"
            except StaleElementReferenceException:
                # The element was re-rendered while being read; retry on the next poll.
                return False

Usage:
    def scrape_prices(self):
        self.driver.get(link)
        # Wait up to 200 seconds for the first offer price to become non-zero.
        WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice")))
        s = BeautifulSoup(self.driver.page_source, "html.parser")
        return s
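The condition above compares the element text with the literal string "0,00", so a placeholder rendered as "$0.00" or with a currency prefix would slip through. A minimal sketch of a stricter check follows; `parse_price` and `is_real_price` are hypothetical helper names that are not part of Selenium or Beautiful Soup, and the sketch assumes prices carry at most two digits after the decimal separator.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Return the first price found in `text` as a Decimal, or None.

    Assumption: at most two digits follow the decimal separator,
    as in "$1,234.56", "EUR 12,34" or "0,00".
    """
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    number = match.group(0).rstrip(".,")
    # A trailing ",dd" or ".dd" is the decimal part; any other
    # separator is a thousands separator and is dropped.
    decimal_match = re.match(r"^(.*)[.,](\d{1,2})$", number)
    if decimal_match:
        whole, fraction = decimal_match.groups()
    else:
        whole, fraction = number, "0"
    whole = re.sub(r"[.,]", "", whole) or "0"
    try:
        return Decimal(whole + "." + fraction)
    except InvalidOperation:
        return None


def is_real_price(text):
    """True when `text` contains a price greater than zero."""
    price = parse_price(text)
    return price is not None and price > 0
```

With these helpers, `wait_for_price.__call__` could return `is_real_price(element_text)` instead of `element_text != "0,00"`, and the same `parse_price` could be applied to the text of each price element found in the Beautiful Soup tree.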