perl - How do I extract Amazon reviews from HTML?

July 15, 2011

I have been trying to write a Perl script to scrape Amazon and download product reviews, but I have been unable to do so. I have been using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath to achieve this.

For the HTML:

    <div id="revdata-dpreviewsmosthelpfulaui-r1gqhd9gmgbdxp" class="a-row a-spacing-small">
      <span class="a-size-mini a-color-state a-text-bold">
        verified purchase
      </span>
      <div class="a-section">
        bought replace earlier model got lost in transit when moved. real handy helper have when making tortillas. follow recipe flour tortillas in little recipe book comes it. make few changes
      </div>
    </div>
    </div>
    </div>

I wanted to extract the product review text. I wrote:

    use LWP::Simple;

    #use HTML::TreeBuilder;
    use HTML::TreeBuilder::XPath;

    # Take the ASIN from the command line.
    $asin = shift @ARGV or die "usage: perl get_reviews.pl <asin>\n";

    # Assemble the URL from the passed ASIN.
    $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

    # Set up unescape-HTML rules. Quicker than URI::Escape.
    %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
    $unescape_re = join '|' => keys %unescape;

    # Request the URL.
    $content = get($url);
    die "could not retrieve $url" unless $content;
    $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
    @data = $tree->findvalues('div[@class ="a-section"]');

    foreach (@data) {
        print "$_\n";
    }

But I am not getting any output. Can someone please point out my mistake?

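I think the XPath should be '//div[@class ="a-section"]' (the extra // at the beginning of the expression finds the div anywhere in the HTML). Without it, the relative path div[...] is evaluated against the node that findvalues is called on, here the root of the tree, so it only considers div elements that are direct children of that node.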
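As a rough illustration of that suggestion, here is a minimal sketch that runs the corrected XPath against a shortened copy of the markup from the question instead of a live Amazon page. The helper name extract_reviews is hypothetical and only there for the example. It assumes the review text still sits in a div whose class attribute is exactly "a-section", as in the snippet above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened sample markup based on the question. In the real script this
# string would be the page body returned by LWP::Simple's get($url).
my $html = <<'HTML';
<div id="revdata-dpreviewsmosthelpfulaui-r1gqhd9gmgbdxp" class="a-row a-spacing-small">
  <span class="a-size-mini a-color-state a-text-bold">verified purchase</span>
  <div class="a-section">bought replace earlier model got lost in transit when moved.</div>
</div>
HTML

# extract_reviews is a hypothetical helper name, not part of any module.
sub extract_reviews {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

    # The leading // matches div elements at any depth in the document,
    # not only direct children of the node the query starts from.
    my @reviews = $tree->findvalues('//div[@class="a-section"]');

    $tree->delete;    # release the parse tree once the text is extracted

    # Trim leading and trailing whitespace from each extracted value.
    s/^\s+|\s+$//g for @reviews;
    return @reviews;
}

print "$_\n" for extract_reviews($html);
```

Keeping the extraction in a sub that takes the HTML as a string makes it easy to check the XPath against saved markup before pointing the script at a fetched page.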
