python - Avoiding scraping data from pages already scraped


Good evening all,

I am still working on my spider to scrape data from news sites and have run into a problem. My original question, posted here (Scrapy outputs [ into my .json file), has been solved.

I have managed to get a little further, having had to make allowances for empty items and having added the search functionality. I am now trying to scrape only articles I have not yet scraped (bearing in mind I may still want to extract links from them), but I can't figure out where to put the code that will:

a.) define when the last crawl was done
b.) compare the date of the article with the date of the last crawl.

I may just be struggling with the logic, so I turn to you.
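For what it's worth, here is roughly what I imagine (a) and (b) looking like, persisting the timestamp of the last crawl to a file between runs. This is only a sketch and untested; the file name is made up, and the date format is a guess on my part, since I haven't confirmed what news24 actually puts in its spndate element:

import os
from datetime import datetime

LAST_CRAWL_FILE = "last_crawl.txt"   # made-up file name for persisting the timestamp
DATE_FORMAT = "%Y-%m-%d %H:%M"       # assumed format; the site's real format may differ

def get_last_crawled():
    # (a) Read the time of the last crawl from disk; if we have never
    # crawled before, fall back to the earliest date possible.
    if os.path.exists(LAST_CRAWL_FILE):
        with open(LAST_CRAWL_FILE) as f:
            return datetime.strptime(f.read().strip(), DATE_FORMAT)
    return datetime.min

def save_last_crawled():
    # Record when this crawl ran, for the next run to compare against.
    with open(LAST_CRAWL_FILE, "w") as f:
        f.write(datetime.now().strftime(DATE_FORMAT))

def is_new(article_date_string, last_crawled):
    # (b) Parse the article's date string and compare datetimes, rather
    # than comparing raw strings (or the list that .extract() returns).
    article_date = datetime.strptime(article_date_string.strip(), DATE_FORMAT)
    return article_date > last_crawled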

My spider:

# Tabbing in Python is apparently important, so be aware and make sure
# things line up as they should

# Import the CrawlSpider class, along with its Rule, (this lets us recursively
# crawl pages)

from scrapy.contrib.spiders import CrawlSpider, Rule

# Import the link extractor, this extracts links from pages

from scrapy.contrib.linkextractors import LinkExtractor

# Import our items as defined in items.py

from basic.items import BasicItem

# Import time so we can get the current date and time

import time

# Import re which allows us to compare strings

import re

# Create a new spider with the CrawlSpider class

class BasicSpiderSpider(CrawlSpider):

    # Name of the spider, used to run it (i.e. scrapy crawl basic_spider)
    name = "basic_spider"

    # Domains the spider is allowed to crawl over
    allowed_domains = ["news24.com"]

    # Where to start crawling
    start_urls = [
        'http://www.news24.com',
    ]

    # Rules for the link extractor (i.e. which links it's allowed to extract,
    # what to do once it's found them, and whether it's allowed to follow them)
    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),)

    # Defining the callback function
    def parse_items(self, response):

        # Defines the top-level XPath where our information can be found; it
        # needs to be as specific as possible to avoid duplicates
        for title in response.xpath('//*[@id="aspnetform"]'):

            # List of keywords to search through
            key = re.compile("joburg|durban", re.IGNORECASE)

            # Extracting the data to compare with the keywords:
            # the headlines (join converts the list type to a string type)
            headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            head = ''.join(headlist)

            # ...and the article
            artlist = title.xpath('//*[@id="article-body"]//text()').extract()
            art = ''.join(artlist)

            # If the keywords are found in the headline:
            if key.search(head):
                if last_crawled > response.xpath('//*[@id="spndate"]/text()').extract():
                    # Define the top-level XPath again as Python won't look
                    # outside its current function
                    for thing in response.xpath('//*[@id="aspnetform"]'):

                        # Fill the items defined in items.py with the relevant data
                        item = BasicItem()
                        item['headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["date"] = thing.xpath('//*[@id="spndate"]/text()').extract()
                        item["link"] = response.url

                        # I found that even being careful with the XPaths I
                        # still got empty fields and lines; the below fixes that
                        if item['headline']:
                            if item["article"]:
                                if item["date"]:
                                    last_crawled = time.strftime("%Y-%m-%d %H:%M")
                                    yield item

            # If the headline item doesn't match, check the article item instead
            elif key.search(art):
                if last_crawled > response.xpath('//*[@id="spndate"]/text()').extract():
                    for thing in response.xpath('//*[@id="aspnetform"]'):
                        item = BasicItem()
                        item['headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["date"] = thing.xpath('//*[@id="spndate"]/text()').extract()
                        item["link"] = response.url

                        if item['headline']:
                            if item["article"]:
                                if item["date"]:
                                    last_crawled = time.strftime("%Y-%m-%d %H:%M")
                                    yield item
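(As an aside, I notice the headline branch and the article branch above are identical, so, ignoring the last_crawled check for a moment, I think they could collapse into a single branch. A rough, untested restructuring of just the callback:)

    def parse_items(self, response):
        key = re.compile("joburg|durban", re.IGNORECASE)
        head = ''.join(response.xpath('//*[@id="article_special"]//h1/text()').extract())
        art = ''.join(response.xpath('//*[@id="article-body"]//text()').extract())

        # One branch instead of two identical ones
        if key.search(head) or key.search(art):
            item = BasicItem()
            item['headline'] = response.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["article"] = response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["date"] = response.xpath('//*[@id="spndate"]/text()').extract()
            item["link"] = response.url

            # Only yield fully populated items
            if item['headline'] and item["article"] and item["date"]:
                yield item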

It's not working, as mentioned, and I'm skeptical of my logic anyway, so can you let me know if I'm on the right track here?

Thanks again for the help.

You seem to be using last_crawled out of context: you compare against it before it has ever been assigned (so the spider will raise a NameError on the first page), and even once assigned you would be comparing a string to the list that .extract() returns. Don't bother with it; you are better off using the deltafetch middleware, which was created for exactly what you are trying to do:

This is a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.

To use DeltaFetch, install scrapylib first:

pip install scrapylib 

And after that, enable it in settings.py:

SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}

DELTAFETCH_ENABLED = True
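Two details worth knowing, from memory of the scrapylib source, so double-check them against the version you install: DeltaFetch records a key for every request whose response produced items, using the request fingerprint by default, and you can override that key with a deltafetch_key entry in the request's meta; you can also reset the stored state to force a full recrawl. A sketch of both (keying by article URL is just an illustrative choice):

# Keying the "seen" record by article URL instead of the default
# request fingerprint (illustrative; verify deltafetch_key against
# your installed scrapylib version).
import urlparse
from scrapy.http import Request

def parse_section(self, response):
    for href in response.xpath('//a/@href').extract():
        url = urlparse.urljoin(response.url, href)
        yield Request(url, callback=self.parse_items,
                      meta={'deltafetch_key': url})

# To wipe the stored state and recrawl everything once:
#     scrapy crawl basic_spider -a deltafetch_reset=1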
