Python - Avoiding scraping data from pages already scraped
Good evening all,

I am still working on my spider to scrape data from news sites, and I have run into another problem. My original question, posted here: Scrapy outputs [ into my .json file, has been solved.

I have managed to get a little further, having had to make allowances for empty items and add search functionality, and I am now trying to scrape only the articles I have not yet scraped (bearing in mind that I may still want to extract links from them). I can't figure out where to put the code that will:

a.) define when the last crawl was done, and
b.) compare the date of the article with the date of the last crawl.

I may just be struggling with the logic, so I turn to you. A rough sketch of what I think I need follows below.
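In case it helps, here is a standalone sketch of the two pieces I am after; the file name, date format, and helper names are my own guesses rather than anything from my working code:

import os
import time
from datetime import datetime

LAST_CRAWL_FILE = "last_crawl.txt"  # hypothetical file used to persist the timestamp
DATE_FORMAT = "%Y-%m-%d %H:%M"      # assumes the site prints dates this way

def load_last_crawled():
    # a.) Work out when the last crawl was done by reading the timestamp
    # saved at the end of the previous run; fall back to a date far in
    # the past on the first ever run
    if os.path.exists(LAST_CRAWL_FILE):
        with open(LAST_CRAWL_FILE) as f:
            return datetime.strptime(f.read().strip(), DATE_FORMAT)
    return datetime(1970, 1, 1)

def save_last_crawled():
    # Record the time of the current crawl for the next run to compare against
    with open(LAST_CRAWL_FILE, "w") as f:
        f.write(time.strftime(DATE_FORMAT))

def is_new(article_date_text, last_crawled):
    # b.) Compare the date of the article with the date of the last crawl
    return datetime.strptime(article_date_text.strip(), DATE_FORMAT) > last_crawled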
My spider:
# Tabbing in Python is apparently important so be aware, and make sure
# things line up as they should

# Import the CrawlSpider class, along with its Rules, (this lets us
# recursively crawl pages)
from scrapy.contrib.spiders import CrawlSpider, Rule
# Import the link extractor, this extracts links from pages
from scrapy.contrib.linkextractors import LinkExtractor
# Import our items as defined in items.py
from basic.items import BasicItem
# Import datetime so we can get the current date and time
import time
# Import re which allows us to compare strings
import re

# Create a new spider with the CrawlSpider class
class BasicSpiderSpider(CrawlSpider):

    # Name of the spider, used to run it, (i.e. scrapy crawl basic_spider)
    name = "basic_spider"

    # Domains the spider is allowed to crawl over
    allowed_domains = ["news24.com"]

    # Where to start crawling
    start_urls = [
        'http://www.news24.com',
    ]

    # Rules for the link extractor, (i.e. what links it's allowed to extract,
    # what to do once it's found them, and whether it's allowed to follow them)
    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),)

    # Defining the callback function
    def parse_items(self, response):
        # Defines the top level XPath where all of our information can be
        # found, needs to be as specific as possible to avoid duplicates
        for title in response.xpath('//*[@id="aspnetForm"]'):

            # List of keywords to search through
            key = re.compile("joburg|durban", re.IGNORECASE)

            # Extracting the data to compare with the keywords,
            # the headlines, join converts the list type to a string type
            headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            head = ''.join(headlist)

            # and the article
            artlist = title.xpath('//*[@id="article-body"]//text()').extract()
            art = ''.join(artlist)

            # If the keywords are found in the headline:
            if key.search(head):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    # Defines the top level XPath again because Python won't
                    # see outside of its current function
                    for thing in response.xpath('//*[@id="aspnetForm"]'):

                        # Fills the items defined in items.py with the relevant data
                        item = BasicItem()
                        item['headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["link"] = response.url

                        # I found I was being careful with my XPaths but
                        # still got empty fields and lines, the below fixes that
                        if item['headline']:
                            if item["article"]:
                                if item["date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item

            # If the headline item doesn't match, check the article item
            elif key.search(art):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    for thing in response.xpath('//*[@id="aspnetForm"]'):
                        item = BasicItem()
                        item['headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["link"] = response.url
                        if item['headline']:
                            if item["article"]:
                                if item["date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item
It's not working, as mentioned, and I'm skeptical of the logic anyway. Can anyone let me know if I'm on the right track here?

Thanks again for the help.
You seem to be using last_crawled out of context. Don't bother with it; you are better off using the DeltaFetch middleware, which was created for exactly what you are trying to do:

This is a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.
To use DeltaFetch, install scrapylib first:

pip install scrapylib
and after that, enable it in your settings.py:

SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}

DELTAFETCH_ENABLED = True
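For completeness, here is a minimal sketch (mine, not from the DeltaFetch docs) of how the middleware behaves: it records a key for every request whose response yielded items and drops those requests on later runs. By default the key is the request fingerprint, but DeltaFetch also honours a deltafetch_key in the request meta, so a spider like the hypothetical one below can key seen pages by URL instead:

from urlparse import urljoin  # Python 2, to match the scrapy.contrib imports above

import scrapy
from basic.items import BasicItem

class DeltaSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = "delta_spider"
    allowed_domains = ["news24.com"]
    start_urls = ['http://www.news24.com']

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            url = urljoin(response.url, href)
            yield scrapy.Request(
                url,
                callback=self.parse_article,
                # Optional: key seen pages by URL instead of the default
                # request fingerprint
                meta={'deltafetch_key': url},
            )

    def parse_article(self, response):
        # Once this callback yields an item, DeltaFetch remembers the key
        # and silently skips this request on future crawls
        item = BasicItem()
        item['link'] = response.url
        yield item

On the first run everything is scraped as normal; on later runs only pages that never produced an item are requested again, which is exactly the "only articles I have not yet scraped" behaviour you were trying to build by hand.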