Optimize Nutch performance on a Hadoop cluster
I'm trying to optimize Nutch performance when crawling sites. I'm testing performance on a small Hadoop cluster: 2 nodes, 32 GB RAM, Intel Xeon E3 1245v2 CPU (4C/8T). My Nutch config: http://pastebin.com/bbrhpfuq
So, the problem: the fetching jobs do not run optimally. Some reduce tasks have 4k pages to fetch, while another has 1kk (1 million) pages. For an example, see the screenshot: https://docs.google.com/file/d/0b98dgnxoqkmvt1doovvpuu1pnxm/edit The other reduce tasks finished in 10 minutes, but 1 task has been working for 11 hours and is still running. It's a bottleneck: I have 24 reduce tasks, but only one is working.
Maybe someone can give some useful advice, or links where I can read about this problem.
I think the problem is in Nutch: it takes 50 000 000 pages from 1 site and 500 000 from the others. When the per-host queues are created, I see 1 extremely big queue while the others are small.
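To illustrate why one task ends up with nearly all the work, here is a minimal Python sketch of host-based partitioning. It is not Nutch's actual partitioner: the host names, the CRC32 hash, and the assumption of 23 small sites are made up for illustration. Only the page counts (50 000 000 and 500 000) and the 24 reduce tasks come from the numbers above.

```python
import zlib


def partition_for(host: str, num_tasks: int) -> int:
    """Map a host to a reduce task with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(host.encode("utf-8")) % num_tasks


def load_per_task(pages_per_host: dict, num_tasks: int) -> list:
    """Sum the pages each reduce task receives when all pages of a host go to one task."""
    load = [0] * num_tasks
    for host, pages in pages_per_host.items():
        load[partition_for(host, num_tasks)] += pages
    return load


# Hypothetical crawl: one huge site and 23 small ones (names are invented).
pages_per_host = {"big-site.example": 50_000_000}
for i in range(23):
    pages_per_host[f"small-site-{i}.example"] = 500_000

load = load_per_task(pages_per_host, num_tasks=24)
total = sum(load)
busiest = max(load)

print(f"total pages:  {total}")
print(f"busiest task: {busiest} pages ({busiest / total:.0%} of all work)")
print(f"idle tasks:   {load.count(0)} of {len(load)}")
```

Because every page of a host lands in the same task, the task that receives the big site holds at least 50 000 000 of the 61 500 000 pages, no matter how many reduce tasks are configured. Adding more tasks only spreads the small sites around.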