Does it make sense to build a data processing pipeline using only Kafka?
I am building a data processing pipeline using Kafka. The pipeline is linear, with 4 stages. The data volume is medium (it will need more than 1 machine, but not hundreds or thousands; the data volume is a few tens of gigabytes). Question: can I use Kafka alone, having each pipeline stage consume from one topic and produce to another topic? Or should I be using Spark or Storm, and why? Of course, I prefer the simplest possible architecture. If I can do it with just Kafka, I'd prefer that. In the future I may need additional machine learning stages, and that may affect the answer. I do not need strong once-only semantics; I can accept some message loss and some duplication with no problem.
My question: can I use Kafka alone, having each pipeline stage consume from one topic and produce to another topic? Should I be using Spark or Storm, and why?
Technically, yes, you can, if you are ready to handle the whole distributed architecture on your own: writing your own multi-threaded producers, managing your consumers, and so on. You also need to think about scalability, performance, durability, etc. This is where the beauty of using a computation engine such as Storm or Spark comes in: you can concentrate on your core logic and leave the infrastructure to be maintained by them.
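To make the Kafka-only option concrete, here is a minimal sketch of the consume-transform-produce loop that each stage would run. It is not a real Kafka client: `InMemoryBroker`, `run_stage`, `run_linear_pipeline`, and `drain` are hypothetical names, and the broker is an in-memory stand-in so the example runs without a cluster. In a real deployment the `poll` and `produce` calls would go through a Kafka consumer and producer, and the concerns mentioned above (threads, consumer management, failure handling) would be yours to implement.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Optional


class InMemoryBroker:
    """Hypothetical stand-in for a Kafka cluster: each topic is a FIFO queue."""

    def __init__(self) -> None:
        self._topics: Dict[str, Deque[str]] = defaultdict(deque)

    def produce(self, topic: str, message: str) -> None:
        # Append a message to the end of the named topic.
        self._topics[topic].append(message)

    def poll(self, topic: str) -> Optional[str]:
        # Return the oldest message on the topic, or None if it is empty.
        queue = self._topics[topic]
        return queue.popleft() if queue else None


def run_stage(
    broker: InMemoryBroker,
    in_topic: str,
    out_topic: str,
    transform: Callable[[str], str],
) -> int:
    """Consume everything on in_topic, transform it, produce to out_topic."""
    processed = 0
    while True:
        message = broker.poll(in_topic)
        if message is None:
            break
        broker.produce(out_topic, transform(message))
        processed += 1
    return processed


def run_linear_pipeline(
    broker: InMemoryBroker,
    topics: List[str],
    transforms: List[Callable[[str], str]],
) -> List[int]:
    """Run stage i from topics[i] to topics[i + 1]; return per-stage counts."""
    if len(topics) != len(transforms) + 1:
        raise ValueError("need exactly one more topic than transforms")
    return [
        run_stage(broker, topics[i], topics[i + 1], transform)
        for i, transform in enumerate(transforms)
    ]


def drain(broker: InMemoryBroker, topic: str) -> List[str]:
    """Read all remaining messages from a topic, oldest first."""
    messages = []
    while True:
        message = broker.poll(topic)
        if message is None:
            return messages
        messages.append(message)
```

A four-stage pipeline then needs five topics, one between each pair of stages plus the input and output. Here the stages run one after another in a single process; with real Kafka each stage would be its own long-running consumer process, which is exactly the part you would have to build and operate yourself.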
For example, using a combination of Kafka and Storm, you can store terabytes of data in Kafka and feed it to Storm for processing. If you are familiar with Storm, a sample topology could look like this:
(kafka-spout consuming messages from a topic) --> (bolt-a processing the data received through the spout and feeding bolt-b) --> (bolt-b pushing the processed data to a Kafka topic)
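The data flow of that topology can be modelled with plain Python generators. This is only an illustration of the shape of the flow, not the Storm API: `kafka_spout`, `bolt_a`, `bolt_b`, and `run_topology` are hypothetical names, and ordinary lists stand in for the Kafka topics.

```python
from typing import Callable, Iterable, Iterator, List


def kafka_spout(source_topic: Iterable[str]) -> Iterator[str]:
    """Emit each message from the source topic (a list stands in for Kafka)."""
    for message in source_topic:
        yield message


def bolt_a(stream: Iterable[str], process: Callable[[str], str]) -> Iterator[str]:
    """Process each message received from the spout and pass it downstream."""
    for message in stream:
        yield process(message)


def bolt_b(stream: Iterable[str], sink_topic: List[str]) -> int:
    """Push each processed message to the output topic; return the count."""
    emitted = 0
    for record in stream:
        sink_topic.append(record)
        emitted += 1
    return emitted


def run_topology(
    source_topic: Iterable[str],
    sink_topic: List[str],
    process: Callable[[str], str],
) -> int:
    """Wire spout -> bolt-a -> bolt-b, mirroring the topology described above."""
    return bolt_b(bolt_a(kafka_spout(source_topic), process), sink_topic)
```

In Storm itself the wiring, the parallelism of each spout and bolt, and the distribution across machines are handled by the framework; the sketch only shows how messages move from one component to the next.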
Such an architecture offers a great deal in terms of scalability, throughput, performance, etc. By making simple configuration changes, you can tune the application to your requirements.