Does it make sense to build a data processing pipeline using only Kafka?


I am building a data processing pipeline using Kafka. The pipeline is linear, with 4 stages. The data volume is medium: a few tens of gigabytes, so I will need more than one machine, but not hundreds or thousands. My question: can I use Kafka alone, with each pipeline stage consuming from one topic and producing to another? Or should I be using Spark or Storm, and if so, why? Of course, I prefer the simplest possible architecture, so if I can do it with Kafka alone, I'd prefer that. In the future I may need additional machine learning stages, which may affect the answer. I have no strong exactly-once requirements: I can accept some message loss, and duplication is no problem.

My question: can I use Kafka alone, with each pipeline stage consuming from one topic and producing to another? Or should I be using Spark or Storm, and if so, why?

Technically, yes, you can, if you are ready to handle the whole distributed architecture on your own: writing your own multi-threaded producers, managing consumers, and so on. You also need to think about scalability, performance, durability, etc. Here comes the beauty of using a computation engine like Storm or Spark: you can concentrate on your core logic and leave the infrastructure to be maintained by them.
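To make the Kafka-only option concrete, here is a minimal sketch of a linear four-stage pipeline in which every stage reads one topic and writes the next. It does not talk to Kafka: `InMemoryBroker`, `run_stage`, the topic names, and the transforms are all hypothetical stand-ins chosen for illustration. In a real deployment each stage would be a separate process using a Kafka consumer and producer, and you would own the threading, offset handling, and failure recovery yourself.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Hypothetical stand-in for Kafka: one FIFO queue per topic name."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        """Yield and remove every message currently queued on the topic."""
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()


def run_stage(broker, in_topic, out_topic, transform):
    """Consume in_topic, apply transform, produce the results to out_topic."""
    for message in broker.consume(in_topic):
        result = transform(message)
        if result is not None:  # a stage drops a message by returning None
            broker.produce(out_topic, result)


broker = InMemoryBroker()
for line in [" Alpha ", "", "beta"]:
    broker.produce("raw", line)

# Four linear stages, each reading one topic and writing the next.
stages = [
    ("raw", "trimmed", str.strip),
    ("trimmed", "non_empty", lambda s: s or None),
    ("non_empty", "upper", str.upper),
    ("upper", "final", lambda s: f"{s}!"),
]
for in_topic, out_topic, transform in stages:
    run_stage(broker, in_topic, out_topic, transform)

final_messages = list(broker.topics["final"])
```

The stage logic itself stays small. What the sketch leaves out (partitioning, consumer groups, restarts, backpressure) is exactly the part you would have to build and operate yourself without a computation engine.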

For example, using a combination of Kafka and Storm, you can store terabytes of data in Kafka and feed them to Storm for processing. If you are familiar with Storm, a sample topology can look like this:

(kafka-spout consuming messages from a topic) --> (bolt-a processing the data received through the spout and feeding bolt-b) --> (bolt-b pushing the processed data to a Kafka topic)

Such an architecture offers a great deal in scalability, throughput, performance, etc. By making simple configuration changes, you can tune the application to your requirements.
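The spout-and-bolt flow described above can be sketched with plain Python generators to show how data moves between the components. This is not Storm's API: `kafka_spout`, `bolt_a`, `bolt_b`, and the word-count processing are hypothetical names and logic chosen for illustration, and plain lists stand in for the Kafka topics.

```python
def kafka_spout(messages):
    """Stand-in for a Kafka spout: emits each message from a source topic."""
    for message in messages:
        yield message


def bolt_a(stream):
    """Bolt A: processes each message (here, counts words) and feeds bolt B."""
    for message in stream:
        yield {"text": message, "words": len(message.split())}


def bolt_b(stream, output_topic):
    """Bolt B: pushes the processed data to an output topic (a list here)."""
    for record in stream:
        output_topic.append(record)


source_topic = ["hello storm", "kafka feeds the spout"]
output_topic = []

# Wire the components in the same order as the topology: spout -> A -> B.
bolt_b(bolt_a(kafka_spout(source_topic)), output_topic)
```

In a real Storm topology the wiring is declared once and the engine decides how many instances of each component run and on which machines. That is the part this single-process sketch cannot show.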

