Apache Spark - Using DataFrame with MLlib

March 15, 2012

Let's say I have a DataFrame (read in from a CSV on HDFS) and I want to train some algorithms on it via MLlib. How do I convert the rows into LabeledPoints or otherwise utilize MLlib on this dataset?

Assuming you're using Scala:

Let's say you obtain the DataFrame as follows:

val results: DataFrame = sqlContext.sql(...)

Step 1: Call results.printSchema(). This shows you not only the columns in the DataFrame and (this is important) their order, but also what Spark SQL thinks their types are. Once you see this output, things get a lot less mysterious.

Step 2: Get an RDD[Row] out of the DataFrame:

val rows: RDD[Row] = results.rdd

Step 3: Now it's just a matter of pulling whatever fields interest you out of the individual rows. For this you need to know the 0-based position of each field and its type, and luckily you obtained all of that in step 1 above. For example, let's say you did a SELECT x, y, z, w ... and printing the schema yielded

root
 |-- x: double (nullable = ...)
 |-- y: string (nullable = ...)
 |-- z: integer (nullable = ...)
 |-- w: binary (nullable = ...)

And let's say you only wanted to use x and z. You can pull them out into an RDD[(Double, Int)] as follows:

rows.map(row => {
    // x has position 0 and type double
    // z has position 2 and type integer
    (row.getDouble(0), row.getInt(2))
})
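The schema printout also carries a nullable flag per column, and the primitive getters such as getDouble and getInt fail on a null value. A minimal sketch of a null-aware variant, assuming you simply want to skip rows where x or z is missing (the helper name extractXZ is hypothetical, not part of Spark):

```scala
import org.apache.spark.sql.Row

// Hypothetical helper: returns None when either field is null,
// otherwise the (x, z) pair taken from positions 0 and 2.
def extractXZ(row: Row): Option[(Double, Int)] =
  if (row.isNullAt(0) || row.isNullAt(2)) None
  else Some((row.getDouble(0), row.getInt(2)))
```

With the rows value from step 2, something like rows.flatMap(r => extractXZ(r).toList) would then keep only the complete pairs.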

From here you just use core Spark to create the relevant MLlib objects. Things get a little more complicated if your SQL returns columns of array type, in which case you'll have to call getList(...) for that column.
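As a rough sketch of that last step, here is one way the rows could be turned into LabeledPoints. Which column is the label and which are features is an assumption made for illustration: x is treated as the label and z as the only feature. The second function assumes a different, hypothetical schema with a double label at position 0 and an array-of-double column at position 1, and uses getSeq, the Scala-collection sibling of getList. Both function names are made up for this example.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// Assumption: x (position 0, double) is the label and
// z (position 2, integer) is the single feature.
def toLabeledPoint(row: Row): LabeledPoint =
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getInt(2).toDouble))

// Assumption: hypothetical schema with a double label at position 0
// and an array<double> feature column at position 1.
def toLabeledPointFromArray(row: Row): LabeledPoint = {
  val features = row.getSeq[Double](1).toArray
  LabeledPoint(row.getDouble(0), Vectors.dense(features))
}
```

Given the rows value from step 2, rows.map(toLabeledPoint) should yield an RDD[LabeledPoint] that the RDD-based MLlib training methods accept.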
