Apache Spark - Using DataFrame with MLlib


Let's say I have a DataFrame (that I read in from a CSV on HDFS) and I want to train some algorithms on it via MLlib. How do I convert the rows into LabeledPoints or otherwise utilize MLlib on this dataset?

Assuming you're using Scala:

Let's say you obtain the DataFrame as follows:

val results: DataFrame = sqlContext.sql(...)

Step 1: Call results.printSchema(). This shows you not only the columns in the DataFrame and (this is important) their order, but also what Spark SQL thinks their types are. Once you see this output, things get a lot less mysterious.

Step 2: Get an RDD[Row] out of the DataFrame:

val rows: RDD[Row] = results.rdd

Step 3: Now it's just a matter of pulling whatever fields interest you out of the individual rows. For this you need to know the 0-based position of each field and its type, and luckily you obtained all of that in Step 1 above. For example, let's say you did a SELECT x, y, z, w ... and printing the schema yielded

root
 |-- x: double (nullable = ...)
 |-- y: string (nullable = ...)
 |-- z: integer (nullable = ...)
 |-- w: binary (nullable = ...)

And let's say you only wanted to use x and z. You can pull them out into an RDD[(Double, Int)] as follows:

rows.map(row => {
  // x has position 0 and type double
  // z has position 2 and type integer
  (row.getDouble(0), row.getInt(2))
})
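If hard-coded positions feel fragile, the indices can also be looked up by column name from the schema. The following is only a sketch: the object name FieldsByName, the helper extractXZ, and the small in-memory DataFrame are made up for illustration, and the SparkSession setup assumes Spark 2.x or later (on 1.x you would start from your existing sqlContext instead).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object FieldsByName {
  // Resolve the positions once on the driver, then reuse them inside the map.
  // Assumes x is a non-null double column and z is a non-null integer column.
  def extractXZ(results: DataFrame): RDD[(Double, Int)] = {
    val xPos = results.schema.fieldIndex("x")
    val zPos = results.schema.fieldIndex("z")
    results.rdd.map(row => (row.getDouble(xPos), row.getInt(zPos)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fields-by-name")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for sqlContext.sql(...): columns x, y, z
    val results = Seq((1.5, "a", 7), (2.5, "b", 9)).toDF("x", "y", "z")

    extractXZ(results).collect().foreach(println)
    spark.stop()
  }
}
```

The types still have to match what printSchema() reported; only the positions are resolved automatically.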

From here you just use Core Spark to create the relevant MLlib objects. Things can get a little more complicated if your SQL returns columns of array type, in which case you'll have to call getList(...) for that column.
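As a rough sketch of that last step, here is one way the rows might be turned into LabeledPoints. Several things are assumptions made purely for illustration: that x is the label and z the only feature, that an array column holds doubles, the names RowsToLabeledPoints, fromScalars and fromArray, and the tiny in-memory DataFrames standing in for the real query. It uses getSeq, the Scala-collection counterpart of getList, and a SparkSession (Spark 2.x or later).

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowsToLabeledPoints {
  // Scalar columns: label at position 0 (double), one feature at position 2 (integer).
  // Assumes neither column contains nulls.
  def fromScalars(results: DataFrame): RDD[LabeledPoint] =
    results.rdd.map { row =>
      LabeledPoint(row.getDouble(0), Vectors.dense(row.getInt(2).toDouble))
    }

  // Array column: label at position 0, features as array<double> at position 1.
  def fromArray(results: DataFrame): RDD[LabeledPoint] =
    results.rdd.map { row =>
      LabeledPoint(row.getDouble(0), Vectors.dense(row.getSeq[Double](1).toArray))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rows-to-labeled-points")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data shaped like the x, y, z example
    val scalars = Seq((1.0, "a", 10), (0.0, "b", 20)).toDF("x", "y", "z")
    fromScalars(scalars).collect().foreach(println)

    // Hypothetical data with an array-typed feature column
    val arrays = Seq((1.0, Seq(0.5, 1.5)), (0.0, Seq(2.5, 3.5))).toDF("label", "features")
    fromArray(arrays).collect().foreach(println)

    spark.stop()
  }
}
```

The resulting RDD[LabeledPoint] is the input type that the RDD-based MLlib training methods expect.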

