Apache Spark - Using DataFrame with MLlib
Let's say I have a DataFrame (read in from a CSV on HDFS) and I want to train some algorithms on it via MLlib. How do I convert the rows into LabeledPoints or otherwise utilize MLlib on this dataset?
Assuming you're using Scala:

Let's say you obtain the DataFrame as follows:

    val results: DataFrame = sqlContext.sql(...)

Step 1: Call results.printSchema(). This will show you not only the columns in the DataFrame and (this is important) their order, but also what Spark SQL thinks their types are. Once you see this output, things get a lot less mysterious.

Step 2: Get an RDD[Row] out of the DataFrame:

    val rows: RDD[Row] = results.rdd

Step 3: Now it's just a matter of pulling whatever fields interest you out of the individual rows. For this you need to know the 0-based position of each field and its type, and luckily you obtained all of that in Step 1 above. For example, let's say you did a SELECT x, y, z, w ... and printing the schema yielded

    root
     |-- x: double (nullable = ...)
     |-- y: string (nullable = ...)
     |-- z: integer (nullable = ...)
     |-- w: binary (nullable = ...)

And let's say you only wanted to use x and z. You can pull them out into an RDD[(Double, Int)] as follows:

    rows.map(row => {
      // x has position 0 and type double
      // z has position 2 and type integer
      (row.getDouble(0), row.getInt(2))
    })

From here you just use core Spark to create the relevant MLlib objects. Things can get a little more complicated if your SQL returns columns of array type, in which case you'll have to call getList(...) for that column.
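
As a rough sketch of that last step, here is one way the rows could be turned into LabeledPoints. It assumes x is the label and z is the only feature, which the question doesn't actually specify, and that neither column contains nulls. The helper names rowToLabeledPoint and toLabeledPoints are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical helper: builds one LabeledPoint from one Row.
// Assumption: x (position 0, double) is the label and
// z (position 2, integer) is the single feature.
def rowToLabeledPoint(row: Row): LabeledPoint = {
  val label = row.getDouble(0)
  // MLlib feature vectors hold doubles, so the integer is widened.
  val features = Vectors.dense(row.getInt(2).toDouble)
  LabeledPoint(label, features)
}

// Hypothetical helper: applies the conversion to every row of the RDD.
def toLabeledPoints(rows: RDD[Row]): RDD[LabeledPoint] =
  rows.map(rowToLabeledPoint)
```

With the rows value from Step 2, toLabeledPoints(rows) would give an RDD[LabeledPoint] that the RDD-based MLlib training methods accept. If a column can be null, check row.isNullAt(i) before calling the typed getter.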