scala - Categorical Variables in Apache Spark using MLlib


I am relatively new to the world of Apache Spark. I am trying to estimate a large-scale model using LinearRegressionWithSGD(), with fixed effects and interaction terms, without having to create a huge design matrix.

I noticed there is an implementation supporting categorical variables in DecisionTree:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/decisiontree.scala#l293

It creates a hash map from strings to integers and feeds that to the model. Has anyone attempted a similar exercise for linear models in Spark?

Thanks.

You can use one-hot encoding to convert a categorical variable into a feature space that you can feed to a linear regression model.

For instance, if you have a categorical variable with the values low, medium, and high, you can encode it as 3 different integer features, as below:

category    low medium high
low         1   0      0
medium      0   1      0
high        0   0      1
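
As a rough illustration of the encoding in the table, here is a minimal sketch in plain Scala with no Spark dependency. The object and method names (OneHotSketch, buildIndex, encode) are made up for this example, and mapping an unseen category to all zeros is an assumption of the sketch, not something MLlib prescribes.

```scala
// Minimal sketch in plain Scala; all names here are illustrative, not MLlib API.
object OneHotSketch {
  // Build a category -> column index map; column order follows the input order.
  def buildIndex(categories: Seq[String]): Map[String, Int] =
    categories.distinct.zipWithIndex.toMap

  // Encode one value as a dense 0/1 vector with a single 1.0 in its column.
  def encode(index: Map[String, Int], value: String): Array[Double] = {
    val features = Array.fill(index.size)(0.0)
    // Assumption of this sketch: an unseen category stays all zeros.
    index.get(value).foreach(i => features(i) = 1.0)
    features
  }
}
```

With the index built from Seq("low", "medium", "high"), encoding "medium" gives the middle row of the table. The resulting array could then be wrapped in whatever feature vector type your model expects.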

This is one method for doing that. There are other approaches, but if the number of categorical values isn't large, one-hot encoding is a good fit.
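
On the concern about a huge design matrix: each one-hot row has at most one non-zero entry per categorical variable, so one option is to keep only the positions of the 1s. The sketch below, again in plain Scala, assumes one index map per categorical column and returns a (size, indices, values) triple. The names SparseOneHotSketch and encodeRow are hypothetical, and so is the second variable (north/south) in the usage example. The triple has the same shape as the arguments of MLlib's Vectors.sparse(size, indices, values), but wiring it into LinearRegressionWithSGD is not shown here.

```scala
// Sketch only: sparse one-hot encoding of several categorical columns at once.
object SparseOneHotSketch {
  // indexes: one category -> column map per categorical variable (hypothetical layout).
  // Returns (total feature width, active column indices, their values).
  def encodeRow(indexes: Seq[Map[String, Int]],
                row: Seq[String]): (Int, Array[Int], Array[Double]) = {
    require(indexes.length == row.length, "one value per categorical column")
    // Offset of each variable's block in the concatenated feature space;
    // the last element is the total width.
    val offsets = indexes.scanLeft(0)(_ + _.size)
    // Keep only the positions that would hold a 1; unseen values are skipped.
    val active = indexes.zip(row).zip(offsets).flatMap {
      case ((index, value), offset) => index.get(value).map(offset + _)
    }.toArray
    (offsets.last, active, Array.fill(active.length)(1.0))
  }
}
```

For example, with the maps Map("low" -> 0, "medium" -> 1, "high" -> 2) and Map("north" -> 0, "south" -> 1), the row Seq("high", "south") has width 5 and active indices 2 and 4, so only two entries are stored instead of five.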

