scala - How to change column types in Spark SQL's DataFrame?
Suppose I'm doing something like:
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
    df.printSchema()

    root
     |-- year: string (nullable = true)
     |-- make: string (nullable = true)
     |-- model: string (nullable = true)
     |-- comment: string (nullable = true)
     |-- blank: string (nullable = true)

    df.show()

    year make  model comment              blank
    2012 Tesla S     No comment
    1997 Ford  E350  Go get one now th...
But I wanted the year as an Int (and perhaps transform some other columns).
The best I could come up with is:
    df.withColumn("year2", 'year.cast("int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
    org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.:
    df2 <- df %>%
        mutate(year = year %>% as.integer,
               make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in Spark/Scala...
[Edit, March 2016: thanks for the votes! Though really, this is not the best answer; I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner.]
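For reference, a minimal sketch of that cast-based style, assuming Spark 1.4+ (where Column.cast accepts a type-name string) and the cars df from the question:

    import org.apache.spark.sql.functions.{col, upper}

    // withColumn with an existing column name replaces that column in place,
    // so no temporary "year2" column or rename step is needed.
    val df2 = df
      .withColumn("year", col("year").cast("int"))  // string -> int
      .withColumn("make", upper(col("make")))       // mirrors the R `toupper` mutate

Since cast is a built-in Column operation, the optimizer can also see through it, unlike an opaque UDF.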
I think your approach is OK. Recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, only creating a new DataFrame each time with a new schema.
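To make that immutability concrete, here is a small sketch (assuming the df above): the original DataFrame is untouched by a withColumn call, which returns a brand-new DataFrame:

    // Each transformation returns a new DataFrame; df itself never changes.
    val withYear2 = df.withColumn("year2", df("year").cast("int"))

    df.columns.contains("year2")         // false: original schema untouched
    withYear2.columns.contains("year2")  // true:  new DataFrame, new schema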
Assuming you have an original df with the following schema:
    scala> df.printSchema
    root
     |-- year: string (nullable = true)
     |-- month: string (nullable = true)
     |-- dayofmonth: string (nullable = true)
     |-- dayofweek: string (nullable = true)
     |-- depdelay: string (nullable = true)
     |-- distance: string (nullable = true)
     |-- crsdeptime: string (nullable = true)
And some UDFs defined on one or several columns:
    import org.apache.spark.sql.functions._

    val toInt    = udf[Int, String]( _.toInt)
    val toDouble = udf[Double, String]( _.toDouble)
    val toHour   = udf((t: String) => "%04d".format(t.toInt).take(2).toInt)
    val days_since_nearest_holidays = udf(
      (year: String, month: String, dayOfMonth: String) => year.toInt + 27 + month.toInt - 12
    )
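One caveat: as written, these UDFs throw a NumberFormatException on malformed input. A tolerant variant (an illustrative sketch, not part of the original answer) can return an Option, which Spark maps to a nullable column, so bad rows become null instead of failing the job:

    import scala.util.Try

    // Unparseable strings become null in the resulting column.
    val toIntSafe    = udf((s: String) => Try(s.toInt).toOption)
    val toDoubleSafe = udf((s: String) => Try(s.toDouble).toOption)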
Changing the column types, or even building a new DataFrame from another one, can then be written like this:
    val featureDf = df
      .withColumn("departureDelay", toDouble(df("depdelay")))
      .withColumn("departureHour",  toHour(df("crsdeptime")))
      .withColumn("dayOfWeek",      toInt(df("dayofweek")))
      .withColumn("dayOfMonth",     toInt(df("dayofmonth")))
      .withColumn("month",          toInt(df("month")))
      .withColumn("distance",       toDouble(df("distance")))
      .withColumn("nearestHoliday", days_since_nearest_holidays(
                      df("year"), df("month"), df("dayofmonth")))
      .select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
              "month", "distance", "nearestHoliday")
which yields:
    scala> featureDf.printSchema
    root
     |-- departureDelay: double (nullable = true)
     |-- departureHour: integer (nullable = true)
     |-- dayOfWeek: integer (nullable = true)
     |-- dayOfMonth: integer (nullable = true)
     |-- month: integer (nullable = true)
     |-- distance: double (nullable = true)
     |-- nearestHoliday: integer (nullable = true)
This is pretty close to your own solution. Simply keeping the type changes and other transformations as separate udf vals makes the code more readable and re-usable.
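As a follow-up on reuse: when a transformation is nothing more than a type change, the built-in cast, here via selectExpr in a hedged sketch, achieves the same result without defining any UDF at all:

    // Plain casts expressed in SQL; Catalyst can optimize these,
    // whereas UDF bodies are opaque to it.
    val typedDf = df.selectExpr(
      "cast(depdelay as double) as departureDelay",
      "cast(dayofweek as int) as dayOfWeek",
      "cast(distance as double) as distance")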