scala - How to change column types in Spark SQL's DataFrame?


Suppose I'm doing something like:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()

root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)

df.show()
year make  model comment              blank
2012 Tesla S     No comment
1997 Ford  E350  Go get one now th...

But I wanted the year as Int (and perhaps transform some other columns).

The best I could come up with is:

df.withColumn("year2", 'year.cast("int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)

org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]

which is a bit convoluted.

I'm coming from R, and I'm used to being able to write, e.g.

df2 <- df %>%
  mutate(year = year %>% as.integer,
         make = make %>% toupper)

I'm probably missing something, since there should be a better way to do this in Spark/Scala...

[Edit, March 2016: thanks for the votes! Though really, this is not the best answer; I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner.]
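For reference, here is a minimal sketch of that cast-based style, assuming the cars.csv columns from the question (this is my illustration, not quoted from those answers). Both Column.cast forms below are standard Spark SQL API, and passing an existing column name to withColumn replaces that column in the returned DataFrame:

import org.apache.spark.sql.types.IntegerType

// Cast "year" in place: reusing the name "year" replaces the column
// in the new DataFrame rather than appending a "year2".
val dfTyped = df.withColumn("year", df("year").cast(IntegerType))

// Equivalent, using the string form of the target type:
val dfTyped2 = df.withColumn("year", df("year").cast("int"))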

I think your approach is OK. Recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, only creating a new DataFrame each time with a new schema.

Assuming you have an original df with the following schema:

scala> df.printSchema
root
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- CRSDepTime: string (nullable = true)

And some UDFs defined on one or several columns:

import org.apache.spark.sql.functions._

val toInt    = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour   = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
val days_since_nearest_holidays = udf(
  (year: String, month: String, dayOfMonth: String) => year.toInt + 27 + month.toInt - 12
)

Changing the column types, or building a new DataFrame from another one, can then be written like this:

val featureDf = df
  .withColumn("departureDelay", toDouble(df("DepDelay")))
  .withColumn("departureHour",  toHour(df("CRSDepTime")))
  .withColumn("dayOfWeek",      toInt(df("DayOfWeek")))
  .withColumn("dayOfMonth",     toInt(df("DayofMonth")))
  .withColumn("month",          toInt(df("Month")))
  .withColumn("distance",       toDouble(df("Distance")))
  .withColumn("nearestHoliday", days_since_nearest_holidays(
    df("Year"), df("Month"), df("DayofMonth"))
  )
  .select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
          "month", "distance", "nearestHoliday")

Which yields:

scala> featureDf.printSchema
root
 |-- departureDelay: double (nullable = true)
 |-- departureHour: integer (nullable = true)
 |-- dayOfWeek: integer (nullable = true)
 |-- dayOfMonth: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- distance: double (nullable = true)
 |-- nearestHoliday: integer (nullable = true)

This is pretty close to your own solution. Simply, keeping the type changes and other transformations as separate udf vals makes the code more readable and re-usable.
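Along the same lines, if you want the casts themselves to be re-usable, here is a hypothetical helper (my own sketch, not from the answer above; castColumns and the sample type map are invented names) that folds a name-to-type map over the DataFrame using the built-in cast instead of UDFs:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType}

// Hypothetical helper: cast each listed column in place, returning a
// new DataFrame each step (the original DataFrame stays untouched).
def castColumns(df: DataFrame, types: Map[String, DataType]): DataFrame =
  types.foldLeft(df) { case (acc, (name, tpe)) =>
    acc.withColumn(name, acc(name).cast(tpe))
  }

// Example with the flight columns used above:
val typed = castColumns(df, Map(
  "DepDelay"  -> DoubleType,
  "Distance"  -> DoubleType,
  "DayOfWeek" -> IntegerType))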

