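Two ways to group by in Spark, one of which is less prone to OOM

Without an RDD: the advantage is that it rarely runs out of memory; the drawback is that only built-in post-groupBy operations such as count, sum, and max are supported.

    .select("the_key")
    .groupBy("the_key").count()
    .toDF("the_key", "the_count")

With an RDD: the drawback is that it runs out of memory easily; the advantage is that you can apply a custom operation inside each group after grouping.

    .rdd.groupBy(row => ...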
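The two styles above can be compared side by side in a minimal sketch. It assumes a local SparkSession and a DataFrame with a string column `the_key` and an integer column `value`; the `value` column, the object name `GroupByStyles`, the helper names, and the sample rows are all hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, max, sum}

object GroupByStyles {
  // Style 1: stay in the DataFrame API and use built-in aggregates only.
  def builtinAggregates(df: DataFrame): DataFrame =
    df.groupBy("the_key")
      .agg(
        count("*").as("the_count"),
        sum("value").as("the_sum"),
        max("value").as("the_max")
      )

  // Style 2: drop to the RDD API. Each group arrives as an Iterable[Row],
  // so arbitrary per-group logic is possible, but a large group has to be
  // held by a single task, which is where the memory pressure comes from.
  def customPerGroup(df: DataFrame): Map[String, String] =
    df.rdd
      .groupBy(row => row.getAs[String]("the_key"))
      .mapValues(rows => rows.map(_.getAs[Int]("value")).toSeq.sorted.mkString(","))
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupby-styles")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(("a", 1), ("a", 5), ("b", 2)).toDF("the_key", "value")

    builtinAggregates(df).show()
    customPerGroup(df).foreach(println)

    spark.stop()
  }
}
```

The custom step here (sorting and joining the values of each group) only stands in for whatever per-group logic is needed. When the logic can be expressed with built-in aggregates, the first style is usually the safer choice.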
Taking N rows from each group in Spark

Using the groupBy interface for this may cause an OOM; a window function can be used instead:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{rand, row_number}

    val windowFun = Window.partitionBy("groupby_column").orderBy(rand())
    val resultDF = dataDF.withColumn("rank", row_number().over(windowFun)) ....
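As a rough sketch of how the window approach can be finished, the version below numbers the rows of each group in random order and keeps those whose number is at most N. The source snippet is cut off before this step, so the filter on `rank`, the helper name `takeNPerGroup`, the object name, and the sample data are assumptions made for illustration and not the original author's code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

object SamplePerGroup {
  // Hypothetical helper: keep at most n randomly chosen rows per group.
  def takeNPerGroup(df: DataFrame, groupCol: String, n: Int): DataFrame = {
    // Number the rows of each group in random order.
    val windowFun = Window.partitionBy(groupCol).orderBy(rand())
    df.withColumn("rank", row_number().over(windowFun))
      .filter(col("rank") <= n) // row_number starts at 1
      .drop("rank")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-per-group")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: group "a" has three rows, group "b" has one.
    val dataDF = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4))
      .toDF("groupby_column", "value")

    takeNPerGroup(dataDF, "groupby_column", 2).show()

    spark.stop()
  }
}
```

Groups with fewer than N rows come back whole. Replacing `rand()` with a real ordering column such as `col("score").desc` turns the random sample into a top-N per group.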
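    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Rank the rows of each group by score, highest first.
    val windowFun = Window.partitionBy("groupby_column").orderBy(col("score").desc)
    val resultDF = dataDF.withColumn("rank", row_number().over(windowFun))
      .filter().map((row: Row) => {
        //...
      })
    ...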