from pyspark.sql import SparkSession

# Create a Spark session (the original snippet used `spark` without defining it)
spark = SparkSession.builder.getOrCreate()

# Create a sample dataset
df = spark.range(1, 1000000)

# Perform some transformations
df_transformed = df.select((df.id * 2).alias("doubled_id"))

# Cache the transformed dataset
df_transformed.cache()

# Perform multiple actions on the cached data
print("Output: ", df_transform...
cache() is a lazy operator: the data is only persisted in memory after an action is executed. Caching is also recorded in the RDD's lineage, which toDebugString() makes visible. Immediately after rdd.cache() the lineage output is unchanged, but once an action runs an extra line appears: `CachedPartitions: 8; MemorySize: 311.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B`, indicating that memory now holds the cached data, ...
Spark Cache and Persist are optimization techniques in DataFrame / Dataset for iterative and interactive Spark applications to improve the performance of jobs.