DataFrame basic operations

1. select()

The select() function picks one or more columns from a DataFrame and returns a new DataFrame.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA")]
columns = ["firstname", "lastname", "country", "state"]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

# a second DataFrame with a nested struct column
# (the nested 'name' fields and the data rows are reconstructed; the original snippet was truncated)
data2 = [(("James", "", "Smith"), "OH", "M"),
         (("Anna", "Rose", ""), "NY", "F")]
schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True)])
df2 = spark.createDataFrame(data=data2, schema=schema)
df2.printSchema()
df2.show(truncate=False)  # shows all columns
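Once df2 exists, select() can pull both top-level and nested columns; dot notation flattens struct fields into ordinary columns. A minimal sketch, assuming the df2 defined above:

# select top-level columns
df2.select("state", "gender").show(truncate=False)

# select individual fields inside the 'name' struct
df2.select("name.firstname", "name.lastname").show(truncate=False)

# select every field of the struct at once
df2.select("name.*").show(truncate=False)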
from pyspark.sql import Row

r = Row(age=11, name='Alice')
print(r.__fields__)  # ['age', 'name']

Selecting one or more columns with select:

df["age"]
df.age
df.select("name")
df.select(df['name'], df['age'] + 1)
df.select(df.a, df.b, df.c)             # select columns a, b, c
df.select(df["a"], df["b"], df["c"])    # the same selection with bracket indexing
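selectExpr() covers the same ground with SQL expression strings, which is often shorter when a projection includes arithmetic or renaming. A minimal sketch, assuming a df with name and age columns as above (the alias age_next is made up for illustration):

# each string is parsed as a SQL expression
df.selectExpr("name", "age + 1 AS age_next").show()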
Convert the first four columns to float (assuming the raw data was read as strings):

# rename the columns
df = data.toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "class")

from pyspark.sql.functions import col

# convert every column except the class label to float
for col_name in df.columns[:-1]:
    df = df.withColumn(col_name, col(col_name).cast('float'))
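The same conversion can be written as a single select() instead of repeated withColumn() calls, which builds the new DataFrame in one projection. A sketch, assuming the five-column df above:

from pyspark.sql.functions import col

# cast the four feature columns in one pass; keep the class label unchanged
df = df.select(*[col(c).cast('float') for c in df.columns[:-1]], col(df.columns[-1]))
df.printSchema()  # the feature columns should now report float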
# DataFrame Example 2
columns = ["name", "languagesAtSchool", "currentState"]
df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()
drop_list = ['Dates', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
data = data.select([column for column in data.columns if column not in drop_list])
data.show(5)

Use the printSchema() method to display the structure of the data:

data.printSchema()
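An equivalent way to remove those columns is DataFrame.drop(), which takes the names to discard rather than the names to keep. A sketch, assuming the same drop_list:

# drop() silently ignores names that aren't present, so the list can be reused safely
data = data.drop(*drop_list)
data.printSchema()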
data.select('columns').distinct().show()

Like a Python set, distinct() removes duplicate rows, and chaining .count() gives the number of rows that remain.

Random sampling can be done in two ways: by querying randomly in Hive, or inside pyspark itself.

# random sampling in Hive
sql = "select * from data order by rand() limit 2000"

# inside pyspark
sample = result.sample(False, 0.5, 0)  # randomly select ~50% of the rows
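The three positional arguments to sample() are withReplacement, fraction, and seed; the fraction is a per-row probability, not an exact row count. For per-group proportions there is also sampleBy(). A sketch, assuming a result DataFrame with a 'label' column (the column name and fractions are hypothetical):

# uniform sampling: no replacement, keep roughly half the rows, fixed seed for reproducibility
sample1 = result.sample(False, 0.5, 0)

# stratified sampling: a separate fraction for each value of 'label'
sample2 = result.sampleBy('label', fractions={0: 0.1, 1: 0.5}, seed=0)

print(sample1.count(), sample2.count())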
5.1 The "Select" operation

A column can be accessed by attribute ("author") or by index (dataframe['author']).

# Show the first 10 entries of the author column
dataframe.select("author").show(10)

# Show the first 10 entries of the author, title, rank, price columns
dataframe.select("author", "title", "rank", "price").show(10)
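Columns can also be renamed inline during the projection with alias(). A minimal sketch, assuming the same dataframe (the name 'writer' is made up for illustration):

from pyspark.sql.functions import col

# select two columns and rename 'author' to 'writer' on the way out
dataframe.select(col("author").alias("writer"), col("price")).show(10)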
from pyspark.sql.functions import col

fga_py = (df.groupBy('yr')
            .agg({'mp': 'sum', 'fg3a': 'sum'})
            .select(col('yr'), (36 * col('sum(fg3a)') / col('sum(mp)')).alias('fg3a_p36m'))
            .orderBy('yr'))

from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')  # style name assumed; the original snippet was cut off at plt.sty...
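To plot the aggregated result, the grouped DataFrame can be collected to pandas first. A sketch of one plausible continuation, assuming fga_py from above (column names follow the aliases produced by the aggregation):

# the aggregate is tiny (one row per year), so toPandas() is safe here
fga_pd = fga_py.toPandas()

plt.plot(fga_pd['yr'], fga_pd['fg3a_p36m'])
plt.xlabel('year')
plt.ylabel('fg3a_p36m (3-point attempts per 36 minutes)')
plt.show()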