| | Python | Scala | Java |
|---|---|---|---|
| Single Machine | word_count_python.py | WordCountScala.scala | WordCountJava.java |
| Spark RDD | word_count_rdd.py | WordCountRDD.scala | |
| Spark DataFrame | word_count_dataframe.py | WordCountDataFrame.scala | |
| Spark SQL | word_count_sql.py | WordCountSQL.scala | |
| Google Cloud Storage | word_count_rdd_gcs.py | ... |
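
As a point of reference for the single-machine Python entry in the listing above, a word count might be sketched roughly as follows. The contents of word_count_python.py are not shown here, so this is only an illustration: the function name `count_words`, the tokenizing regex, and the sample text are all assumptions, not the repository's actual code.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count occurrences of each lowercase word in a string.

    Hypothetical helper: the tokenization rule (letters, digits,
    apostrophes) is an assumption made for this sketch.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    # Made-up sample input, used only for illustration.
    sample = "the quick brown fox jumps over the lazy dog the end"
    for word, n in count_words(sample).most_common(3):
        print(word, n)
```

The Spark variants in the listing would express the same split, group, and count steps over a distributed dataset rather than an in-memory string.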
Transformations - dbt vs PySpark

Since the dataset was small, there was no need for PySpark's parallel processing. Additionally, setting up a Spark cluster on Dataproc, Google's managed Spark service, would have added considerable infrastructure overhead. I chose to use...