How to use Dataframe in pySpark (compared with SQL)

— version 1.0: initial @20190428
— version 1.1: add image processing, broadcast and accumulator
— version 1.2: add ambiguous column handling, MapType

When we implement Spark, there are two ways to manipulate data: RDD and DataFrame. I don't know why most books start with RDD rather than DataFrame. Since RDD is more OOP and…