Big Data Analysis Engineer Interview Collection 5: Spark Interview Guide
Authors: 斌迪, HappyMint | Editor: Zandy
This article is a Spark interview guide. It contains two types of questions, Q&A questions and coding exercises. Most of the questions were collected from the internet, and a small number are drawn from our own work experience; each question comes with a reference answer.
Why Test Spark?
Basic Concepts
Basic Operations
// Create an RDD from an in-memory collection
val num = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(num)
// or
val rdd = sc.makeRDD(num)

// Create an RDD from a text file on HDFS
val rdd = sc.textFile("hdfs://hans/data_warehouse/test/data")
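As a quick illustration of transformations versus actions (a minimal sketch, not from the original article), the classic word count over the text RDD created above:

val counts = rdd.flatMap(_.split(" ")) // transformation: split each line into words
  .map((_, 1))                         // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)                  // transformation: sum the counts per word
counts.take(10).foreach(println)       // action: this line triggers the actual computation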
// Create a DataFrame from a JSON file
val df = spark.read.json("/data/tmp/SparkSQL/people.json")
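A few everyday DataFrame operations, sketched here under the assumption that the file follows the people.json sample from the Spark docs (columns name and age):

import spark.implicits._                 // enables the $"col" syntax
df.printSchema()                         // inspect the schema inferred from the JSON
df.select("name").show()                 // project a single column
df.filter($"age" > 21).show()            // filter rows with a predicate
df.createOrReplaceTempView("people")     // register the DataFrame as a SQL view
spark.sql("SELECT name FROM people WHERE age > 21").show()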
// Create a Dataset from a case class
case class Person(name: String, age: Long)
val caseClassDS = Seq(Person("Andy", 32)).toDS()
// Convert a DataFrame into a Dataset
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
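For completeness, a hedged sketch of the remaining conversions among RDD, DataFrame, and Dataset (reusing the Person case class defined above; the value names are illustrative):

import spark.implicits._
val peopleDF  = peopleDS.toDF()   // Dataset -> DataFrame (drops the static type)
val peopleRDD = peopleDS.rdd      // Dataset -> RDD[Person]
val fromRDD   = spark.sparkContext
  .parallelize(Seq(Person("Bob", 25)))
  .toDF()                         // RDD[Person] -> DataFrame via the implicits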
Execution Process
# Submit the example job to a standalone cluster (two HA masters) in cluster mode
spark-submit --master spark://node001:7077,node002:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.11-2.3.1.jar 10000
spark.executor.extraClassPath=/home/hadoop/work/lib/*
spark.driver.extraClassPath=/home/hadoop/work/lib/*
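The two properties above normally live in conf/spark-defaults.conf; the same effect can be achieved per job with --conf at submit time (a sketch reusing the placeholder paths above):

spark-submit \
  --master spark://node001:7077,node002:7077 \
  --deploy-mode cluster \
  --conf spark.executor.extraClassPath=/home/hadoop/work/lib/* \
  --conf spark.driver.extraClassPath=/home/hadoop/work/lib/* \
  --class org.apache.spark.examples.SparkPi \
  ../examples/jars/spark-examples_2.11-2.3.1.jar 10000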
Coding Exercise

Problem: given log records of the form "product url", count the visits to each URL and output the three most-visited URLs for each product line.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Initialize the environment (an app name is required when run outside the shell)
val config = new SparkConf()
config.setMaster("local[2]").setAppName("Top3UrlsPerProduct")
val spark = SparkSession.builder().config(config).getOrCreate()

// Generate mock data: each record has the form "product<N> url<M>"
var data: List[String] = Nil
for (i <- 1 to 1000)
  data = data ::: "product" + Random.nextInt(10).toString + " url" + Random.nextInt(100).toString :: Nil

import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val result = rdd.map(_.split(" "))                // split on the space
  .map(row => ((row(0), row(1)), 1))
  .reduceByKey(_ + _)                             // count visits per (product line, url) pair
  .map(row => (row._1._1, (row._1._2, row._2)))   // re-key by product line
  .groupByKey()
  .map(row => {
    val top3 =
      row._2.toList.sortBy(-_._2)                 // sort by visit count, descending
        .map(_._1).take(3)                        // keep the three most-visited urls
    (row._1, top3)                                // (product line, top 3 urls)
  })
result.foreach(println)
// Sample output (the data is random, so exact urls vary between runs)
(product5,List(url55, url85, url74))
(product8,List(url80, url91, url95))
(product6,List(url96, url25, url7))
(product2,List(url67, url36, url35))
(product7,List(url80, url93, url94))
(product4,List(url99, url57, url98))
(product1,List(url81, url68, url37))
(product0,List(url14, url64, url86))
(product3,List(url80, url28, url15))
(product9,List(url44, url65, url34))
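Note that groupByKey pulls every URL of a product line onto a single executor, which interviewers often probe. One possible alternative (a sketch, not from the original article) solves the same top-3 problem with the DataFrame API and a window function:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val counts = rdd.map(_.split(" "))
  .map(row => (row(0), row(1)))
  .toDF("product", "url")
  .groupBy("product", "url")
  .count()                                            // visits per (product, url)

val byProduct = Window.partitionBy("product").orderBy(col("count").desc)
counts.withColumn("rank", row_number().over(byProduct))
  .filter(col("rank") <= 3)                           // keep the top 3 urls per product line
  .show()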
Summary