How to Optimize Spark

 Time: 2024-10-25 02:40:53

1. repartition and coalesce

Spark provides the `repartition()` function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of `repartition()` called `coalesce()` that allows avoiding data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call `coalesce()`, check the current number of partitions using `rdd.partitions.size` in Scala, `rdd.partitions().size()` in Java, or `rdd.getNumPartitions()` in Python, and make sure that you are coalescing to fewer partitions than the RDD currently has.

Summary: when you need to change the partitioning of an RDD and the target number of partitions is smaller than the current number, use `coalesce()` rather than `repartition()`. For more details on tuning partitions, see Chapter 4 of Learning Spark.
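The rule above can be sketched as a small helper. This is only an illustration: `resize_partitions` is a hypothetical name, not part of Spark, and it assumes nothing beyond the PySpark RDD methods `getNumPartitions()`, `coalesce()`, and `repartition()`.

```python
def resize_partitions(rdd, target):
    """Return `rdd` with `target` partitions, avoiding a full shuffle when possible.

    Hypothetical helper for illustration; `rdd` is assumed to behave like a
    PySpark RDD.
    """
    current = rdd.getNumPartitions()
    if target < current:
        # Shrinking: coalesce() can merge partitions without a full shuffle.
        return rdd.coalesce(target)
    if target > current:
        # Growing: only repartition() (a full shuffle) can add partitions.
        return rdd.repartition(target)
    # Already at the requested size, so there is nothing to do.
    return rdd
```

With a real RDD this would be called as `resize_partitions(rdd, 4)`. Note that shrinking to very few partitions also reduces the parallelism of the stages that follow.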

2. Passing Functions to Spark

In Python, we have three options for passing functions into Spark.

Lambda expressions:

    word = rdd.filter(lambda s: "error" in s)

Top-level functions:

    import my_personal_lib
    word = rdd.filter(my_personal_lib.containsError)

Locally defined functions:

    def containsError(s):
        return "error" in s
    word = rdd.filter(containsError)

One issue to watch out for when passing functions is inadvertently serializing the object containing the function. When you pass a function that is the member of an object, or contains references to fields in an object (e.g., self.field), Spark sends the entire object to worker nodes, which can be much larger than the bit of information you need. Sometimes this can also cause your program to fail, if your class contains objects that Python can’t figure out how to pickle.

The wrong way:

    class SearchFunctions(object):
        def __init__(self, query):
            self.query = query

        def isMatch(self, s):
            return self.query in s

        def getMatchesFunctionReference(self, rdd):
            # Problem: references all of "self" in "self.isMatch"
            return rdd.filter(self.isMatch)

        def getMatchesMemberReference(self, rdd):
            # Problem: references all of "self" in "self.query"
            return rdd.filter(lambda x: self.query in x)

The right way:

    class WordFunctions(object):
        ...
        def getMatchesNoReference(self, rdd):
            # Safe: extract only the field we need into a local variable
            query = self.query
            return rdd.filter(lambda x: query in x)
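To see why the local-variable version is safer, it can help to inspect what each function object actually carries around. The sketch below uses plain Python introspection rather than Spark itself; the `payload` field, the snake_case method names, and the `captured_objects` helper are all hypothetical and exist only for this illustration. Whatever a function holds a reference to is what a serializer would have to ship along with it.

```python
class SearchFunctions:
    def __init__(self, query, payload):
        self.query = query
        self.payload = payload  # hypothetical stand-in for large, unrelated state

    def is_match(self, s):
        return self.query in s

    def member_reference(self):
        # The lambda closes over "self", so the whole object travels with it.
        return lambda s: self.query in s

    def no_reference(self):
        # Copy the one field we need into a local; the lambda closes over that.
        query = self.query
        return lambda s: query in s


def captured_objects(fn):
    """List what `fn` holds on to: its bound instance and its closure cells."""
    captured = []
    if hasattr(fn, "__self__"):
        captured.append(fn.__self__)  # bound method: the instance itself
    for cell in getattr(fn, "__closure__", None) or ():
        captured.append(cell.cell_contents)  # variables captured by a closure
    return captured


sf = SearchFunctions("error", payload=list(range(100_000)))
for fn in (sf.is_match, sf.member_reference(), sf.no_reference()):
    print([type(obj).__name__ for obj in captured_objects(fn)])
```

The first two functions hold the entire `SearchFunctions` instance, including `payload`, while the third holds only the query string.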
