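Embracing Open Source, and We Mean It: NetEase EasyData's 2020 Apache Spark Contribution Summary
Open source software is eating the world. In the future, no company will be able to do without it, no company will be able to stay outside the open source model of collaborative development, and no company will turn down what is essentially a win-win arrangement. This article comes from the NetEase EasyData (网易易数) development team, part of NetEase Shufan (网易数帆), and records the team's main contributions to Apache Spark in 2020.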
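Preface
"Self-developed does not mean self-controllable; openness is the future." These words from Wang Yuan, NetEase Vice President, Executive Dean of the NetEase Hangzhou Research Institute, and General Manager of NetEase Shufan, reflect NetEase's determination and long-standing commitment to embracing open source and building open source ecosystems.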
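Why we embrace open source
For an enterprise, the answer is simple: it pays. This is a practical goal, and one we should face squarely. By "using" open source, an enterprise can lower its total cost of ownership, improve software quality, and gain early access to cutting-edge technology, which cuts costs and raises efficiency. By "participating" in open source, it can raise its employees' technical skills, build a brand in technical communities, and earn the trust of leading engineers, which helps with both training and recruiting talent. By "building" an open source ecosystem, it can promote its technical ideas, help shape industry standards, and deepen cooperation with upstream and downstream partners, so that technical leadership and collaboration go together.
Analyst reports from multiple organizations show that enterprises are embracing open source more and more. IT leaders consider enterprise open source critical to their infrastructure software strategy, while proprietary software leads to high capital expenditure and vendor lock-in.
For engineers, open source is also the best place to devote themselves fully to work they love and to change the world through technical innovation.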
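NetEase and Apache Spark
Apache Spark is currently the mainstream big data compute engine inside NetEase Group. It handles PB-scale data processing every day, covering offline (batch) computing, real-time computing, traditional machine learning, and other workloads.
To reduce the cost of maintaining Spark inside NetEase and to speed up the adoption of new Spark technology, the NetEase EasyData team adopted the following three strategies.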
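1. Integration
All of our engineers take an active part in community contributions to varying degrees, deepening our cooperation with the community and becoming part of it.
The community looks for developers who can contribute continuously, because that makes a long-term working relationship and sufficient mutual trust possible. To some extent, this lets our internal requirements turn into industry standards, and it lets us adopt the community's latest technology as soon as it appears.
The list at the end of this article is an incomplete tally of the main contributions by NetEase engineers to Apache Spark as of the end of 2020, about 300 commits.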
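2. Pluginization
Because business priorities differ, the supporting technology always diverges somewhat between companies and industries, so some modification of the Spark source code is unavoidable. For maintainability, our strategy is to move these specific requirements out of Spark and into plugins, which reduces coupling with the Spark trunk and allows lightweight iteration. A few of these plugins are introduced below: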
1. Spark-ranger
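Spark-ranger is an access control plugin that provides SQL-standard, fine-grained access control for the Spark compute engine, including column-level authorization, row-level filtering, and data masking. In big data warehouse scenarios, Spark SQL is a high-performance query engine, but data security has long been its weak point. The project was created to close the last permission gap in the data warehouse management features of the NetEase Mammut (猛犸) product. As part of Mammut's security components, Spark-ranger authorizes hundreds of thousands of Spark jobs for internal business teams every day, and it also protects customer data outside the company. The project has been handed over to the Apache Software Foundation and is maintained as a submodule of the https://submarine.apache.org/ project.
Spark-ranger open source repository: https://github.com/NetEase/spark-ranger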
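The three capabilities named above (column-level authorization, row-level filtering, and data masking) can be illustrated with a small, self-contained sketch. This is not Spark-ranger's implementation or API. The `Policy` class, the `mask` rule, and the `run_query` function are hypothetical names introduced only to show how the three checks combine on an in-memory table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


def allow_all(row: Row) -> bool:
    """Default row filter: every row is visible."""
    return True


@dataclass
class Policy:
    """Hypothetical per-user policy for one table (illustration only)."""
    allowed_columns: Set[str]                               # column-level authorization
    row_filter: Callable[[Row], bool] = allow_all           # row-level filtering
    masked_columns: Set[str] = field(default_factory=set)   # data masking


def mask(value: object) -> str:
    """Hide everything except the last two characters of a value."""
    text = str(value)
    return "*" * max(len(text) - 2, 0) + text[-2:]


def run_query(rows: List[Row], columns: List[str], policy: Policy) -> List[Row]:
    """Apply the policy to a simple projection over in-memory rows."""
    # 1. Column-level authorization: reject the whole query if any
    #    requested column is not granted.
    denied = [c for c in columns if c not in policy.allowed_columns]
    if denied:
        raise PermissionError(f"no SELECT privilege on columns: {denied}")

    result = []
    for row in rows:
        # 2. Row-level filtering: silently drop rows the user may not see.
        if not policy.row_filter(row):
            continue
        # 3. Data masking: return a masked value for sensitive columns.
        result.append({
            c: mask(row[c]) if c in policy.masked_columns else row[c]
            for c in columns
        })
    return result


if __name__ == "__main__":
    table = [
        {"name": "alice", "phone": "13800001234", "region": "east", "salary": 100},
        {"name": "bob", "phone": "13900005678", "region": "west", "salary": 200},
    ]
    analyst = Policy(
        allowed_columns={"name", "phone", "region"},
        row_filter=lambda r: r["region"] == "east",
        masked_columns={"phone"},
    )
    print(run_query(table, ["name", "phone"], analyst))
```

In a real engine these checks are applied to the query plan rather than to materialized rows, so the sketch only conveys the order and effect of the three checks.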
2. Spark-greenplum
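Spark-greenplum is a high-performance data transfer tool between the big data warehouse and PostgreSQL and Greenplum databases. It delivers a hundredfold performance improvement over the native Apache Spark API. The project was created to improve data exchange between NetEase Mammut (猛犸) and NetEase Youdata (网易有数), and it is used in the step where Youdata pulls data from the Mammut big data platform.
Spark-greenplum open source repository: https://github.com/NetEase/spark-greenplum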
3. Spark-alarm
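Spark-alarm is a fine-grained monitoring tool for Spark jobs. It can monitor a Spark job comprehensively, as well as user-defined key metrics, and it offers a range of alerting channels such as NetEase Sentry (网易哨兵), email, and EasyOps. The project aims to keep jobs tied to business KPIs and SLAs running safely. Spark-alarm is a job-level SDK; business teams inside NetEase currently embed it as instrumentation in their own critical jobs.
Spark-alarm open source repository: https://github.com/NetEase/spark-alarm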
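The idea of checking custom key metrics and sending alerts through pluggable channels can be sketched as follows. This is a minimal illustration under stated assumptions, not the Spark-alarm SDK: `Rule`, `Notifier`, and `check_metrics` are hypothetical names, and the metrics are plain numbers passed in by the caller rather than values collected from a running Spark job.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical alert rule: fire when a metric exceeds a threshold."""
    metric: str
    threshold: float
    message: str


# A notifier is any callable that delivers one alert text, for example
# a function that sends an email or posts to an internal alerting system.
Notifier = Callable[[str], None]


def check_metrics(
    metrics: Dict[str, float],
    rules: List[Rule],
    notifiers: List[Notifier],
) -> List[str]:
    """Evaluate every rule and send each fired alert to every notifier."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # Metrics that were not reported are skipped, not treated as zero.
        if value is not None and value > rule.threshold:
            text = f"{rule.message}: {rule.metric}={value} exceeds {rule.threshold}"
            fired.append(text)
            for notify in notifiers:
                notify(text)
    return fired


if __name__ == "__main__":
    rules = [
        Rule("failed_tasks", 0, "Task failures"),
        Rule("duration_s", 600, "Job too slow"),
    ]
    # print stands in for a real alert channel in this sketch.
    check_metrics({"failed_tasks": 3, "duration_s": 120.0}, rules, [print])
```

Keeping the channels as plain callables means a new alerting channel can be added without touching the rule evaluation.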
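3. Ecosystem
As mentioned earlier, building an open source ecosystem helps promote technical ideas, shape industry standards, and deepen cooperation with upstream and downstream partners, so that technical leadership and collaboration go together.
In big data, the Apache Hadoop ecosystem was first built around three Google papers, and an active Apache Spark ecosystem then grew around Hadoop. Now products at other layers, such as data lakes, are being built around this ecosystem to achieve truly unified batch and stream computing, while cross-pollination with the CNCF Kubernetes community is bringing big data and cloud computing closer together. On top of this ecosystem, and starting from the business needs of NetEase and its partners, we have built the Kyuubi ecosystem.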
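Kyuubi is a high-performance, general-purpose JDBC service engine for big data. With a flexible architecture and a unified SQL API, Kyuubi adapts to different compute engines in pursuit of the best compute performance, to different resource schedulers so that deployments can switch freely between coupled and decoupled storage and compute, to different computing models to achieve a unified batch and stream architecture, and to different business scenarios to support one-stop big data application development. The goal is to let users work with big data the way they work with ordinary data.
First, general and easy-to-use data access. Kyuubi relies on the standard JDBC interface to provide convenient data access in big data scenarios. End users do not need to be aware of the underlying big data platform (compute engine, storage service, metadata management, and so on) and can focus on developing their own business systems and extracting value from data.
Second, high-performance query capability. Kyuubi relies on compute engines such as Apache Spark and Flink for high-performance data queries, so every improvement in the engines themselves translates into a leap in the performance of the Kyuubi service. On top of that, Kyuubi provides data caching, dynamic query optimization, and other capabilities to improve performance further. On the one hand, it caches frequently accessed queries to improve query efficiency. On the other hand, it dynamically optimizes the query plan according to the volume of data the user accesses, so that very large results can be streamed back while performance is preserved.
Third, complete enterprise-grade features. Building on its own architecture, Kyuubi provides authentication and authorization services to keep data secure; a robust high-availability service to keep the service available; multi-tenant resource isolation for end-to-end isolation of compute resources and data; and two-level elastic resource management, which keeps costs under control while raising resource utilization and meets the performance and response requirements of interactive queries, batch processing, point lookups, full table scans, and other scenarios.
Fourth, rich ecosystem support and development. A good open source product cannot do without a good open source ecosystem. While embracing top-level open source ecosystems such as Spark, Kyuubi on the one hand takes advantage of the openness of those projects' ecosystems, so that it can quickly extend to their existing ecosystems as well as to new features and new ecosystems, such as cloud-native support and data lake (Data Lake/Lake House) support. On the other hand, Kyuubi actively builds and improves its own ecosystem to fill gaps along the pipeline. For example, the https://github.com/netease/spark-ranger project addresses the access control weakness in the big data pipeline, and the https://github.com/netease/spark-greenplum project addresses the performance of data exchange between Spark and the traditional database PostgreSQL and the MPP database Greenplum.
Kyuubi open source repository: https://github.com/netease/kyuubi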
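The caching of frequently accessed queries mentioned in the second point can be pictured as a small least-recently-used (LRU) result cache placed in front of the engine. The sketch below is an assumption-laden illustration and not Kyuubi's implementation: `QueryCache` is a hypothetical class, the `execute` callable stands in for whatever actually runs the SQL, and cache invalidation when the underlying data changes is deliberately left out.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple

Rows = List[Tuple]


class QueryCache:
    """Hypothetical LRU cache of query results keyed by SQL text."""

    def __init__(self, execute: Callable[[str], Rows], capacity: int = 128) -> None:
        self._execute = execute      # stand-in for the real query engine
        self._capacity = capacity
        self._entries: "OrderedDict[str, Rows]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql: str) -> str:
        # Only trim surrounding whitespace and a trailing semicolon.
        # Anything more aggressive could change string literals.
        return sql.strip().rstrip(";").strip()

    def query(self, sql: str) -> Rows:
        key = self._key(sql)
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cache miss: run the query and remember its result.
        self.misses += 1
        rows = self._execute(sql)
        self._entries[key] = rows
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return rows


if __name__ == "__main__":
    def fake_engine(sql: str) -> Rows:
        print(f"executing: {sql}")
        return [(1,)]

    cache = QueryCache(fake_engine, capacity=2)
    cache.query("SELECT 1")
    cache.query("SELECT 1;")   # served from the cache, engine not called
    print(cache.hits, cache.misses)
```

A production cache would also need to account for the user's identity and for changes to the underlying tables, which this sketch ignores.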
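Summary
2020 was an extraordinary year. The threat from nature made us deeply aware of how important open cooperation is for all of humanity.
At its core, an open source community is its developers. Embracing open source and building an open source ecosystem fits NetEase's mission and vision: bring together the power of people and create a better life through technological innovation.
We take part in open source not only because, as noted above, it serves the company's own interests, but also because we love it: we give our all to what we love.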
Appendix: main contributions by NetEase engineers to Apache Spark as of the end of 2020
* ae1d05927a [SPARK-33892][SQL] Display char/varchar in DESC and SHOW CREATE TABLE
* 2287f56a3e [SPARK-33879][SQL] Char Varchar values fails w/ match error as partition columns
* a3dd8dacee [SPARK-33877][SQL] SQL reference documents for INSERT w/ a column list
* 6da5cdf1db [SPARK-33876][SQL] Add length-check for reading char/varchar from tables w/ a external location
* f5fd10b1bc [SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
* dd44ba5460 [SPARK-32976][SQL][FOLLOWUP] SET and RESTORE hive.exec.dynamic.partition.mode for HiveSQLInsertTestSuite to avoid flakiness
* c17c76dd16 [SPARK-33599][SQL][FOLLOWUP] FIX Github Action with unidoc
* 728a1298af [SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions
* 205d8e40bc [SPARK-32991][SQL] [FOLLOWUP] Reset command relies on session initials first
* 4d47ac4b4b [SPARK-33705][SQL][TEST] Fix HiveThriftHttpServerSuite flakiness
* 31e0baca30 [SPARK-33740][SQL] hadoop configs in hive-site.xml can overrides pre-existing hadoop ones
* c88eddac3b [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types
* da72b87374 [SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results
* 2da72593c1 [SPARK-32976][SQL] Support column list in INSERT statement
* cdd8e51742 [SPARK-33419][SQL] Unexpected behavior when using SET commands before a query in SparkSession.sql
* 4335af075a [MINOR][DOC] spark.executor.memoryOverhead is not cluster-mode only
* 036c11b0d4 [SPARK-33397][YARN][DOC] Fix generating md to html for available-patterns-for-shs-custom-executor-log-url
* 82d500a05c [SPARK-33193][SQL][TEST] Hive ThriftServer JDBC Database MetaData API Behavior Auditing
* e21bb710e5 [SPARK-32991][SQL] Use conf in shared state as the original configuraion for RESET
* dcb0820433 [SPARK-32785][SQL][DOCS][FOLLOWUP] Update migaration guide for incomplete interval literals
* 2507301705 [SPARK-33159][SQL] Use hive-service-rpc as dependency instead of inlining the generated code
* 17d309dfac [SPARK-32963][SQL] empty string should be consistent for schema name in SparkGetSchemasOperation
* e2a740147c [SPARK-32874][SQL][FOLLOWUP][TEST-HIVE1.2][TEST-HADOOP2.7] Fix spark-master-test-sbt-hadoop-2.7-hive-1.2
* 9e9d4b6994 [SPARK-32905][CORE][YARN] ApplicationMaster fails to receive UpdateDelegationTokens message
* 316242b768 [SPARK-32874][SQL][TEST] Enhance result set meta data check for execute statement operation with thrift server
* 5669b212ec [SPARK-32840][SQL] Invalid interval value can happen to be just adhesive with the unit
* 9ab8a2c36d [SPARK-32826][SQL] Set the right column size for the null type in SparkGetColumnsOperation
* de44e9cfa0 [SPARK-32785][SQL] Interval with dangling parts should not results null
* 1fba286407 [SPARK-32781][SQL] Non-ASCII characters are mistakenly omitted in the middle of intervals
* 6dacba7fa0 [SPARK-32733][SQL] Add extended information - arguments/examples/since/notes of expressions to the remarks field of GetFunctionsOperation
* 0626901bcb [SPARK-32729][SQL][DOCS] Add missing since version for math functions
* f14f3742e0 [SPARK-32696][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] Get columns operation should handle interval column properly
* 1f3bb51757 [SPARK-32683][DOCS][SQL] Fix doc error and add migration guide for datetime pattern F
* c26a97637f Revert "[SPARK-32412][SQL] Unify error handling for spark thrift serv…
* 1b6f482adb [SPARK-32492][SQL][FOLLOWUP][TEST-MAVEN] Fix jenkins maven jobs
* 7f5326c082 [SPARK-32492][SQL] Fulfill missing column meta information COLUMN_SIZE /DECIMAL_DIGITS/NUM_PREC_RADIX/ORDINAL_POSITION for thriftserver client tools
* 3deb59d5c2 [SPARK-31709][SQL] Proper base path for database/table location when it is a relative path
* f4800406a4 [SPARK-32406][SQL][FOLLOWUP] Make RESET fail against static and core configs
* 510a1656e6 [SPARK-32412][SQL] Unify error handling for spark thrift server operations
* d315ebf3a7 [SPARK-32424][SQL] Fix silent data change for timestamp parsing if overflow happens
* d3596c04b0 [SPARK-32406][SQL] Make RESET syntax support single configuration reset
* b151194299 [SPARK-32392][SQL] Reduce duplicate error log for executing sql statement operation in thrift server
* 29b7eaa438 [MINOR][SQL] Fix warning message for ThriftCLIService.GetCrossReference and GetPrimaryKeys
* efa70b8755 [SPARK-32145][SQL][FOLLOWUP] Fix type in the error log of SparkOperation
* bdeb626c5a [SPARK-32272][SQL] Add SQL standard command SET TIME ZONE
* 4609f1fdab [SPARK-32207][SQL] Support 'F'-suffixed Float Literals
* 59a70879c0 [SPARK-32145][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message
* 9f8e15bb2e [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
* 93529a8536 [SPARK-31957][SQL] Cleanup hive scratch dir for the developer api startWithContext
* abc8ccc37b [SPARK-31926][SQL][TESTS][FOLLOWUP][TEST-HIVE1.2][TEST-MAVEN] Fix concurrency issue for ThriftCLIService to getPortNumber
* a0187cd6b5 [SPARK-31926][SQL][TEST-HIVE1.2][TEST-MAVEN] Fix concurrency issue for ThriftCLIService to getPortNumber
* 22dda6e18e [SPARK-31939][SQL][TEST-JAVA11] Fix Parsing day of year when year field pattern is missing
* 6a424b93e5 [SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions
* 02f32cfae4 [SPARK-31926][SQL][TEST-HIVE1.2] Fix concurrency issue for ThriftCLIService to getPortNumber
* fc6af9d900 [SPARK-31867][SQL][FOLLOWUP] Check result differences for datetime formatting
* 9d5b5d0a58 [SPARK-31879][SQL][TEST-JAVA11] Make week-based pattern invalid for formatting too
* afcc14c6d2 [SPARK-31896][SQL] Handle am-pm timestamp parsing when hour is missing
* afe95bd9ad [SPARK-31892][SQL] Disable week-based date filed for parsing
* c59f51bcc2 [SPARK-31879][SQL] Using GB as default Locale for datetime formatters
* 547c5bf552 [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
* fe1da296da [SPARK-31833][SQL][TEST-HIVE1.2] Set HiveThriftServer2 with actual port while configured 0
* 311fe6a880 [SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in DateExpressionsSuite
* 695cb617d4 [SPARK-31771][SQL] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
* 0df8dd6073 [SPARK-30352][SQL] DataSourceV2: Add CURRENT_CATALOG function
* 7e2ed40d58 [SPARK-31759][DEPLOY] Support configurable max number of rotate logs for spark daemons
* 1f29f1ba58 [SPARK-31684][SQL] Overwrite partition failed with 'WRONG FS' when the target partition is not belong to the filesystem as same as the table
* 1d66085a93 [SPARK-31289][TEST][TEST-HIVE1.2] Eliminate org.apache.spark.sql.hive.thriftserver.CliSuite flakiness
* 503faa24d3 [SPARK-31715][SQL][TEST] Fix flaky SparkSQLEnvSuite that sometimes varies single derby instance standard
* ce714d8189 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs
* b31ae7bb0b [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
* bd6b53cc0b [SPARK-31631][TESTS] Fix test flakiness caused by MiniKdc which throws 'address in use' BindException with retry
* 9241f8282f [SPARK-31586][SQL][FOLLOWUP] Restore SQL string for datetime - interval operations
* ea525fe8c0 [SPARK-31597][SQL] extracting day from intervals should be interval.days + days in interval.microsecond
* 295d866969 [SPARK-31596][SQL][DOCS] Generate SQL Configurations from hive module to configuration doc
* 54996be4d2 [SPARK-31527][SQL][TESTS][FOLLOWUP] Add a benchmark test for datetime add/subtract interval operations
* beec8d535f [SPARK-31586][SQL] Replace expression TimeSub(l, r) with TimeAdd(l -r)
* 5ba467ca1d [SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc
* ebc8fa50d0 [SPARK-31527][SQL] date add/subtract interval only allow those day precision in ansi mode
* 7959808e96 [SPARK-31564][TESTS] Fix flaky AllExecutionsPageSuite for checking 1970
* f92652d0b5 [SPARK-31528][SQL] Remove millennium, century, decade from trunc/date_trunc fucntions
* caf3ab8411 [SPARK-31552][SQL] Fix ClassCastException in ScalaReflection arrayClassFor
* 8424f55229 [SPARK-31532][SQL] Builder should not propagate static sql configs to the existing active or default SparkSession
* 8dc2c0247b [SPARK-31522][SQL] Hive metastore client initialization related configurations should be static
* 3b5792114a [SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with functionname in the note field of expression info
* 37d2e037ed [SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function
* 2c2062ea7c [SPARK-31498][SQL][DOCS] Dump public static sql configurations through doc generation
* 1985437110 [SPARK-31474][SQL] Consistency between dayofweek/dow in extract exprsession and dayofweek function
* 77cb7cde0d [SPARK-31469][SQL][TESTS][FOLLOWUP] Remove unsupported fields from ExtractBenchmark
* 697083c051 [SPARK-31469][SQL] Make extract interval field ANSI compliance
* 31b907748d [SPARK-31414][SQL][DOCS][FOLLOWUP] Update default datetime pattern for json/csv APIs documentations
* d65f534c5a [SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing
* a454510917 [SPARK-31392][SQL] Support CalendarInterval to be reflect to CalendarntervalType
* 3c94a7c8f5 [SPARK-29311][SQL][FOLLOWUP] Add migration guide for extracting second from datetimes
* 1ce584f6b7 [SPARK-31321][SQL] Remove SaveMode check in v2 FileWriteBuilder
* f376d24ea1 [SPARK-31280][SQL] Perform propagating empty relation after RewritePredicateSubquery
* 5945d46c11 [SPARK-31225][SQL] Override sql method of OuterReference
* 8be16907c2 [SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
* 44bd36ad7b [SPARK-31234][SQL] ResetCommand should reset config to sc.conf only
* b024a8a69e [MINOR][DOCS] Fix some links for python api doc
* 336621e277 [SPARK-31258][BUILD] Pin the avro version in SBT
* f81f11822c [SPARK-31189][R][DOCS][FOLLOWUP] Replace Datetime pattern links in R doc
* 88ae6c4481 [SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document
* 3d695954e5 [SPARK-31150][SQL][FOLLOWUP] handle ' as escape for text
* 57fcc49306 [SPARK-31176][SQL] Remove support for 'e'/'c' as datetime pattern charactar
* f1d27cdd91 [SPARK-31119][SQL] Add interval value support for extract expression as extract source
* 5bc0d76591 [SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
* 0946a9514f [SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp
* fbc9dc7e9d [SPARK-31129][SQL][TESTS] Fix IntervalBenchmark and DateTimeBenchmark
* 7b4b29e8d9 [SPARK-31131][SQL] Remove the unnecessary config spark.sql.legacy.timeParser.enabled
* 18f2730874 [SPARK-31066][SQL][TEST-HIVE1.2] Disable useless and uncleaned hive SessionState initialization parts
* 2b46662bd0 [SPARK-31111][SQL][TESTS] Fix interval output issue in ExtractBenchmark
* 3bd6ebff81 [SPARK-30189][SQL] Interval from year-month/date-time string should handle whitespaces
* f45ae7f2c5 [SPARK-31038][SQL] Add checkValue for spark.sql.session.timeZone
* 3edab6cc1d [MINOR][CORE] Expose the alias -c flag of --conf for spark-submit
* 1fac06c430 Revert "[SPARK-30808][SQL] Enable Java 8 time API in Thrift server"
* 1383bd459a [SPARK-30970][K8S][CORE] Fix NPE while resolving k8s master url
* 2d2706cb86 [SPARK-30956][SQL][TESTS] Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
* a6026c830a [MINOR][BUILD] Fix make-distribution.sh to show usage without 'echo' cmd
* 761209c1f2 [SPARK-30919][SQL] Make interval multiply and divide's overflow behavior consistent with other operations
* 46019b6e6c [MINOR][DOCS] Fix fabric8 version in documentation
* 0353cbf092 [MINOR][DOC] Fix 2 style issues in running-on-kubernetes doc
* 58b9ca1e6f [SPARK-30592][SQL][FOLLOWUP] Add some round-trip test cases
* 3228d723a4 [SPARK-30603][SQL] Move RESERVED_PROPERTIES from SupportsNamespaces and TableCatalog to CatalogV2Util
* 8e280cebf2 [SPARK-30592][SQL] Interval support for csv and json funtions
* f2d71f5838 [SPARK-30591][SQL] Remove the nonstandard SET OWNER syntax for namespaces
* af705421db [SPARK-30593][SQL] Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI and no round trip
* 730388b369 [SPARK-30547][SQL][FOLLOWUP] Update since anotation for CalendarInterval class
* 0388b7a3ec [SPARK-30568][SQL] Invalidate interval type as a field table schema
* 24efa43826 [SPARK-30019][SQL] Add the owner property to v2 table
* 4806cc5bd1 [SPARK-30547][SQL] Add unstable annotation to the CalendarInterval class
* 17857f9b8b [SPARK-30551][SQL] Disable comparison for interval type
* 82f25f5855 [SPARK-30507][SQL] TableCalalog reserved properties shoudn't be changed via options or tblpropeties
* bcf07cbf5f [SPARK-30018][SQL] Support ALTER DATABASE SET OWNER syntax
* c37312342e [SPARK-30183][SQL] Disallow to specify reserved properties in CREATE/ALTER NAMESPACE syntax
* 8c121b0827 [SPARK-30431][SQL] Update SqlBase.g4 to create commentSpec pattern like locationSpec
* c49388a484 [SPARK-30214][SQL] A new framework to resolve v2 commands
* e04309cb1f [SPARK-30341][SQL] Overflow check for interval arithmetic operations
* f0bf2eb006 [SPARK-30356][SQL] Codegen support for the function str_to_map
* da65a955ed [SPARK-30266][SQL] Avoid match error and int overflow in ApproximatePercentile and Percentile
* 12249fcdc7 [SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types
* d38f816748 [MINOR][SQL][DOC] Fix some format issues in Dataset API Doc
* cc7f1eb874 [SPARK-29774][SQL][FOLLOWUP] Add a migration guide for date_add and date_sub
* bf7215c510 [SPARK-30066][SQL][FOLLOWUP] Remove size field for interval column cache
* d3ec8b1735 [SPARK-30066][SQL] Support columnar execution on interval types
* 8f0eb7dc86 [SPARK-29587][SQL] Support SQL Standard type real as float(4) numeric as decimal
* 24c4ce1e64 [SPARK-28351][SQL][FOLLOWUP] Remove 'DELETE FROM' from unsupportedHiveNativeCommands
* e88d74052b [SPARK-30147][SQL] Trim the string when cast string type to booleans
* 35bab33984 [SPARK-30121][BUILD] Fix memory usage in sbt build script
* b9cae37750 [SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres
* 332e252a14 [SPARK-29425][SQL] The ownership of a database should be respected
* 65552a81d1 [SPARK-30083][SQL] visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking
* 39291cff95 [SPARK-30048][SQL] Enable aggregates with interval type values for RelationalGroupedDataset
* 4e073f3c50 [SPARK-30047][SQL] Support interval types in UnsafeRow
* 4fd585d2c5 [SPARK-30008][SQL] The dataType of collect_list/collect_set aggs should be ArrayType(_, false)
* ed0c33fdd4 [SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string
* 8b0121bea8 [MINOR][DOC] Fix the CalendarIntervalType description
* de21f28f8a [SPARK-29986][SQL] casting string to date/timestamp/interval should trim all whitespaces
* 5cf475d288 [SPARK-30000][SQL] Trim the string when cast string type to decimals
* 2dd6807e42 [SPARK-28023][SQL] Add trim logic in UTF8String's toInt/toLong to make it consistent with other string-numeric casting
* d555f8fcc9 [SPARK-29961][SQL][FOLLOWUP] Remove useless test for VectorUDT
* 7a70670345 [SPARK-29961][SQL] Implement builtin function - typeof
* 79ed4ae2db [SPARK-29926][SQL] Fix weird interval string whose value is only a dangling decimal point
* ea010a2bc2 [SPARK-29873][SQL][TEST][FOLLOWUP] set operations should not escape when regen golden file with --SET --import both specified
* ae6b711b26 [SPARK-29941][SQL] Add ansi type aliases for char and decimal
* 50f6d930da [SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval
* 5cebe587c7 [SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type
* 0c68578fa9 [SPARK-29888][SQL] new interval string parser shall handle numeric with only fractional part
* 15a72f3755 [SPARK-29287][CORE] Add LaunchedExecutor message to tell driver which executor is ready for making offers
* f926809a1f [SPARK-29390][SQL] Add the justify_days(), justify_hours() and justif_interval() functions
* d99398e9f5 [SPARK-29855][SQL] typed literals with negative sign with proper result or exception
* d06a9cc4bd [SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values
* e026412d9c [SPARK-29679][SQL] Make interval type comparable and orderable
* e7f7990bc3 [SPARK-29688][SQL] Support average for interval type values
* 0a03839366 [SPARK-29787][SQL] Move methods add/subtract/negate from CalendarInterval to IntervalUtils
* 9562b26914 [SPARK-29757][SQL] Move calendar interval constants together
* 3437862975 [SPARK-29387][SQL][FOLLOWUP] Fix issues of the multiply and divide for intervals
* 4615769736 [SPARK-29603][YARN] Support application priority for YARN priority scheduling
* 44b8fbcc58 [SPARK-29663][SQL] Support sum with interval type values
* 8cf76f8d61 [SPARK-29285][SHUFFLE] Temporary shuffle files should be able to handle disk failures
* 5ba17d09ac [SPARK-29722][SQL] Non reversed keywords should be able to be used in high order functions
* dc987f0c8b [SPARK-29653][SQL] Fix MICROS_PER_MONTH in IntervalUtils
* 8e667db5d8 [SPARK-29629][SQL] Support typed integer literal expression
* 9a46702791 [SPARK-29554][SQL] Add `version` SQL function
* 0cf4f07c66 [SPARK-29545][SQL] Add support for bit_xor aggregate function
* 5b4d9170ed [SPARK-27879][SQL] Add support for bit_and and bit_or aggregates
* ef4c298cc9 [SPARK-29405][SQL] Alter table / Insert statements should not change a table's ownership
* 4b902d3b45 [SPARK-29491][SQL] Add bit_count function support
* 6d4cc7b855 [SPARK-27880][SQL] Add bool_and for every and bool_or for any as function aliases
* 02c5b4f763 [SPARK-28947][K8S] Status logging not happens at an interval for liveness
* f4c73b7c68 [SPARK-27301][DSTREAM] Shorten the FileSystem cached life cycle to the cleanup method inner scope
* ac9c0536bc [SPARK-26794][SQL] SparkSession enableHiveSupport does not point to hive but in-memory while the SparkContext exists
* f8346d2fc0 [SPARK-25174][YARN] Limit the size of diagnostic message for am to unregister itself from rm
* 4a2b15f0af [SPARK-24241][SUBMIT] Do not fail fast when dynamic resource allocation enabled with 0 executor
* a7755fd8ce [SPARK-23639][SQL] Obtain token before init metastore client in SparkSQL CLI
* 189f56f3dc [SPARK-23383][BUILD][MINOR] Make a distribution should exit with usage while detecting wrong options
* eefec93d19 [SPARK-23295][BUILD][MINOR] Exclude Waring message when generating versions in make-distribution.sh
* dd52681bf5 [SPARK-23253][CORE][SHUFFLE] Only write shuffle temporary index file when there is not an existing one
* 793841c6b8 [SPARK-21771][SQL] remove useless hive client in SparkSQLEnv
* 9fa703e893 [SPARK-22950][SQL] Handle ChildFirstURLClassLoader's parent
* 28ab5bf597 [SPARK-22487][SQL][HIVE] Remove the unused HIVE_EXECUTION_VERSION property
* c755b0d910 [SPARK-22463][YARN][SQL][HIVE] add hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distribute archive
* ee571d79e5 [SPARK-22466][SPARK SUBMIT] export SPARK_CONF_DIR while conf is default
* 99e32f8ba5 [SPARK-22224][SQL] Override toString of KeyValue/Relational-GroupedDataset
* 581200af71 [SPARK-21428][SQL][FOLLOWUP] CliSessionState should point to the actual metastore not a dummy one
* b83b502c41 [SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for reusing CliSessionState
* 2387f1e316 [SPARK-21675][WEBUI] Add a navigation bar at the bottom of the Details for Stage Page
* e9d268f63e [SPARK-20096][SPARK SUBMIT][MINOR] Expose the right queue name not null if set by --conf or configure file
* 7363dde634 [SPARK-19626][YARN] Using the correct config to set credentials update time
* e33053ee00 [SPARK-11583] [CORE] MapStatus Using RoaringBitmap More Properly
* 7466031632 [SPARK-32106][SQL] Implement script transform in sql/core
* 0603913c66 [SPARK-33593][SQL] Vector reader got incorrect data with binary partition value
* 25c6cc25f7 [SPARK-26341][WEBUI] Expose executor memory metrics at the stage level, in the Stages tab
* 5f9a7fea06 [SPARK-33428][SQL] Conv UDF use BigInt to avoid Long value overflow
* d7f4b2ad50 [SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+
* 47326ac1c6 [SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+
* dd32f45d20 [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBeingTransferred` use default value
* 34f5e7ce77 [SPARK-33302][SQL] Push down filters through Expand
* 0c943cd2fb [SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size
* e43cd8ccef [SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive
* a1629b4a57 [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location
* f8277d3aa3 [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory
* ddc7012b3d [SPARK-32243][SQL] HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number error
* 0b5a379c1f [SPARK-33023][CORE] Judge path of Windows need add condition `Utils.isWindows`
* c336ddfdb8 [SPARK-32867][SQL] When explain, HiveTableRelation show limited message
* 5e6173ebef [SPARK-31670][SQL] Trim unnecessary Struct field alias in Aggregate/GroupingSets
* 55ce49ed28 [SPARK-32400][SQL][TEST][FOLLOWUP][TEST-MAVEN] Fix resource loading error in HiveScripTransformationSuite
* 9808c15eec [SPARK-32608][SQL][FOLLOW-UP][TEST-HADOOP2.7][TEST-HIVE1.2] Script Transform ROW FORMAT DELIMIT value should format value
* c75a82794f [SPARK-32667][SQL] Script transform 'default-serde' mode should pad null value to filling column
* 6dae11d034 [SPARK-32607][SQL] Script Transformation ROW FORMAT DELIMITED `TOK_TABLEROWFORMATLINES` only support '\n'
* 03e2de99ab [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value
* 643cd876e4 [SPARK-32352][SQL] Partially push down support data filter if it mixed in partition filters
* 4cf8c1d07d [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec
* d251443a02 [SPARK-32403][SQL] Refactor current ScriptTransformationExec
* 5521afbd22 [SPARK-32220][SQL][FOLLOW-UP] SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result
* 6d499647b3 [SPARK-32105][SQL] Refactor current ScriptTransformationExec code
* 09789ff725 [SPARK-31226][CORE][TESTS] SizeBasedCoalesce logic will lose partition
* 560fe1f54c [SPARK-32220][SQL] SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result
* 15fb5d7677 [SPARK-28169][SQL] Convert scan predicate condition to CNF
* 0d9faf602e [SPARK-31655][BUILD] Upgrade snappy-java to 1.1.7.5
* 6bc8d84130 [SPARK-29492][SQL] Reset HiveSession's SessionState conf's ClassLoader when sync mode
* 246c398d59 [SPARK-30435][DOC] Update doc of Supported Hive Features
* 3eade744f8 [SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf
* da27f91560 [SPARK-29957][TEST] Reset MiniKDC's default enctypes to fit jdk8/jdk11
* 6146dc4562 [SPARK-29874][SQL] Optimize Dataset.isEmpty()
* eb79af8dae [SPARK-29145][SQL][FOLLOW-UP] Move tests from `SubquerySuite` to `subquery/in-subquery/in-joins.sql`
* e524a3a223 [SPARK-29742][BUILD] Update checkstyle plugin's check dir scope
* d6e33dc377 [SPARK-29599][WEBUI] Support pagination for session table in JDBC/ODBC Tab
* 67cf0433ee [SPARK-29145][SQL] Support sub-queries in join conditions
* 484f93e255 [SPARK-29530][SQL] Make SQLConf in SQL parse process thread safe
* 9a3dccae72 [SPARK-29379][SQL] SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'
* ef81525a1a [SPARK-29308][BUILD] Update deps in dev/deps/spark-deps-hadoop-3.2 for hadoop-3.2
* 178a1f3558 [SPARK-29305][BUILD] Update LICENSE and NOTICE for Hadoop 3.2
* 0cf2f48dfe [SPARK-29022][SQL] Fix SparkSQLCLI can not add jars by AddJarCommand
* 1d4b2f010b [SPARK-29247][SQL] Redact sensitive information in when construct HiveClientHive.state
* cc852d4eec [SPARK-29015][SQL][TEST-HADOOP3.2] Reset class loader after initializing SessionState for built-in Hive 2.3
* d22768a6be [SPARK-29036][SQL] SparkThriftServer cancel job after execute() thread interrupted
* fe4bee8fd8 [SPARK-29162][SQL] Simplify NOT(IsNull(x)) and NOT(IsNotNull(x))
* 54d3f6e7ec [SPARK-28982][SQL] Implementation Spark's own GetTypeInfoOperation
* 9f478a6832 [SPARK-28901][SQL] SparkThriftServer's Cancel SQL Operation show it in JDBC Tab UI
* 036fd3903f [SPARK-27637][SHUFFLE][FOLLOW-UP] For nettyBlockTransferService, if IOException occurred while create client, check whether relative executor is alive before retry #24533
* e853f068f6 [SPARK-33526][SQL][FOLLOWUP] Fix flaky test due to timeout and fix docs
* 1dd63dccd8 [SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst match special Array value
* bc46d273e0 [SPARK-33840][DOCS] Add spark.sql.files.minPartitionNum to performence tuning doc
* 839d6899ad [SPARK-33733][SQL] PullOutNondeterministic should check and collect deterministic field
* 5bab27e00b [SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver
Author: NetEase EasyData Spark development team