Overwriting a MySQL table with Spark without changing the existing table structure

2021-04-29

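Preface

This article records how Spark can overwrite a MySQL table that already exists without modifying the existing table structure, and traces the main source code to understand how this works. The main scenario: the MySQL table is first created with a CREATE TABLE statement, and Spark then imports the data, possibly with repeated full-table overwrites.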

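Code

The code has been uploaded to GitHub.

The key setting is .option("truncate", true); see the official Spark documentation at http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

The main logic is: read a CSV file, convert a date column, then overwrite the data into the MySQL table that has already been created.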

package com.dkl.blog.spark.mysql

import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

/**
 * Created by dongkelun on 2021/4/29 14:06
 *
 * Overwrite a MySQL table with Spark without modifying the existing table structure
 */
object SparkMysqlOverwriteTruncateTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkMysqlOverwriteTruncateTable")
      .master("local[*]")
      .getOrCreate()

    // rewriteBatchedStatements=true lets the MySQL driver batch the inserts, which can improve write throughput
    val url = "jdbc:mysql://192.168.44.128:3306/test?useUnicode=true&characterEncoding=utf-8&rewriteBatchedStatements=true"
    val tableName = "trafficbase_cljbxx"
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "Root-123456")

    // Read a local CSV file
    var df = spark.read.option("header", "true").csv("D:\\文档\\inspur\\csg\\功能测评\\测试数据\\trafficbase_cljbxx.csv")
    // Convert the string column to a date (MM is the month; lowercase mm would mean minutes)
    df = df.withColumn("czsj", to_date(col("czsj"), "dd/MM/yyyy"))
    df.show()

    df.write.mode("overwrite")
      .option("truncate", true) // before writing, TRUNCATE the table instead of dropping it
      .jdbc(url = url, table = tableName, prop)
    spark.stop()
  }
}
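A detail worth checking in the date conversion is the pattern letters. The following is a minimal sketch, assuming the pattern follows java.text.SimpleDateFormat conventions (which Spark 2.x date functions are generally understood to use); DatePatternCheck and reparse are hypothetical names used only for illustration. It shows that a day/month/year string needs uppercase MM, because lowercase mm is read as minutes and the month silently falls back to January.

```scala
import java.text.SimpleDateFormat

object DatePatternCheck {
  // Parse `input` with `pattern`, then render the result as yyyy-MM-dd
  // so the effect of the pattern letters is easy to compare.
  def reparse(input: String, pattern: String): String = {
    val parsed = new SimpleDateFormat(pattern).parse(input)
    new SimpleDateFormat("yyyy-MM-dd").format(parsed)
  }
}

// MM = month: the date round-trips as expected.
println(DatePatternCheck.reparse("29/04/2021", "dd/MM/yyyy")) // 2021-04-29
// mm = minute: "04" becomes the minute and the month defaults to January.
println(DatePatternCheck.reparse("29/04/2021", "dd/mm/yyyy")) // 2021-01-29
```

If the CSV stores dates in another layout, the pattern passed to to_date has to be adjusted accordingly.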

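Source code walkthrough

This article only does a brief source code walkthrough.

Spark 2.2.1

I originally intended to use Spark 3.0.1 for the explanation, but found that the Spark 3 source has changed slightly. Since my earlier study notes were mostly based on Spark 2.2, I use the Spark 2.2.1 source here first and will cover the changes in the Spark 3 source on that basis later; the main logic is the same.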

jdbc

Start from the entry point, the jdbc function:

def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
  assertNotPartitioned("jdbc")
  assertNotBucketed("jdbc")
  // connectionProperties should override settings in extraOptions.
  this.extraOptions ++= connectionProperties.asScala
  // explicit url and dbtable should override all
  this.extraOptions += ("url" -> url, "dbtable" -> table)
  format("jdbc").save()
}
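The merge order in the jdbc function above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: JdbcOptionMerge and merge are hypothetical names, and the writer options are modeled as a simple string map (the truncate flag is assumed to be stored as the string "true"). It shows that options set on the writer survive, connection properties override them, and the explicit url and dbtable override everything.

```scala
import java.util.Properties
import scala.collection.mutable

object JdbcOptionMerge {
  // Mimics the order used in the jdbc function: writer options first,
  // then connection properties, then the explicit url and dbtable.
  def merge(
      writerOptions: Map[String, String],
      connectionProperties: Properties,
      url: String,
      table: String): Map[String, String] = {
    val extraOptions = mutable.HashMap[String, String]()
    extraOptions ++= writerOptions
    // Copy every string property; later puts overwrite earlier keys.
    val names = connectionProperties.stringPropertyNames().iterator()
    while (names.hasNext) {
      val key = names.next()
      extraOptions += (key -> connectionProperties.getProperty(key))
    }
    extraOptions += ("url" -> url)
    extraOptions += ("dbtable" -> table)
    extraOptions.toMap
  }
}
```

For example, merging Map("truncate" -> "true") with properties containing user=root yields a map that still carries truncate=true alongside user, url and dbtable, which is why the truncate option reaches the JDBC provider later in the call chain.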

format("jdbc").save()

The format method sets the source and returns the DataFrameWriter itself:

def format(source: String): DataFrameWriter[T] = {
  this.source = source //source="jdbc"
  this
}

def save(): Unit = {
  if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
    throw new AnalysisException("Hive data source can only be used with tables, you can not " +
      "write files of Hive data source directly.")
  }

  assertNotBucketed("save")

  runCommand(df.sparkSession, "save") {
    SaveIntoDataSourceCommand(
      query = df.logicalPlan,
      provider = source,
      partitionColumns = partitioningColumns.getOrElse(Nil),
      options = extraOptions.toMap,
      mode = mode)
  }
}

SaveIntoDataSourceCommand

Next, the run method of SaveIntoDataSourceCommand is executed:

case class SaveIntoDataSourceCommand(
    query: LogicalPlan,
    provider: String,
    partitionColumns: Seq[String],
    options: Map[String, String],
    mode: SaveMode) extends RunnableCommand {

  override protected def innerChildren: Seq[QueryPlan[_]] = Seq(query)

  override def run(sparkSession: SparkSession): Seq[Row] = {
    DataSource(
      sparkSession,
      className = provider,
      partitionColumns = partitionColumns,
      options = options).write(mode, Dataset.ofRows(sparkSession, query))

    Seq.empty[Row]
  }

  override def simpleString: String = {
    val redacted = Utils.redact(SparkEnv.get.conf, options.toSeq).toMap
    s"SaveIntoDataSourceCommand ${provider}, ${partitionColumns}, ${redacted}, ${mode}"
  }
}

DataSource.write

The run method mainly calls DataSource.write:

def write(mode: SaveMode, data: DataFrame): Unit = {
    if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
      throw new AnalysisException("Cannot save interval data type into external storage.")
    }

    providingClass.newInstance() match {
      case dataSource: CreatableRelationProvider =>
        dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data)
      case format: FileFormat =>
        writeInFileFormat(format, mode, data)
      case _ =>
        sys.error(s"${providingClass.getCanonicalName} does not allow create table as select.")
    }
  }
}

Execution reaches this branch:

case dataSource: CreatableRelationProvider =>
  dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data)
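The dispatch in DataSource.write is an ordinary type-based pattern match. A minimal sketch of the same idea follows; the traits and classes here (CreatableProvider, FileFormatLike, JdbcLikeProvider, ParquetLikeFormat) are hypothetical stand-ins and not Spark's real interfaces.

```scala
object ProviderDispatch {
  // Hypothetical stand-ins for the provider interfaces.
  trait CreatableProvider { def createRelation(mode: String): String }
  trait FileFormatLike

  class JdbcLikeProvider extends CreatableProvider {
    def createRelation(mode: String): String = s"jdbc write with mode=$mode"
  }
  class ParquetLikeFormat extends FileFormatLike

  // Choose the write path from the runtime type of the provider instance.
  def write(provider: Any, mode: String): String = provider match {
    case p: CreatableProvider => p.createRelation(mode)
    case _: FileFormatLike    => "file-based write"
    case other =>
      sys.error(s"${other.getClass.getName} does not support writing")
  }
}
```

A provider that implements the creatable interface, as the JDBC provider does in the real code, takes the first branch; file formats take the second.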

CreatableRelationProvider.createRelation

Next, look at the createRelation method. CreatableRelationProvider is an interface (a Scala trait); what actually runs here is createRelation of its implementation, JdbcRelationProvider.

trait CreatableRelationProvider {
  /**
   * Saves a DataFrame to a destination (using data source-specific parameters)
   *
   * @param sqlContext SQLContext
   * @param mode specifies what happens when the destination already exists
   * @param parameters data source-specific parameters
   * @param data DataFrame to save (i.e. the rows after executing the query)
   * @return Relation with a known schema
   *
   * @since 1.3.0
   */
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}

JdbcRelationProvider.createRelation

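Below is the method that finally does the work. It first checks whether the table exists; in our scenario it does. It then pattern-matches on mode, which is Overwrite here, and reaches the first if statement. Because the program above sets truncate to true, options.isTruncate holds; the same condition also requires isCascadingTruncateTable(options.url) == Some(false), so this check must pass for the MySQL URL as well. When both hold, truncateTable runs first and deletes the table data without removing the table structure, then saveTable writes the DataFrame into the table, which gives the overwrite.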

override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      df: DataFrame): BaseRelation = {
    val options = new JDBCOptions(parameters)
    val isCaseSensitive = sqlContext.conf.caseSensitiveAnalysis

    val conn = JdbcUtils.createConnectionFactory(options)()
    try {
      val tableExists = JdbcUtils.tableExists(conn, options)
      if (tableExists) { // first check whether the table exists
        mode match {
          case SaveMode.Overwrite => // mode is Overwrite
            if (options.isTruncate && isCascadingTruncateTable(options.url) == Some(false)) { // truncate is true and truncation does not cascade
              // In this case, we should truncate table and then load.
              truncateTable(conn, options.table)        // truncate the table first
              val tableSchema = JdbcUtils.getSchemaOption(conn, options)
              saveTable(df, tableSchema, isCaseSensitive, options)  // then save the data
            } else {
              // Otherwise, do not truncate the table, instead drop and recreate it
              dropTable(conn, options.table)
              createTable(conn, df, options)
              saveTable(df, Some(df.schema), isCaseSensitive, options)
            }

          case SaveMode.Append =>
            val tableSchema = JdbcUtils.getSchemaOption(conn, options)
            saveTable(df, tableSchema, isCaseSensitive, options)

          case SaveMode.ErrorIfExists =>
            throw new AnalysisException(
              s"Table or view '${options.table}' already exists. SaveMode: ErrorIfExists.")

          case SaveMode.Ignore =>
            // With `SaveMode.Ignore` mode, if table already exists, the save operation is expected
            // to not save the contents of the DataFrame and to not change the existing data.
            // Therefore, it is okay to do nothing here and then just return the relation below.
        }
      } else {
        createTable(conn, df, options)
        saveTable(df, Some(df.schema), isCaseSensitive, options)
      }
    } finally {
      conn.close()
    }

    createRelation(sqlContext, parameters)
  }
}
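The branching above can be condensed into a small pure function, which makes the conditions easy to check case by case. This is a sketch of the Spark 2.2.1 logic shown above, not Spark code: JdbcWritePlan, plan and the step names are hypothetical, and cascadingTruncate models the result of isCascadingTruncateTable(url).

```scala
object JdbcWritePlan {
  sealed trait Mode
  case object Overwrite extends Mode
  case object Append extends Mode
  case object ErrorIfExists extends Mode
  case object Ignore extends Mode

  // Returns the ordered steps taken for a given combination of inputs.
  def plan(
      tableExists: Boolean,
      mode: Mode,
      truncate: Boolean,
      cascadingTruncate: Option[Boolean]): Seq[String] =
    if (!tableExists) Seq("CREATE", "INSERT")
    else mode match {
      // Truncate only when requested and known to be non-cascading.
      case Overwrite if truncate && cascadingTruncate.contains(false) =>
        Seq("TRUNCATE", "INSERT")
      case Overwrite     => Seq("DROP", "CREATE", "INSERT")
      case Append        => Seq("INSERT")
      case ErrorIfExists => Seq("ERROR")
      case Ignore        => Seq.empty
    }
}
```

With an existing table, Overwrite, truncate set and a non-cascading truncate, the plan is TRUNCATE then INSERT. If any of those conditions fails, Overwrite falls back to DROP, CREATE, INSERT, which is the path that replaces the hand-written table structure.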

truncateTable

Finally, look at how truncateTable is implemented; it simply executes a TRUNCATE TABLE statement:


/**
 * Truncates a table from the JDBC database.
 */
def truncateTable(conn: Connection, table: String): Unit = {
  val statement = conn.createStatement
  try {
    statement.executeUpdate(s"TRUNCATE TABLE $table")
  } finally {
    statement.close()
  }
}
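Why this preserves the table structure can be shown with a toy in-memory model. The sketch below is an illustration only: Table, truncateThenLoad and dropCreateThenLoad are hypothetical names, and the column types are made-up examples rather than what Spark would actually generate.

```scala
object TruncateVsDrop {
  // A toy table: (column name, column type) pairs plus rows of values.
  final case class Table(schema: Seq[(String, String)], rows: Seq[Seq[String]])

  // TRUNCATE path: rows are replaced, the existing schema is kept as-is.
  def truncateThenLoad(existing: Table, newRows: Seq[Seq[String]]): Table =
    existing.copy(rows = newRows)

  // DROP + CREATE path: the table is rebuilt from the DataFrame's schema.
  def dropCreateThenLoad(
      dataFrameSchema: Seq[(String, String)],
      newRows: Seq[Seq[String]]): Table =
    Table(dataFrameSchema, newRows)
}

val handWritten = TruncateVsDrop.Table(
  Seq("id" -> "VARCHAR(32)", "czsj" -> "DATE"),
  Seq(Seq("old", "2020-01-01")))
val newRows = Seq(Seq("new", "2021-04-29"))

// The hand-written column types survive the truncate path.
println(TruncateVsDrop.truncateThenLoad(handWritten, newRows).schema)
```

In the drop-and-recreate path the resulting schema is whatever is derived from the DataFrame, so details written by hand in the original CREATE TABLE statement are not carried over.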

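Summary

This article explained how Spark can overwrite a MySQL table without dropping the table structure, and traced the source code to understand how this works. At the code level, the change is one added setting, .option("truncate", true). At the source level, the main logic first checks whether the table exists; if it does, it checks whether truncate is true, and if so it does not drop the table but executes TRUNCATE TABLE to clear the data and then writes to the table, which meets our requirement.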

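This article was published by 董可伦 on 伦少的博客, licensed under Attribution-NonCommercial-NoDerivs 3.0.

For non-commercial reposts, please credit the author and the source. For commercial reposts, please contact the author.

Article title: Overwriting a MySQL table with Spark without changing the existing table structure

Article link: https://dongkelun.com/2021/04/29/SparkMysqlOverwriteTruncateTable/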
