TiSpark with HDFS
shiyuhang0 edited this page Jun 20, 2022 · 1 revision
This article shows how to use TiSpark with HDFS.
- create a TiDB table
CREATE TABLE `test`.`tidb` (
  `id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
);
- create data.csv, using 0x01 as the delimiter and \n as the newline, for example:
2^Ashi
3^Ayu
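The 0x01 byte is hard to type in most editors (the ^A shown above is its caret notation). One way to produce the file is sketched below, assuming a POSIX shell where `printf` and `od` are available:

```shell
# Write the two sample rows; \001 is the octal escape for the 0x01 delimiter byte.
printf '2\001shi\n3\001yu\n' > data.csv

# Dump the bytes to confirm the delimiter: 001 should appear between the two fields.
od -c data.csv
```

Checking the bytes with `od` before uploading is cheap and catches a delimiter that was accidentally saved as the two literal characters `^` and `A`.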
- load the CSV into HDFS
hdfs dfs -put data.csv /
- start spark-shell
./bin/spark-shell --jars tispark-assembly-3.0-2.5.1.jar
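The statements on this page refer to `tidb_catalog`, which is only available once the TiSpark extension and catalog are configured in Spark. The start command above does not show that configuration, so the following is a sketch under assumptions: the four property keys are the ones described in the TiSpark 3.x documentation, `tispark.conf` is a hypothetical file name, and `${pd_ip:port}` is a placeholder for your PD address.

```shell
# Write the TiSpark settings to a properties file (hypothetical name: tispark.conf).
# The quoted 'EOF' keeps ${pd_ip:port} literal; replace it with your PD address by hand.
cat > tispark.conf <<'EOF'
spark.sql.extensions                         org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses                   ${pd_ip:port}
spark.sql.catalog.tidb_catalog               org.apache.spark.sql.catalyst.catalog.TiCatalog
spark.sql.catalog.tidb_catalog.pd.addresses  ${pd_ip:port}
EOF

# Then start the shell with that file (left commented out; it needs a Spark installation):
# ./bin/spark-shell --jars tispark-assembly-3.0-2.5.1.jar --properties-file tispark.conf
```

If these settings already live in `conf/spark-defaults.conf`, the extra file is not needed. Note that `--properties-file` is read instead of the default file, not in addition to it.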
Read from HDFS and write to TiDB. You can also write to TiDB with the Spark JDBC DataSource.
import org.apache.spark.sql.types._
// read from hdfs
val schema = new StructType().add("id",IntegerType).add("name",StringType)
val df = spark.read.format("csv").option("delimiter","\u0001").schema(schema).load("hdfs://${ip:port}/data.csv")
// write to tidb
val tidbOptions = Map(
"tidb.addr" -> "${ip}",
"tidb.password" -> "",
"tidb.port" -> "4000",
"tidb.user" -> "root"
)
df.write.format("tidb").options(tidbOptions).option("database", "test").option("table", "tidb").mode("append").save()
// to update rows whose primary key already exists instead of failing, set replace=true
df.write.format("tidb").options(tidbOptions).option("database", "test").option("table", "tidb").option("replace","true").mode("append").save()
Read from TiDB and write to HDFS. You can also read from TiDB with the Spark JDBC DataSource.
// read from tidb
val df = spark.sql("select * from tidb_catalog.test.tidb")
// write to hdfs
df.write.format("csv").option("delimiter","\u0001").save("hdfs://${ip:port}/save")
Spark can read compressed files from HDFS directly. For example, with a gz file:
val schema = new StructType().add("id",IntegerType).add("name",StringType)
val df = spark.read.format("csv").option("delimiter","\u0001").schema(schema).load("hdfs://${ip:port}/data.csv.gz")
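To try the gz example, a compressed copy of the sample file has to exist in HDFS first. A minimal sketch, assuming the same two-row data.csv as in the setup steps and a standard `gzip`:

```shell
# Recreate the sample file so this snippet runs on its own.
printf '2\001shi\n3\001yu\n' > data.csv

# Compress to data.csv.gz while keeping the original (-c writes to stdout).
gzip -c data.csv > data.csv.gz

# Check that decompressing gives back the original bytes.
gzip -dc data.csv.gz | cmp - data.csv && echo "round trip ok"

# Upload it the same way as the uncompressed file (commented out; needs an HDFS client):
# hdfs dfs -put data.csv.gz /
```

Spark selects the decompression codec from the `.gz` file extension, so no extra read option is needed. A gzip file is not splittable, so each one is read by a single task.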