TiSpark FAQ

shiyuhang0 edited this page Sep 20, 2022 · 7 revisions

FAQ

Q: What are the pros and cons of independent deployment as opposed to a shared resource with an existing Spark / Hadoop cluster?

A: You can use the existing Spark cluster without a separate deployment, but if the existing cluster is busy, TiSpark will not be able to achieve the desired speed.

Q: Can I deploy Spark on the same nodes as TiKV?

A: If TiDB and TiKV are heavily loaded and run critical online tasks, consider deploying TiSpark separately.

You should also consider using separate NICs so that TiSpark traffic does not consume the network resources that OLTP needs and the online business is not affected.

If the online business requirements are not demanding or the load is not heavy, you can deploy TiSpark on the same nodes as TiKV.

Q: How do I use PySpark with TiSpark?

A: Follow the TiSpark on PySpark guide.

Q: What can I do if warning: WARN ObjectStore:568 - Failed to get database is returned when executing SQL statements using TiSpark?

A: You can ignore this warning. It occurs because Spark tries to load two nonexistent databases (default and global_temp) in its catalog. To mute the warning, add log4j.logger.org.apache.hadoop.hive.metastore.ObjectStore=ERROR to the log4j file in tispark/conf, or to the log4j file in the conf directory under Spark. If that file only exists with a .template suffix, use the mv command to rename it so that it ends in .properties.
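The steps in this answer can be sketched as a small shell helper. This is only an illustration under a few assumptions: the function name mute_objectstore_warning and its conf_dir argument are hypothetical, the file is assumed to be named log4j.properties (log4j 1.x style, as in the answer above), and the template is copied rather than moved so that the original stays in place.

```shell
# Hypothetical helper: conf_dir is the conf directory that holds the log4j file
# (for example tispark/conf or the conf directory under Spark).
mute_objectstore_warning() {
  conf_dir="$1"
  props="$conf_dir/log4j.properties"
  # If only the .template file exists, create the .properties file from it.
  if [ ! -f "$props" ] && [ -f "$props.template" ]; then
    cp "$props.template" "$props"
  fi
  line='log4j.logger.org.apache.hadoop.hive.metastore.ObjectStore=ERROR'
  # Append the logger setting only if it is not already present.
  grep -qxF "$line" "$props" 2>/dev/null || printf '%s\n' "$line" >> "$props"
}
```

Calling the helper twice on the same directory leaves a single copy of the line, so it is safe to re-run.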

Q: What can I do if java.sql.BatchUpdateException: Data Truncated is returned when executing SQL statements using TiSpark?

A: This error occurs because the data being written is longer than the column's data type allows. Check the column length defined in the table and adjust either the column definition or the data accordingly.

Q: Does TiSpark read Hive metadata by default?

A: By default, TiSpark searches for the Hive database by reading the Hive metadata in hive-site. If the search task fails, it searches for the TiDB database instead, by reading the TiDB metadata.

If you do not need this default behavior, do not configure the Hive metadata in hive-site.

Q: What can I do if Error: java.io.InvalidClassException: com.pingcap.tikv.region.TiRegion; local class incompatible: stream classdesc serialVersionUID ... is returned when TiSpark is executing a Spark task?

A: The error message shows a serialVersionUID conflict, which occurs when the TiRegion class is serialized by one version and deserialized by another. Because TiRegion only exists in TiSpark, the likely cause is that multiple versions of the TiSpark package are in use. To fix this error, make sure that every node in the cluster uses the same version of the TiSpark dependency.
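One way to check for mixed versions is to list the TiSpark jars present on each node and compare the results. The sketch below is an assumption-laden illustration: the helper name list_tispark_jars is hypothetical, and it assumes that TiSpark jar file names start with tispark and live under a directory such as $SPARK_HOME/jars.

```shell
# Hypothetical helper: prints the distinct TiSpark jar file names found under a directory.
# Assumes the jar file names match the pattern tispark*.jar.
list_tispark_jars() {
  find "$1" -name 'tispark*.jar' -exec basename {} \; | sort -u
}
```

Running list_tispark_jars "$SPARK_HOME/jars" on every node should print the same single file name everywhere; more than one line, or different output between nodes, suggests a version mismatch.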

Error

[Error] NoSuchDatabaseException when upgrading to TiSpark 3.x

With TiSpark 3.x, you must specify the catalog.

// $database and $table are placeholders for your database and table names
//1. with catalog prefix
spark.sql("select * from tidb_catalog.$database.$table")

//2. use catalog
spark.sql("use tidb_catalog")
spark.sql("select * from $database.$table")

https://github.com/pingcap/tispark/wiki/Getting-Started#use-with-spark_catalog

[Error] java.lang.NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V

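This error usually means that a library built for one Scala version is running on another (for example, a Scala 2.12 build on a Scala 2.11 runtime). Check your Scala version and choose the matching version of TiSpark: https://github.com/pingcap/tispark/wiki/Getting-TiSpark#choose-the-version-of-tispark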

[Error] Batch scan are not supported / Table does not support reads

This error may occur when the following configuration is missing:

spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_address}
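To confirm whether the two keys above are set, a quick check of the Spark properties file can help. The following is a minimal sketch: the helper name check_tispark_conf is hypothetical, and it assumes the settings live in a properties file such as spark-defaults.conf with one key per line, separated from the value by whitespace or an equals sign. Keys passed on the command line with --conf are not covered.

```shell
# Hypothetical helper: reports which of the two required keys are missing from a Spark properties file.
check_tispark_conf() {
  conf_file="$1"
  for key in spark.sql.extensions spark.tispark.pd.addresses; do
    # A key counts as present when it is the first field of some line (comment lines never match).
    awk -F'[ \t=]+' -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$conf_file" \
      || echo "missing: $key"
  done
}
```

For example, check_tispark_conf "$SPARK_HOME/conf/spark-defaults.conf" prints nothing when both keys are present and one "missing: ..." line for each key that is absent.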

Netty OutOfDirectMemoryError

Netty's PoolThreadCache may hold some unused memory, which may cause the following error.

Caused by: shade.io.netty.handler.codec.DecoderException: shade.io.netty.util.internal.OutOfDirectMemoryError

The following configurations can be used to avoid the error.

--conf "spark.driver.extraJavaOptions=-Dshade.io.netty.allocator.type=unpooled"
--conf "spark.executor.extraJavaOptions=-Dshade.io.netty.allocator.type=unpooled"

Chinese characters are garbled

The following configurations can be used to avoid garbled Chinese characters.

--conf "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8"
--conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8"
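Note that this fix and the Netty fix above set the same two Spark properties, and a later --conf for a given key replaces an earlier one. If both fixes are needed, the JVM flags have to go into a single space-separated value. A minimal sketch, where join_java_opts is a hypothetical helper and the spark-submit line is shown only as a comment:

```shell
# Hypothetical helper: joins its arguments into one space-separated string.
join_java_opts() {
  printf '%s\n' "$*"
}

opts=$(join_java_opts -Dshade.io.netty.allocator.type=unpooled -Dfile.encoding=UTF-8)
# Then pass the same value for both properties, for example:
#   spark-submit --conf "spark.driver.extraJavaOptions=$opts" \
#                --conf "spark.executor.extraJavaOptions=$opts" ...
```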

gRPC message exceeds maximum size error

The gRPC Java library limits the size of a message to about 2 GB (2147483647 bytes). The following error is thrown if a region in TiKV is larger than this limit.

Caused by: shade.io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 2147483647

Use SHOW TABLE [table_name] REGIONS [WhereClauseOptional] to check whether there is a huge region in TiKV.

Others

How to upgrade from Spark 2.1 to Spark 2.3/2.4

If you use Spark 2.1 and want to upgrade to the latest TiSpark version on Spark 2.3/2.4, download or install Spark 2.3+/2.4+ by following the instructions on the Apache Spark site, and overwrite the old Spark version in $SPARK_HOME.