TiSpark FAQ
Q: What are the pros and cons of independent deployment as opposed to a shared resource with an existing Spark / Hadoop cluster?
A: You can use the existing Spark cluster without a separate deployment, but if the existing cluster is busy, TiSpark will not be able to achieve the desired speed.
Q: Can I mix Spark with TiKV?
A: If TiDB and TiKV are overloaded and run critical online tasks, consider deploying TiSpark separately.
Also consider using separate NICs so that TiSpark traffic does not consume the network resources of OLTP and affect the online business.
If the online business requirements are not strict or the load is light, you can deploy TiSpark together with TiKV.
Q: How to use PySpark with TiSpark?
A: Follow TiSpark on PySpark.
Q: What can I do if the warning WARN ObjectStore:568 - Failed to get database is returned when executing SQL statements using TiSpark?
A: You can ignore this warning. It occurs because Spark tries to load two nonexistent databases (default and global_temp) in its catalog. To mute this warning, add log4j.logger.org.apache.hadoop.hive.metastore.ObjectStore=ERROR to the log4j file in tispark/conf. You can also add this parameter to the log4j file in the config directory under Spark. If the file name has the suffix template, use the mv command to change the suffix to properties.
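The steps in this answer can be sketched as a small shell helper. This is only an illustration: the function name mute_objectstore_warning is hypothetical, and it assumes a log4j 1.x style file named log4j.properties (or log4j.properties.template) in the directory you pass in.

```shell
# Sketch: silence the ObjectStore warning in a given conf directory.
# mute_objectstore_warning is a hypothetical helper, not part of TiSpark or Spark.
mute_objectstore_warning() {
  conf_dir="$1"
  props="$conf_dir/log4j.properties"
  line='log4j.logger.org.apache.hadoop.hive.metastore.ObjectStore=ERROR'

  # If only the template exists, rename it so that it becomes the active file.
  if [ ! -f "$props" ] && [ -f "$props.template" ]; then
    mv "$props.template" "$props"
  fi

  # Append the logger setting once; do nothing if the exact line is already there.
  if ! grep -qxF "$line" "$props" 2>/dev/null; then
    printf '%s\n' "$line" >> "$props"
  fi
}
```

For example, mute_objectstore_warning "$SPARK_HOME/conf" would apply it to the Spark config directory. Running it twice does not duplicate the line.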
Q: What can I do if java.sql.BatchUpdateException: Data Truncated is returned when executing SQL statements using TiSpark?
A: This error occurs because the length of the data written exceeds the length of the data type defined by the database. You can check the field length and adjust it accordingly.
Q: Does TiSpark read Hive metadata by default?
A: By default, TiSpark searches for the Hive database by reading the Hive metadata in hive-site. If the search task fails, it searches for the TiDB database instead, by reading the TiDB metadata.
If you do not need this default behavior, do not configure the Hive metadata in hive-site.
Q: What can I do if Error: java.io.InvalidClassException: com.pingcap.tikv.region.TiRegion; local class incompatible: stream classdesc serialVersionUID ... is returned when TiSpark is executing a Spark task?
A: The error message shows a serialVersionUID conflict, which occurs because different versions of the TiRegion class are in use. Because TiRegion only exists in TiSpark, multiple versions of the TiSpark package are probably present. To fix this error, make sure the version of the TiSpark dependency is consistent among all nodes in the cluster.
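One way to look for mixed versions is to list the TiSpark jars on each node and compare the results. The helper below is a minimal sketch: list_tispark_jars is a hypothetical name, and it assumes the jar file names start with tispark, which may not match your packaging.

```shell
# Sketch: print the unique TiSpark jar file names found under a directory.
# list_tispark_jars is a hypothetical helper; the tispark*.jar pattern is an assumption.
list_tispark_jars() {
  find "$1" -type f -name 'tispark*.jar' -exec basename {} \; | sort -u
}
```

For example, run list_tispark_jars "$SPARK_HOME/jars" on every node. More than one line of output, or different output on different nodes, suggests that several TiSpark versions are deployed.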
With TiSpark 3.x, you must specify the catalog.
// 1. With the catalog prefix
spark.sql("select * from tidb_catalog.$database.table")
// 2. Switch to the catalog first
spark.sql("use tidb_catalog")
spark.sql("select * from $database.table")
https://github.com/pingcap/tispark/wiki/Getting-Started#use-with-spark_catalog
Check the Scala version and choose the matching version of TiSpark: https://github.com/pingcap/tispark/wiki/Getting-TiSpark#choose-the-version-of-tispark
This may occur when the following configurations are missing:
spark.sql.extensions org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses ${your_pd_address}
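A quick way to confirm that both keys are present is to search the Spark configuration file for them. The following is a sketch only: check_tispark_conf is a hypothetical helper, and it assumes the settings live in a properties-style file such as spark-defaults.conf rather than being passed on the command line.

```shell
# Sketch: report which of the two required keys are missing from a config file.
# check_tispark_conf is a hypothetical helper, not part of TiSpark or Spark.
check_tispark_conf() {
  conf_file="$1"
  missing=0
  for key in spark.sql.extensions spark.tispark.pd.addresses; do
    # Match the key at the start of a line, followed by whitespace or "=".
    # Commented-out lines (starting with "#") do not match.
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]=]" "$conf_file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_tispark_conf "$SPARK_HOME/conf/spark-defaults.conf" prints nothing and returns 0 when both keys are set.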
Netty's PoolThreadCache may hold some unused memory, which may cause the following error.
Caused by: shade.io.netty.handler.codec.DecoderException: shade.io.netty.util.internal.OutOfDirectMemoryError
The following configurations can be used to avoid the error.
--conf "spark.driver.extraJavaOptions=-Dshade.io.netty.allocator.type=unpooled"
--conf "spark.executor.extraJavaOptions=-Dshade.io.netty.allocator.type=unpooled"
The following configurations can be used to avoid garbled Chinese characters.
--conf "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8"
--conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8"
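The encoding flag and the Netty allocator flag shown earlier both use the keys spark.driver.extraJavaOptions and spark.executor.extraJavaOptions. Spark keeps a single value per configuration key, so if you need both flags, it is safer to put them into one value per key. A minimal sketch follows; the variable names are illustrative only.

```shell
# Sketch: combine both JVM flags into one value so neither setting replaces the other.
# JAVA_OPTS, DRIVER_CONF and EXECUTOR_CONF are illustrative variable names.
JAVA_OPTS="-Dshade.io.netty.allocator.type=unpooled -Dfile.encoding=UTF-8"
DRIVER_CONF="spark.driver.extraJavaOptions=$JAVA_OPTS"
EXECUTOR_CONF="spark.executor.extraJavaOptions=$JAVA_OPTS"

# Example invocation, commented out so the snippet runs without Spark installed:
# spark-shell --conf "$DRIVER_CONF" --conf "$EXECUTOR_CONF"
```

The quotes around each variable matter, because the combined value contains a space.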
The maximum message size of the gRPC Java library is 2 GB. The following error is thrown if a region in TiKV is larger than 2 GB.
Caused by: shade.io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 2147483647
Use SHOW TABLE [table_name] REGIONS [WhereClauseOptional] to check whether there is a huge region in TiKV.
Users of Spark 2.1 who want to upgrade to the latest TiSpark version on Spark 2.3/2.4 should download or install Spark 2.3+/2.4+ by following the instructions on the Apache Spark site, and then overwrite the old Spark version in $SPARK_HOME.