[Bug] orca test cases failed due to server closed the connection unexpectedly #669

Open
1 of 2 tasks
congxuebin opened this issue Oct 15, 2024 · 10 comments · May be fixed by #708
Labels: priority: High (After critical issues are fixed, these should be dealt with before any further issues.); type: Bug (Something isn't working); type: Orca (only orca has the issue)

Comments

@congxuebin
Contributor

congxuebin commented Oct 15, 2024

Cloudberry Database version

Cloudberry Database 1.7.0+dev.23.g200e3561 build 88554 commit:200e3561

What happened

+WARNING:  terminating connection because of crash of another server process
+server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+connection to server was lost

parallel group (8 tests): qp_executor qp_with_clause qp_olap_window qp_misc_jiras qp_olap_windowerr qp_bitmapscan qp_derived_table qp_dropped_cols
qp_misc_jiras ... FAILED (test process exited with exit code 2) 903 ms (diff 714 ms)
qp_with_clause ... FAILED (test process exited with exit code 2) 897 ms (diff 1164 ms)
qp_executor ... ok 206 ms (diff 85 ms)
qp_olap_windowerr ... FAILED (test process exited with exit code 2) 905 ms (diff 796 ms)
qp_olap_window ... FAILED (test process exited with exit code 2) 902 ms (diff 8355 ms)
qp_derived_table ... FAILED (test process exited with exit code 2) 913 ms (diff 13612 ms)
qp_bitmapscan ... FAILED (test process exited with exit code 2) 911 ms (diff 1912 ms)
qp_dropped_cols ... FAILED (test process exited with exit code 2) 916 ms (diff 1505 ms)
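
To pull only the failing test names out of a summary like the one above, a minimal sketch follows. It assumes the `name ... FAILED` line layout shown in this report; `failed_tests` is a hypothetical helper name, not part of the regression tooling.

```shell
# Hypothetical helper: list the tests that the regression summary marked FAILED.
# Assumes the "name ... FAILED (...)" layout shown in this report.
failed_tests() {
  # Reads the summary on stdin and prints one failed test name per line.
  # Field 1 is the test name, field 2 the literal "...", field 3 the status.
  awk '$2 == "..." && $3 == "FAILED" { print $1 }'
}
```

For example, `failed_tests < regress.log` would print the seven failing test names from the run above and skip `qp_executor`; the file name `regress.log` is an assumption for wherever the output was saved.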

What you think should happen instead

No response

How to reproduce

make -k PGOPTIONS='-c optimizer=on' installcheck-good

Operating System

centos7

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

@congxuebin congxuebin added type: Bug Something isn't working priority: High After critical issues are fixed, these should be dealt with before any further issues. labels Oct 15, 2024
@gfphoenix78
Contributor

@congxuebin could you provide more details about the crash?

@congxuebin
Contributor Author

congxuebin commented Oct 29, 2024

@gfphoenix78 Hi Hao,

The crash occurred while creating a table. However, running the qp_misc_jiras test case on its own does not reproduce the problem. You can reproduce it with the following steps.

PGOPTIONS='-c optimizer=on'
cd /code/cbdb_src/src/test/regress/results
make installcheck-good
CREATE TABLE qp_misc_jiras.tbl1544_child_depth_1_y_2000_year (
CONSTRAINT tbl1544_child_depth_1_y_2000_year_pdate_check
CHECK (pdate >= '2000-01-01'::date AND pdate < '2001-01-01'::date))
INHERITS (qp_misc_jiras.tbl1544)
;
NOTICE:  table has parent, setting distribution columns to match parent table
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

Uploading output-issue-669.zip…

@edespino
Contributor

edespino commented Nov 7, 2024

I also ran into this issue on Rocky Linux 8 & 9.

make installcheck PGOPTIONS='-c optimizer=on'

@my-ship-it - Do we know what change introduced this issue? Do we know when this will be fixed? We need to get this fixed as soon as possible.

@my-ship-it
Contributor

@gfphoenix78 Could you please help with this? Thanks!

@Smyatkin-Maxim

Just for the record:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=133989073607040) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=133989073607040) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=133989073607040, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x000079dcc3a42476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  0x000079dcc4b75c5f in StandardHandlerForSigillSigsegvSigbus_OnMainThread (processName=0x79dcc510ea43 "Master process", postgres_signal_arg=11) at elog.c:5377
#5  0x000079dcc49842ef in CdbProgramErrorHandler (postgres_signal_arg=11) at postgres.c:3812
#6  <signal handler called>
#7  0x000079dcc4dfc7d8 in gpopt::CExpression::DeriveHasNonScalarFunction (this=0x0) at CExpression.cpp:1512
#8  0x000079dcc4e0b99d in gpopt::CExpressionPreprocessor::PexprTransposeSelectAndProject (mp=0x59258040c620, pexpr=0x5925815de2b0) at CExpressionPreprocessor.cpp:2847
#9  0x000079dcc4e0b7c1 in gpopt::CExpressionPreprocessor::PexprTransposeSelectAndProject (mp=0x59258040c620, pexpr=0x5925815bc770) at CExpressionPreprocessor.cpp:2892
#10 0x000079dcc4e10b92 in gpopt::CExpressionPreprocessor::PexprPreprocess (mp=mp@entry=0x59258040c620, pexpr=pexpr@entry=0x592580a15cf0, pcrsOutputAndOrderCols=pcrsOutputAndOrderCols@entry=0x5925815b6760)
    at CExpressionPreprocessor.cpp:3179
#11 0x000079dcc4dc780f in gpopt::CQueryContext::CQueryContext (this=0x5925815b6640, mp=0x59258040c620, pexpr=0x592580a15cf0, prpp=<optimized out>, colref_array=0x5925815b5f60, 
    pdrgpmdname=<optimized out>, fDeriveStats=true) at CQueryContext.cpp:65
#12 0x000079dcc4dc7d9d in gpopt::CQueryContext::PqcGenerate (mp=mp@entry=0x59258040c620, pexpr=pexpr@entry=0x592580a15cf0, pdrgpulQueryOutputColRefId=<optimized out>, 
    pdrgpmdname=pdrgpmdname@entry=0x59258065c7b0, fDeriveStats=fDeriveStats@entry=true) at CQueryContext.cpp:259
#13 0x000079dcc4e6e25c in gpopt::COptimizer::PdxlnOptimize (mp=mp@entry=0x59258040c620, md_accessor=md_accessor@entry=0x7fffbee5ce30, query=query@entry=0x592580614f00, 
    query_output_dxlnode_array=query_output_dxlnode_array@entry=0x592580614c70, cte_producers=cte_producers@entry=0x592580734b00, pceeval=pceeval@entry=0x592580831d60, ulHosts=3, ulSessionId=315, 
    ulCmdId=34, search_stage_array=0x0, optimizer_config=0x592580615bb8, szMinidumpFileName=0x0) at COptimizer.cpp:297
#14 0x000079dcc4f6b016 in COptTasks::OptimizeTask (ptr=<optimized out>) at COptTasks.cpp:573
#15 0x000079dcc4c7930d in gpos::CTask::Execute (this=this@entry=0x5925805706c0) at CTask.cpp:130
#16 0x000079dcc4c7a2da in gpos::CWorker::Execute (this=0x7fffbee5d470, task=task@entry=0x5925805706c0) at CWorker.cpp:80
#17 0x000079dcc4c78a60 in gpos::CAutoTaskProxy::Execute (this=this@entry=0x7fffbee5d4a0, task=task@entry=0x5925805706c0) at CAutoTaskProxy.cpp:286
#18 0x000079dcc4c7aeb2 in gpos_exec (params=0x7fffbee5d530) at _api.cpp:237
#19 0x000079dcc4f692c7 in COptTasks::Execute (func=0x79dcc4f6ac90 <COptTasks::OptimizeTask(void*)>, func_arg=0x7fffbee5d5b0) at COptTasks.cpp:234
#20 0x000079dcc4f6a2cc in COptTasks::GPOPTOptimizedPlan (query=query@entry=0x5925805a8f90, gpopt_context=gpopt_context@entry=0x7fffbee5d5b0) at COptTasks.cpp:770
#21 0x000079dcc4f6c25f in CGPOptimizer::GPOPTOptimizedPlan (query=0x5925805a8f90, had_unexpected_failure=0x7fffbee5d647) at CGPOptimizer.cpp:58
#22 0x000079dcc4839340 in optimize_query (parse=0x592580839180, cursorOptions=2048, boundParams=0x0) at orca.c:160
#23 0x000079dcc481a488 in standard_planner (parse=0x592580839180, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at planner.c:392
#24 0x000079dcc481a33f in planner (parse=0x592580839180, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at planner.c:333
#25 0x000079dcc497f0c2 in pg_plan_query (querytree=0x592580839180, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at postgres.c:995
#26 0x000079dcc497f21e in pg_plan_queries (querytrees=0x59258048ff28, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at postgres.c:1087
#27 0x000079dcc4980d17 in exec_simple_query (
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"...) at postgres.c:1854
#28 0x000079dcc4986db6 in PostgresMain (argc=1, argv=0x7fffbee5dcb0, dbname=0x592580225ba0 "regression", username=0x592580225b80 "smiatkin") at postgres.c:5595
#29 0x000079dcc48a7a3f in BackendRun (port=0x592580218720) at postmaster.c:5126
#30 0x000079dcc48a719f in BackendStartup (port=0x592580218720) at postmaster.c:4830
#31 0x000079dcc48a26aa in ServerLoop () at postmaster.c:2051
#32 0x000079dcc48a1c06 in PostmasterMain (argc=7, argv=0x5925801f1b10) at postmaster.c:1676
#33 0x000059258001baaa in main (argc=7, argv=0x5925801f1b10) at main/main.c:270
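
In a gdb backtrace, the frames above the `<signal handler called>` marker belong to the signal handler itself, and the frame directly below the marker is where the fault occurred. A minimal sketch for extracting that frame from a saved backtrace follows; `crash_frame` is a hypothetical helper name.

```shell
# Hypothetical helper: print the frame in which the signal was raised.
# Reads gdb "bt" output on stdin.
crash_frame() {
  awk '
    /<signal handler called>/ { seen = 1; next }   # marker frame: remember it and skip
    seen && /^#[0-9]+/        { print; exit }      # first numbered frame after the marker
  '
}
```

Applied to the backtrace above, this prints frame #7, `gpopt::CExpression::DeriveHasNonScalarFunction (this=0x0)`. The `this=0x0` there indicates a member function called through a null `CExpression` pointer by frame #8.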

@leborchuk
Contributor

leborchuk commented Nov 8, 2024

I have not been able to check whether it helps, because the issue does not reproduce in my dev env (I am still trying), but I looked through the latest commits and saw that we cherry-picked Fix predicate pushdown using cast'd column (#13770)

but did not take [Fix qp_with_clause testcase without asserts (#13878)] (open-gpdb/gpdb@fad65d7),

where the bug at [CExpressionPreprocessor.cpp] line 2846 was fixed (see the backtrace Maxim provided).

It looks like we should cherry-pick #13878 too.
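
One way to check which upstream fixes are present on a branch is to compare commit subjects. The sketch below assumes cherry-picked commits keep their original subject lines, which may not hold if subjects were edited during the pick; `missing_subjects` is a hypothetical helper name.

```shell
# Hypothetical helper: report which expected commit subjects are absent.
# stdin: one commit subject per line (e.g. from `git log --format=%s`).
# args:  the subjects to look for, matched as fixed strings.
missing_subjects() {
  _log=$(cat)   # capture the subject list once so it can be searched repeatedly
  for subject in "$@"; do
    printf '%s\n' "$_log" | grep -qF -- "$subject" || printf '%s\n' "$subject"
  done
}
```

A possible invocation from the source tree would be `git log --format=%s | missing_subjects "Fix predicate pushdown using cast'd column" "Fix qp_with_clause testcase without asserts"`. Any subject it prints was not found in the history.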

@leborchuk
Contributor

Added #708 to launch tests while checking it in my env

@gfphoenix78
Contributor

> I have not been able to check whether it helps, because the issue does not reproduce in my dev env (I am still trying), but I looked through the latest commits and saw that we cherry-picked Fix predicate pushdown using cast'd column (#13770)
>
> but did not take [Fix qp_with_clause testcase without asserts (#13878)] (open-gpdb/gpdb@fad65d7),
>
> where the bug at [CExpressionPreprocessor.cpp] line 2846 was fixed (see the backtrace Maxim provided).
>
> It looks like we should cherry-pick #13878 too.

Thank you, @leborchuk. I'll check whether this patch fixes it. I can't reproduce the issue in my current env, so I will test on other envs.

@gfphoenix78 gfphoenix78 added the type: Orca only orca has the issue label Nov 8, 2024
@gfphoenix78
Contributor

> I also ran into this issue on Rocky Linux 8 & 9.
>
> make installcheck PGOPTIONS='-c optimizer=on'
>
> @my-ship-it - Do we know what change introduced this issue? Do we know when this will be fixed? We need to get this fixed as soon as possible.

Hi Ed, I couldn't reproduce this issue on my Rocky Linux 9. Can you still reproduce the crash in your env? If so, could you try @leborchuk's PR #708?

@edespino
Contributor

@gfphoenix78

You should be able to reproduce the issue by building HEAD of main with the following configure options. You will need to update it for your environment. FYI: I build xerces-c from source instead of pulling from epel and that is why my configure command is the way it is.

        cd ~/cloudberry
        export LD_LIBRARY_PATH=/usr/local/cloudberry-db/lib:$LD_LIBRARY_PATH
        ./configure --prefix=/usr/local/cloudberry-db \
                    --disable-external-fts \
                    --enable-gpcloud \
                    --enable-ic-proxy \
                    --enable-mapreduce \
                    --enable-orafce \
                    --enable-orca \
                    --enable-pxf \
                    --enable-tap-tests \
                    --with-gssapi \
                    --with-ldap \
                    --with-libxml \
                    --with-lz4 \
                    --with-openssl \
                    --with-pam \
                    --with-perl \
                    --with-pgport=5432 \
                    --with-python \
                    --with-pythonsrc-ext \
                    --with-ssl=openssl \
                    --with-openssl \
                    --with-uuid=e2fs \
                    --with-includes=/usr/local/xerces-c/include \
                    --with-libraries=/usr/local/cloudberry-db/lib | tee configure-$(date "+%Y.%m.%d-%H.%M.%S").log

Here is the command I use to execute installcheck:

make installcheck PGOPTIONS='-c optimizer=on' --directory=~/cloudberry

@leborchuk leborchuk linked a pull request Nov 10, 2024 that will close this issue
9 tasks

6 participants