Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for #24805: Do not unenlist XA transactions if related to the current Tx #25131

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

OndroMih
Copy link
Contributor

@OndroMih OndroMih commented Sep 6, 2024

Hopefully fixes #24805.

I think the problem with #24805 is:

  • Multiple resources (e.g. DB connections) are enlisted in a transaction
  • As resources are delisted, they are removed from the transaction but not delisted
  • As transaction commits, resources are delisted
  • As resources are closed (released to the pool), they are delisted if they are not the non-XA resource in the current transaction

I suspect that sometimes XA resources are closed, which unenlists them, and they are reused by another transaction before the previous transaction completes and delists the resources. Therefore a new transaction uses a resources, which is already marked as enlisted and doesn't attempt to enlist it again. This also means that the new transaction doesn't call the enlistResource method, and doesn't set the non-XA resource and keeps it null. This leads to a NullPtr later.

The NullPtr happens very rarely, I think only if a new transaction asks for the resource between the time when it's closed and previous Tx is completed, and only if the new transaction contains this single resource. Then no other resource calls the enlistResource method and the non-XA resource remains null, which leads to an exception when delisting the single resource in the new transaction.

This PR should keep all XA resources with the transaction if they are still associated with the transaction, and then delist all the resources at the end of the transaction. Otherwise resources are prematurely removed from Tx before Tx commits and delists them.

@escay, can you please try this?

@escay
Copy link
Contributor

escay commented Sep 7, 2024

@OndroMih Thank you, I will test it in the coming few days.

@escay
Copy link
Contributor

escay commented Sep 9, 2024

@OndroMih First tests seem to result in the same exception as before: NPE in ConnectorXAResource.

Caused by: java.lang.RuntimeException: Got exception during XAResource.end: javax.transaction.xa.XAException: com.sun.appserv.connectors.internal.api.PoolingException: java.lang.NullPointerException: Cannot invoke "com.sun.enterprise.resource.ResourceHandle.getResourceState()" because "resourceHandle" is null
	at com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.delistResource(JavaEETransactionManagerSimplified.java:522) ~[jta.jar:?]
	at com.sun.enterprise.resource.rm.ResourceManagerImpl.unregisterResource(ResourceManagerImpl.java:278) ~[connectors-runtime.jar:?]
	at com.sun.enterprise.resource.rm.ResourceManagerImpl.delistResource(ResourceManagerImpl.java:229) ~[connectors-runtime.jar:?]
	at com.sun.enterprise.resource.pool.PoolManagerImpl.resourceClosed(PoolManagerImpl.java:368) ~[connectors-runtime.jar:?]
	at com.sun.enterprise.resource.listener.LocalTxConnectionEventListener.connectionClosed(LocalTxConnectionEventListener.java:86) ~[connectors-runtime.jar:?]
	at com.sun.gjc.spi.ManagedConnectionImpl.connectionClosed(ManagedConnectionImpl.java:735) ~[jdbc-ra-ds.jar:?]
	at com.sun.gjc.spi.base.ConnectionHolder.close(ConnectionHolder.java:197) ~[jdbc-ra-ds.jar:?]
	at com.sun.gjc.spi.jdbc40.ConnectionHolder40.close(ConnectionHolder40.java:589) ~[jdbc-ra-ds.jar:?]
  • I build main codeline, 7.0.18 locally, installed it, ran test: test failed. (Just to be sure it is reproducable in 7.0.18)
  • I build your change, updated connectors-runtime module in 7.0.18 installation. (/glassfish/modules/connectors-runtime.jar replaced)
  • I made sure domains\mydomain\osgi-cache\felix was cleaned before I started the application server with my domain. (Not sure if this is needed or not).

I will do some more tests in the coming two days.

@escay
Copy link
Contributor

escay commented Sep 12, 2024

@OndroMih I changed your fix to add some info logging:

    /**
     * Check whether the local resource in question is the one participating in transaction.
     *
     * @param h ResourceHandle
     * @return true if the resource is participating in the transaction
     */
    public boolean isLocalResourceInTransaction(ResourceHandle handle) {
        boolean result = true;
        try {
            JavaEETransaction txn = (JavaEETransaction) ConnectorRuntime.getRuntime().getTransaction();
            if (txn != null) {
                boolean nonXAResourceInTransaction = isNonXAResourceInTransaction(txn, handle);
                boolean poolInfoContainsHandle = txn.getResources(poolInfo).contains(handle);
                _logger.log(Level.INFO, "handle: " + handle + ", nonXAResourceInTransaction: " + nonXAResourceInTransaction + ", poolInfoContainsHandle=" + poolInfoContainsHandle);
                result = nonXAResourceInTransaction && poolInfoContainsHandle;
            }
        } catch (SystemException e) {
            if (_logger.isLoggable(Level.FINE)) {
                _logger.log(Level.FINE, "Exception while checking whether the resource [ of pool : " + poolInfo + " ] "
                        + "is nonxa and is enlisted in transaction : ", e);
            }
        }
        return result;
    }

I can find:

[2024-09-12T11:05:55.422774+02:00] [GF 7.0.18-SNAPSHOT] [INFO] [] [jakarta.enterprise.resource.resourceadapter.com.sun.enterprise.resource.pool] [tid: _ThreadID=1 _ThreadName=main] [levelValue: 800] [[
  handle: 10, nonXAResourceInTransaction: false, poolInfoContainsHandle=true]]

[2024-09-12T11:05:55.422774+02:00] [GF 7.0.18-SNAPSHOT] [INFO] [] [jakarta.enterprise.resource.resourceadapter.com.sun.enterprise.resource.pool] [tid: _ThreadID=1 _ThreadName=main] [levelValue: 800] [[
  handle: 9, nonXAResourceInTransaction: true, poolInfoContainsHandle=true]]

but there are no cases logged with
poolInfoContainsHandle=false
So I do not think this will fix my situation sicne the && false is never applied.
(I also keep getting the original exception in 4 test runs, which confirms this as well.)

Level of order: about 658 calls to isLocalResourceInTransaction before my test fails.

Resource is in Tx if it's the non-XA resource attached to the Tx OR if it's attached to the poolInto that is attached to the Tx.

Maybe it's enough to keep only the second condition, it seems that non-XA resources are also in the poolInfo but not sure.
@OndroMih
Copy link
Contributor Author

OndroMih commented Sep 12, 2024

I looked at my code again and I got the logical condition wrong. It should be OR and not AND. The logic is: local resource is in transaction if it's a non-XA resource associated with the Tx, OR (not AND) it's a resource associated with the pool of resources in the Tx. In short, if the resource, whether it is XA or non-XA, is associated with the transaction. According to the data from the log, the whole condition would be true for both the XA resource (handle 10) and non-XA one (handle 9), and therefore all of them should be kept with the transaction until the transaction commits.

I think it would be enough to keep only the new condition txn.getResources(poolInfo).contains(handle) because it seems to be true also for non-XA resources. but first we need to know whether this fixes the issue or not.

@escay, can you try again with my latest change?

@escay
Copy link
Contributor

escay commented Sep 12, 2024

@OndroMih I will try it, but I wonder if it would help if all calls I see have poolInfoContainsHandle=true, meaning this method is called as if code was:

            if (txn != null) {
                result = true;
            }

but let me try.

Note I only see 3 occurrences logged with nonXAResourceInTransaction: false:

	handle: 10, nonXAResourceInTransaction: false, poolInfoContainsHandle=true]]
	handle: 10, nonXAResourceInTransaction: false, poolInfoContainsHandle=true]]
	handle: 10, nonXAResourceInTransaction: false, poolInfoContainsHandle=true]]

and this is logged before our application really started / before "Loading application .. done in .. ms" message.
Referring to another handle (10) as the thread with the exception (9):

[2024-09-12T12:28:13.343910+02:00] [GF 7.0.18-SNAPSHOT] [INFO] [] [jakarta.enterprise.resource.resourceadapter.com.sun.enterprise.resource.pool] [tid: _ThreadID=83 _ThreadName=http-listener-1(5)] [levelValue: 800] [[
  handle: 9, nonXAResourceInTransaction: true, poolInfoContainsHandle=true]]

[2024-09-12T12:28:13.345874+02:00] [GF 7.0.18-SNAPSHOT] [SEVERE] [poolmgr.system_exception] [jakarta.enterprise.resource.resourceadapter.com.sun.enterprise.resource] [tid: _ThreadID=83 _ThreadName=http-listener-1(5)] [levelValue: 1000] [[
  RAR5031:System Exception
java.lang.NullPointerException: Cannot invoke "com.sun.enterprise.resource.ResourceHandle.getResourceState()" because "resourceHandle" is null

@escay
Copy link
Contributor

escay commented Sep 12, 2024

@OndroMih Your change certainly seems to help 😍

I ran my test set 4 times in a row without redeployment: all ok.

  • 15229 times a call to isLocalResourceInTransaction.
  • No nullpointer exceptions.
  • No database resouce leaks detected
  • No resource "busy" state detected.

Code I used locally:

boolean nonXAResourceInTransaction = isNonXAResourceInTransaction(txn, handle);
boolean poolInfoContainsHandle = txn.getResources(poolInfo).contains(handle);
result = nonXAResourceInTransaction || poolInfoContainsHandle;
_logger.log(Level.INFO, "handle: " + handle + ", nonXAResourceInTransaction: " + nonXAResourceInTransaction + ", poolInfoContainsHandle: " + poolInfoContainsHandle + ", result: " + result);

@escay
Copy link
Contributor

escay commented Sep 12, 2024

@OndroMih I ran a bigger test set. About 380.000 calls so far to isLocalResourceInTransaction and everything still ok.
This set set uses all kinds of transactions in parallel.

@OndroMih
Copy link
Contributor Author

Thank you for your quick feedback, @escay. I'm glad this fixes the issue.

Yes, poolInfoContainsHandle is always true, as long as the resource is associated to the transaction. I'm not sure if the method is called under different conditions. In theory, something might want to reuse the resource after the transaction is completed and before the resource is put back to the pool. Then I guess poolInfoContainsHandle would be false and the resource would be delisted. Although, in that case, it should be already delisted after the transaction completes. The code is not easy to understand and it might be doing things that are not necessary. But at least we solved this issue with null pointer. I hope this solution doesn't have any side-effects. They would be hard to discover by any test suite and therefore I'm glad you were able to test it extensively.

I'll turn this into a proper pull request now.

@OndroMih OndroMih marked this pull request as ready for review September 13, 2024 15:21
@OndroMih
Copy link
Contributor Author

OndroMih commented Sep 13, 2024

The build failed twice in a row in an Ant step "execute-sql-common" on error connecting to Derby DB from Ant script. Anybody knows why it could be?

@OndroMih
Copy link
Contributor Author

Locally the test suite failed too, when I executed mvn clean install -T4C -Pfastest && ./runtests.sh jdbc_all. In the failed tests, I see a few errors in the server log:

  • I see the connection refused errors to Derby, even some communication error with Derby ("Insufficient data while reading from the network")
  • some null pointer exceptions in "Cleanup of a resource from pool" from ConnectionPool.cleanupResource
  • "Cannot use more than one local-tx resource in unshareable scope"

I ran the test suite without this change and the tests passed locally. It's either a side-effect of this fix, or a bug in Derby triggered by this fix.

@escay
Copy link
Contributor

escay commented Sep 27, 2024

I can reproduce the error locally, if I run mvn clean install -T4C -Pfastest && ./runtests.sh jdbc_all without the fix, the tests succeed.
If I run the tests with the fix they fail.
If I run the tests 2nd time with the fix they fail again on same item.
If I revert the fix the tests succeed.

First failure I see is in: test jdbc-connsharingnonxaApp, test starts to hang and tries every 1 minute to get a connection.
Exception: "Reason: No available resources and wait time 60000 ms expired." stacktrace-first-failing-test.txt

I plan to try and understand the test case. I have not seen NullPointerExceptions in the server.log

@escay
Copy link
Contributor

escay commented Oct 1, 2024

Single test reproduction also succeeds.
Change /glassfish/appserver/tests/appserv-tests/devtests/jdbc/build.xml to run only the single test "connsharing":

    <target name="run-tests">
	    <ant dir="connsharing" target="nonxa"/>
    </target>

and run ./runtests.sh jdbc_all

Test hangs on getting a connection from the pool.
The server log is available at /projects/glassfish/target/glassfish7/glassfish/domains/domain1/logs/server.log with the test output.

@escay
Copy link
Contributor

escay commented Oct 2, 2024

The test fails in

Bean1 SimpleSessionBean starts the transaction, gets a connection and makes 100 bean calls:

// Default transaction: Required -> new transaction is started
public boolean test6() throws Exception {
    conn1 = ds.getConnection();
    SimpleSession2 bean = home.create();
    for (int i = 0; i < 100; i++) {
        bean.test3();
    }
    conn1.close();
}
	
Bean2 SimpleSession2Bean: 

// Default transaction: Required -> existing local transaction is reused
public boolean test3() throws Exception {
    conn1 = ds.getConnection();
    conn1.close();
}

Without the fix: all 101 real database connection resources are the same, so eventually all 101 calls use the same database connection.

With the fix: all real database connection resources are also the same, but after call 31 (32 calls in total?) exception occurs:

[2024-10-02T12:13:57.147489+02:00] [GF 7.0.18-SNAPSHOT] [INFO] [] [jakarta.enterprise.logging.stdout] [tid: _ThreadID=116 _ThreadName=p: thread-pool-1; w: 2] [levelValue: 800] [[
  ### test3 - called from bean1 test(6) - bean2 conn1: org.apache.derby.client.net.NetConnection@7baa1d4d]]

[2024-10-02T12:13:57.152366+02:00] [GF 7.0.18-SNAPSHOT] [INFO] [] [jakarta.enterprise.logging.stdout] [tid: _ThreadID=116 _ThreadName=p: thread-pool-1; w: 2] [levelValue: 800] [[
  ### test3 - called from bean1 test(6) - bean2 conn1: org.apache.derby.client.net.NetConnection@7baa1d4d]]
  
[2024-10-02T12:14:57.156248+02:00] [GF 7.0.18-SNAPSHOT] [WARNING] [] [jakarta.enterprise.resource.resourceadapter.com.sun.gjc.spi] [tid: _ThreadID=116 _ThreadName=p: thread-pool-1; w: 2] [levelValue: 900] [[
  Error allocating connection: [Failed to obtain/create connection from connection pool [jdbc-connsharing-pool]. Reason: No available resources and wait time 60000 ms expired.]]]

[2024-10-02T12:14:57.156920+02:00] [GF 7.0.18-SNAPSHOT] [SEVERE] [] [jakarta.enterprise.logging.stderr] [tid: _ThreadID=116 _ThreadName=p: thread-pool-1; w: 2] [levelValue: 1000] [[
  java.sql.SQLException: Failed to obtain/create connection from connection pool [jdbc-connsharing-pool]. Reason: No available resources and wait time 60000 ms expired.
	at com.sun.gjc.spi.base.AbstractDataSource.getConnection(AbstractDataSource.java:105)
	at com.sun.s1asdev.jdbc.connsharing.nonxa.ejb.SimpleSession2Bean.test3(SimpleSession2Bean.java:112) // lines changed compared to main codeline due to adding some debug output in the test
	...
Caused by: jakarta.resource.spi.ResourceAllocationException: Failed to obtain/create connection from connection pool [jdbc-connsharing-pool]. Reason: No available resources and wait time 60000 ms expired.
	at com.sun.enterprise.connectors.ConnectionManagerImpl.internalGetConnection(ConnectionManagerImpl.java:307)
	at com.sun.enterprise.connectors.ConnectionManagerImpl.allocateConnection(ConnectionManagerImpl.java:237)
	at com.sun.enterprise.connectors.ConnectionManagerImpl.allocateConnection(ConnectionManagerImpl.java:157)
	at com.sun.enterprise.connectors.ConnectionManagerImpl.allocateConnection(ConnectionManagerImpl.java:151)
	at com.sun.gjc.spi.base.AbstractDataSource.getConnection(AbstractDataSource.java:99)
	... 93 more
Caused by: com.sun.appserv.connectors.internal.api.PoolingException: No available resources and wait time 60000 ms expired.
	at com.sun.enterprise.resource.pool.ConnectionPool.getResource(ConnectionPool.java:484)
	at com.sun.enterprise.resource.pool.PoolManagerImpl.getResourceFromPool(PoolManagerImpl.java:221)
	at com.sun.enterprise.resource.pool.PoolManagerImpl.getResource(PoolManagerImpl.java:150)
	at com.sun.enterprise.connectors.ConnectionManagerImpl.getResource(ConnectionManagerImpl.java:355)
	at com.sun.enterprise.connectors.ConnectionManagerImpl.internalGetConnection(ConnectionManagerImpl.java:298)
	... 97 more

Additional info: the datasource: "jdbc-connsharing-pool" does not seem to be initialized with a certain max pool size.

[2024-10-02T13:45:31.591021+02:00] [GF 7.0.18-SNAPSHOT] [INFO] [] [jakarta.enterprise.resource.resourceadapter.com.sun.enterprise.resource.pool] [tid: _ThreadID=113 _ThreadName=p: thread-pool-1; w: 1] [levelValue: 800] [[
  Created connection pool and added it to PoolManager: Pool [org.glassfish.resourcebase.resources.api.PoolInfo@b1ef9dca[jndiName=jdbc-connsharing-pool, applicationName=null, moduleName=null]] PoolSize=0  FreeResources=0  QueueSize=0 matching=o
ff validation=off]]

Max pool size default is configured in the code to be 32:

public class ConnectionPool implements ResourcePool, ConnectionLeakListener, ResourceHandler, PoolProperties {
    /**
     * Represents the "max-pool-size" configuration value.<br>
     * Specifies the maximum number of connections that can be created to satisfy client requests.<br>
     * Default: 32
     */
    protected int maxPoolSize;

This is in line with the failing step.

@escay
Copy link
Contributor

escay commented Oct 3, 2024

Regarding test6 failure, the logic in ConnectionPool.getResourceFromTransaction does not consider the 2nd database connection to be reused. This is caused by:
PoolTxHelper.isNonXAResourceInTransaction -> return resource.equals(tran.getNonXAResource());
The 1st connection resource is still 'busy' and this 2nd resource is not considered for reuse because tran.getNonXAResource() points to the 1st connection handle (which is still busy), this results in a new resource to be used used for each call in the for loop, and every one is added to the transaction and never released until the whole transaction is committed.
This is fine until 32 resources are used from the pool in the 100 loop.

Looking at the code comments, resource optimization was only ever implemented for the first resource in the transaction:

JavaEETransactionImpl -> "states object that provides optimized local transaction support when a transaction uses zero/one non-XA resource, and delegates to JTS otherwise."

and code confirms this behavior, on enlistment of a resource in the transaction only the 1st item is set as the nonXAResource, while the transaction can consist of more resources in the tran.getResources() collection:

JavaEETransactionManagerSimplified.enlistResource(Transaction tran, TransactionalResource h)
            if (tx.getNonXAResource() == null) {
                tx.setNonXAResource(h);
            }

So based on this code I think the test6 could be wrong: 100 calls are not supposed to work as long as connection 1 is still occupied. It would only work if connection1 could be reused.

If I change the test to

// Default transaction: Required -> new transaction is started
public boolean test6() throws Exception {
    // DISABLED conn1 = ds.getConnection();
    SimpleSession2 bean = home.create();
    for (int i = 0; i < 100; i++) {
        bean.test3();
    }
    conn1.close();
}

the test succeeds, bean.test3() now opens the first resource an can now reuse it due to the JavaEETransactionManagerSimplified logic and only 1 ResourceHandle is used.

====

When I try to lift the limitation resource optimization to apply to all resources in the transaction I looked at supporting getResourceFromTransaction for the second resource in the transaction.
I altered code:
PoolTxHelper.isNonXAResourceInTransaction -> return resource.equals(tran.getNonXAResource());
to look at all resources in the transaction:
PoolTxHelper.isNonXAResourceInTransaction -> return tran.getResources(poolInfo).contains(resource);

the logic in 'ConnectionPool.getResourceFromTransaction`

  • unaltered test6() then succeeds: 1st resource is busy, 2nd resource is reused by the getResourceFromTransaction code.

  • but now test4() fails with a database exception because of a database lock. Not sure yet if this is valid lock error from the database if connection 2 is used for both INSERT and UPDATE in one go.
    README of Scenario IV: -> insert some more data -> close c1,c2 -> call bean2 is not how it is implemented in the test.
    Changing test4 to do what the README says fixes test4. (close conn1, before conn2, and not close conn1 after conn2 in the final block)
    -- I thought test4 was not running correct -> update bean was performed outside of the transaction, but the test did not fail since it is a database update call which does not check if the record really exists and the test does not test how many records have been updated. But I tested it and rows are updated.

  • and test 5 fails.. which I do not yet understand, the test opens 4 connections, of which 3 remain open in parallel and expects 1 real datasource to be used in total, but I would expect 3 handles and three real connections to be used in on transaction. README does not describe the purpose of the test.
    Why should only 1 real connection be used? I assume because old logic forcefully unenlisted connection 2, leaving connection 1 and causing connection 3 and 4 to reuse connection 1.

====

I wrote earlier:
Without the fix: all 101 real database connection resources are the same, so eventually all 101 calls use the same database connection.
but those 100 connections where unenlisted from the transaction, and no other test is running so no problem.
But I think the fix is correct and that 'PoolTxHelper.isLocalResourceEligibleForReuse -> enforceDelistment' should not occur a 100 times in this test.

I also tried in 2018 to figure out how getConnection should work according to the Jakarta specification, because I saw different behavior in different application servers. I would love to see a pointer to the specification definitions explaining what to expect in the most basic REQUIRED transaction situation.
https://stackoverflow.com/questions/51630636/how-many-database-connections-from-a-pool-are-occupied-during-one-ejb-transactio

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NPE in ConnectorXAResource.getResourceHandle causes connection pool problems
2 participants