Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beta dashboard: only the first page of the projects list is accessible #11061

Closed
webknjaz opened this issue Jan 24, 2024 · 14 comments · Fixed by readthedocs/ext-theme#272
Closed
Assignees
Labels
Accepted Accepted issue on our roadmap Bug A bug

Comments

@webknjaz
Copy link
Contributor

Details

Expected Result

I want to see the second page of projects on the beta host.

Actual Result

https://beta.readthedocs.org/dashboard/ works, but clicking on the [2] at the bottom that attempts to open https://beta.readthedocs.org/dashboard/?page=2 gets stuck for half a minute, and then, it shows CloudFlare's 502: Bad gateway web page. This happens consistently so I suppose there's, an unhandled exception or a timeout somewhere in the new beta website.

@humitos humitos changed the title [bug][beta.readthedocs.org] Only the first page of the projects list is accessible in the dashboard Beta dashboard: only the first page of the projects list is accessible Jan 24, 2024
@humitos humitos added the Needed: replication Bug replication is required label Jan 24, 2024
@humitos
Copy link
Member

humitos commented Jan 24, 2024

Hrm, interesting. I wasn't able to reproduce this issue while logged in with my user. I thought it could be because you could have ton of projects, but I found that we have a similar amount. So, I'm discarding that.

I know we were facing some performance issues on this particular dashboard page (readthedocs/ext-theme#128), but I'm not sure why it could affect your user but not mine. cc @agjohnson any clue?

@webknjaz
Copy link
Contributor Author

Just checked from phone once again. It's still happening:
Screenshot_2024-01-24-17-23-01-93_40deb401b9ffe8e1df2f1cc5ba480b12.jpg

I suppose you don't have any impersonation capabilities in the web app to see it as me?

Since it's behind the CloudFlare CDN, I think I'm hitting different CF nodes than you do. But then, (some) other places in the dashboard work.

I don't know how it's deployed, but is it possible I'm routed to some other backend server? Maybe stick Sentry there?

@agjohnson
Copy link
Contributor

agjohnson commented Jan 24, 2024

I have actually noticed something similar with a couple of projects and the build list (loading the build list for one project gave a 5xx and for another project with more builds did not). I think there is something in both cases that seems to be causing a timeout on the application side. We should look into exceptions and logging probably.

@stsewd
Copy link
Member

stsewd commented Jan 24, 2024

@webknjaz
Copy link
Contributor Author

@stsewd FYI Sentry has a feature of sharing tracebacks. That would give you a link letting others see what's on the page. Not all, just a reduced portion without any sensitive data. That's useful for sharing publicly.

@humitos
Copy link
Member

humitos commented Jan 26, 2024

This is the sentry issue for the 500 when accessing a user profile

This seems to be a different error since it's a 500, and they are getting a 502.

This 500 error seems to be related to Build objects that are not attached to a particular Version for some weird reason. Well, I just found we have on_delete=SET_NULL here, which doesn't make too much sense to me, but I understand we may wanted to keep the "history of builds", but it's weird:

version = models.ForeignKey(
Version,
verbose_name=_('Version'),
null=True,
related_name='builds',
on_delete=models.SET_NULL,
)

In the templates, we are making the assumption that all the builds are attached to a version, when it's not true due to the SET_NULL cc @agjohnson

@agjohnson
Copy link
Contributor

Hrm yeah, that would explain the error I hit. Maybe I'm not recalling the error and it was a 500 error, not a timeout.

That is maybe a little tricky to solve, as the build list UI is relying more on the relationships between build/version -- in filtering, listing by version, etc. I'll have to think more on a fix there.

But yeah, that does seem like a different error than the timeout here. I'm guessing NR might have some information on the long transaction.

@stsewd
Copy link
Member

stsewd commented Jan 29, 2024

We set to null to keep the history of builds, yes. We populate some fields from the version into the build for any other operation we may need

if self.version:
self.version_name = self.version.verbose_name
self.version_slug = self.version.slug
self.version_type = self.version.type

@agjohnson
Copy link
Contributor

agjohnson commented Jan 31, 2024

I was having trouble tracking down any worthwhile information in New Relic. I think I see some of the transactions, and I see the build list SELECT taking a few seconds, but that's about it.

So I configured a build with a null version like @stsewd described and was able to reproduce both a 500 on the user profile, but also what would probably be a timeout on the dashboard project listing (response time was more the 10s). I did not see a long database select time here though.

While profiling this locally, I only noticed the slow response time that I've noted with some template errors. I'm not convinced that this would affect production, but I also can't say for sure that it doesn't.

image

agjohnson added a commit to readthedocs/ext-theme that referenced this issue Jan 31, 2024
When versions are removed, the builds remain with a `build.version = None`,
along with some additional attributes on the Build instance (for example
`build.version_name`).

This bug affected any display that listed projects/versions that listed
the most recent build.

Fixes readthedocs/readthedocs.org#11061
@agjohnson agjohnson self-assigned this Jan 31, 2024
@agjohnson agjohnson added Bug A bug Accepted Accepted issue on our roadmap and removed Needed: replication Bug replication is required labels Jan 31, 2024
@agjohnson
Copy link
Contributor

agjohnson commented Jan 31, 2024

So, I believe I fixed the underlying template bug triggering this behavior in:

But I think the server response might be instead this:

I say this because I can't see a reason why the PR above should fix a 502 response. I also noticed the same problem with inflated SQL queries: db time is ~1000ms without the PR above and 43ms with the fix. Also also, the offending template code was also in an include, like #10954.

If the PR above fixes the 502 response, the template bug I notice while working on templates might very well exist in production as well.

agjohnson added a commit to readthedocs/ext-theme that referenced this issue Jan 31, 2024
* Use build.version_name first, in case build.version is None

When versions are removed, the builds remain with a `build.version = None`,
along with some additional attributes on the Build instance (for example
`build.version_name`).

This bug affected any display that listed projects/versions that listed
the most recent build.

Fixes readthedocs/readthedocs.org#11061

* Use get_version_name method instead of template logic
@webknjaz
Copy link
Contributor Author

@agjohnson has it been deployed? I just checked again and I'm seeing exactly the same bug with HTTP 502. Nothing changed for me.

@agjohnson
Copy link
Contributor

@webknjaz It was not yet released, but I just pushed out a somewhat hacky release. The full release will go out on Tues.

I can load your profile page now. Are you able to load your dashboard page 2?

@webknjaz
Copy link
Contributor Author

webknjaz commented Feb 1, 2024

Yes, I can see both profile and dashboard. The pagination also works on both. Thanks!

@agjohnson
Copy link
Contributor

Great, I'm glad that fixed the issue!

I think that confirms that while the actual bug here was a minor template error, the underlying behavior you noticed with a timeout is actually a bug with the application exception handling or logging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap Bug A bug
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants