Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: --reporting-period flag fails with timescaledb-parallel-copy 0.6.0+ #96

Open
Dev10000 opened this issue Nov 29, 2024 · 3 comments

Comments

@Dev10000
Copy link

Description:

The --reporting-period flag, which worked correctly in versions 0.5.1 and earlier, no longer functions as expected in versions 0.6.0 and above. Additionally, the tool fails to provide useful SQL error messages altogether since version 0.7.0.

The expected behaviour is that the --reporting-period flag should report intermediate insert stats, and any database errors (e.g., nonexistent database) should return clear SQL error messages.

timescaledb-parallel-copy (linux amd64) 0.7.0 - both period flag and sql errors don't work
timescaledb-parallel-copy (linux amd64) 0.6.0 - reporting-period flag doesn't work
timescaledb-parallel-copy (linux amd64) 0.5.1 - all works no issues


Steps to Reproduce:

  1. go install github.com/timescale/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy@latest
  2. Run timescaledb-parallel-copy with the --reporting-period flag on version 0.6.0 or later:
    timescaledb-parallel-copy --db-name <dbname> --reporting-period 2s  ...
  3. Use a nonexistent database or another scenario where a clear SQL error is expected.
  4. Observe the behavior and compare it with version 0.5.1 or 0.4.0.

Expected Behavior:

  • The --reporting-period flag should work as it did in version 0.5.1. and report intermediate insert stats.
  • SQL errors (e.g., nonexistent database) should result in clear error messages, such as:
    panic: could not connect: failed to connect to `host=10.101.6.67 user=postgres database=insert-time-erro`: server error (FATAL: database "insert-time-erro" does not exist (SQLSTATE 3D000))
    

Actual Behavior:

  1. Versions 0.6.0+:

    • --reporting-period flag does not function.
    • SQL errors are reported as expected.
  2. Versions 0.7.0+:

    • --period flag still does not function.
    • SQL errors fail to report, and the tool instead produces Go runtime errors:
      fatal error: all goroutines are asleep - deadlock!
      

Environment:

  • Tool Versions Tested:
    • 0.4.0: Works correctly
    • 0.6.0: Regression begins (--reporting-period fails, SQL errors still reported)
    • 0.7.0: Regression worsens (--reporting-period fails, SQL errors not reported)
    • 0.8.0: Same issues as 0.7.0
  • OS: Linux (amd64)
  • Database: PostgreSQL 17, Timescale 2.17
  • go version: go1.23.2 linux/amd64

Logs/Stack Trace:

From version 0.7.0:

fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc00007e600?)
        /usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*WaitGroup).Wait(0x81c530?)
        /usr/local/go/src/sync/waitgroup.go:118 +0x48
github.com/timescale/timescaledb-parallel-copy/pkg/csvcopy.(*Copier).Copy(0xc0000141e0, {0x81c530, 0xa50200}, {0x818c00, 0xc00005c0f0})
        /home/john/go/pkg/mod/github.com/timescale/[email protected]/pkg/csvcopy/csvcopy.go:208 +0x4a9
main.main()
        /home/john/go/pkg/mod/github.com/timescale/[email protected]/cmd/timescaledb-parallel-copy/main.go:136 +0x525

Impact:

  • The regression significantly impacts workflows that rely on the --reporting-period flag.
  • Troubleshooting is hindered because SQL errors are not clearly reported, especially in later versions.

Additional Context:

  • Version 0.5.1 is the last version where everything functions as expected.
  • The issue seems to worsen with later versions, suggesting potential changes in error handling or the handling of the --period flag.

Suggested Fix:

  • Investigate changes in the --reporting-period flag implementation since version 0.6.0.
  • Ensure SQL error handling is consistent and reports clear messages, even in cases of database connection issues.

@MetalBlueberry
Copy link
Contributor

Thank you for the detailed explanation. It really helps to reproduce the issue and understand your situation.

I want to first point out that we already noticed the bug with progress report that was introduced on v0.6.0. There is a fix for it that was released on version v0.8.1-rc.1 with the PR #93

That version is still not released officially as we are still testing some other issues. In particular one that was reported last week due to a dead lock when the option implemented in the previous PR is used.

About the error when connecting to a database. The dead lock was also identified and fixed here #91 that is released on the same version mentioned above.
After that fix you should get meaningful errors when something goes wrong.

I'll fix the race condition and then all the functionality will be back to expected behavior.

@Dev10000
Copy link
Author

Dev10000 commented Dec 2, 2024

That's great to hear, as we are planning to use this in production for data ingestion.

My colleague has encountered some deadlock problems as well. Initially, we thought these were related to timescale itself, specifically upserts into compressed data as there were some deadlock issues regarding the compression, but it seems that the issue may have been fixed in timescale, maybe 2.16 version not sure.

Even on the latest timescale 2.17 version the timescaledb-parallel-copy tool still causes occasional deadlocks for my colleague, although I'm not sure which parallel-copy version. Due to the deadlock issues, we can't use parallel copy in production.

Do you know if parallel-copy is commonly used in production environments? I'm a little worried about these regression issues potentially affecting our production setup.

@MetalBlueberry
Copy link
Contributor

We started to use timescaledb-parallel-copy as a package on production recently for timescale cloud. This powers the feature to import csv files from the web UI. We added a couple features we needed, mainly around error reporting and package interaction. Those created the some deadlocks that we are aiming to resolve soon.

We are eager to make some improvements to the package interface so we can take it to a stable v1 version. Mainly about having a clean interface. But also about extending test coverage, so we do not have regressions.

Would you like to share some details on how you plan to use this on production? You may be able to use our API to import CSV data if you are using timescale cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants