[BUG] Tasks get stuck in Queued status #5927

Closed
2 tasks done
sorushsaghari opened this issue Oct 28, 2024 · 14 comments
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working

Comments

@sorushsaghari

Describe the bug

Tasks in the Flyte deployment are not executing and remain in either an unknown or queued state indefinitely. No task progresses to the running or completed state, effectively halting workflow execution.

Expected behavior

Tasks should transition from the queued state to running, followed by completion, provided that no errors or resource constraints are encountered.

Additional context to reproduce

1- Set up Flyte using the provided Helm configuration.
2- Trigger a workflow that contains at least one task.
3- Observe that the tasks remain in queued status without progressing.

Helm configuration:

flyte-core-components:
  admin:
    disabled: false
    disableScheduler: false
    disableClusterResourceManager: false
    seedProjects:
      - <project-name>

  propeller:
    disabled: false
    disableWebhook: false
  dataCatalog:
    disabled: false

deployment:
  image:
    repository: <docker-registry-url>/flyte-binary-release
    tag: v1.13.3
  resources:
    limits:
      memory: 4Gi
      cpu: 3
    requests:
      memory: 4Gi
      cpu: 2
  waitForDB:
    image:
      repository: <docker-registry-url>/postgres

configuration:
  database:
    username: <db-username>
    password: <db-password>
    host: <db-host>
    port: 5432
    dbname: <db-name>
    options: sslmode=disable

  storage:
    metadataContainer: <meta-container>
    userDataContainer: <user-container>
    provider: s3
    providerConfig:
      s3:
        disableSSL: true
        v2Signing: true
        authType: accesskey
        accessKey: <s3-access-key>
        secretKey: <s3-secret-key>
        endpoint: "<s3-endpoint>"

  logging:
    show-source: true
    level: 15

  auth:
    enabled: false
  co-pilot:
    image:
      repository: <docker-registry-url>/flytecopilot
      tag: v1.13.3

service:
  type: ClusterIP

ingress:
  create: true
  host: <flyte-host-url>
  separateGrpcIngress: true

rbac:
  create: true
  extraRules:
    - apiGroups:
        - "ray.io"
      resources:
        - rayclusters
        - rayjobs
        - rayservices
      verbs:
        - "*"

serviceAccount:
  create: true

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@sorushsaghari sorushsaghari added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Oct 28, 2024
@davidmirror-ops
Contributor

@sorushsaghari there should be a Pod in the corresponding namespace (maybe flytesnacks-development). Could you share the output of kubectl describe for that Pod?

@sorushsaghari
Author

sorushsaghari commented Oct 29, 2024

@davidmirror-ops there are no Pods in the -development, -staging, or -production namespaces of the project.
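
For reference, the namespaces being discussed here follow a project-plus-domain pattern, which matches the sre-data-prj-development namespace that shows up in the kubectl output further down this thread. A minimal Python sketch of the namespaces worth checking, assuming the default "{project}-{domain}" mapping and the three domains named in the comment above (the function name `expected_namespaces` is hypothetical, not part of Flyte):

```python
def expected_namespaces(project, domains=("development", "staging", "production")):
    """Return candidate namespaces for a project.

    Assumption: the deployment uses the '{project}-{domain}' naming
    pattern; a customized namespace mapping would give different names.
    """
    return [f"{project}-{domain}" for domain in domains]


if __name__ == "__main__":
    for ns in expected_namespaces("sre-data-prj"):
        # Each namespace can then be inspected with: kubectl get pods -n <namespace>
        print(ns)
```

If none of these namespaces contains a Pod, the problem is likely upstream of Pod creation, which is the direction the rest of the thread takes.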

@davidmirror-ops
Contributor

Got it, what about logs from the flyte-binary Pod?

@sorushsaghari
Author

> Got it, what about logs from the flyte-binary Pod?

Here's the log file:
binary.log

@davidmirror-ops
Contributor

OK, it looks like you're using namespaces other than the default (totally fine). Could you find a Pod in the corresponding namespace? (Maybe run kubectl get pods -A to start with.)

@sorushsaghari
Author

sorushsaghari commented Oct 30, 2024

@davidmirror-ops I have already done this and checked every possible namespace, but I can't find any Pod there.
My main problem is the logs: they are not descriptive, and I cannot find the problem from them.

@eapolinario eapolinario removed the untriaged This issues has not yet been looked at by the Maintainers label Oct 31, 2024
@nkwangleiGIT

Can you try 'kubectl get flyteworkflows -A' to see whether any workflow was created, and check the events of the workflow?

@sorushsaghari
Author

> Can you try 'kubectl get flyteworkflows -A' to see whether any workflow was created, and check the events of the workflow?

Nope, nothing found.

@sorushsaghari
Author

OK, the state now looks like this:

(ks3/sre-stage) ➜  workflows git:(master) ✗ kubectl get flyteworkflows.flyte.lyft.com -A
NAMESPACE                  NAME                   AGE
sre-data-prj-development   f1acbcb2b8146415ea8c   3m10s
sre-data-prj-development   f7b02a1ea30f14ee88d9   2m25s
sre-data-prj-development   fa6fed9e06bc74caea4f   14s
sre-data-prj-development   fcca4e8642f3d4879996   22h
sre-data-prj-development   fd99a060dc9f248b68e3   2m39s
sre-data-prj-development   fdf1d10a7e13f459f8ce   2m19s
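
When there are many executions, a listing like the one above can be turned into records for filtering or for feeding names into kubectl describe. A small Python sketch, assuming whitespace-separated columns with no spaces inside any value (the function name `parse_kubectl_table` is hypothetical; the sample rows are copied from the output above):

```python
def parse_kubectl_table(text):
    """Parse whitespace-aligned `kubectl get` output into a list of dicts
    keyed by the lower-cased column headers.

    Assumption: no column value contains whitespace.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    headers = [h.lower() for h in lines[0].split()]
    return [dict(zip(headers, ln.split())) for ln in lines[1:]]


SAMPLE = """\
NAMESPACE                  NAME                   AGE
sre-data-prj-development   f1acbcb2b8146415ea8c   3m10s
sre-data-prj-development   fcca4e8642f3d4879996   22h
"""

if __name__ == "__main__":
    for row in parse_kubectl_table(SAMPLE):
        print(row["namespace"], row["name"], row["age"])
```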

Describing one of them gives this output:

(ks3/sre-stage) ➜  workflows git:(master) ✗ kubectl describe flyteworkflows.flyte.lyft.com -n sre-data-prj-development fa6fed9e06bc74caea4f
Name:         fa6fed9e06bc74caea4f
Namespace:    sre-data-prj-development
Labels:       completed-time=2024-11-04.10
              domain=development
              execution-id=fa6fed9e06bc74caea4f
              project=sre-data-prj
              shard-key=17
              termination-status=terminated
              workflow-name=workflows-hello-world-hello-world-wf
Annotations:  <none>
Accepted At:  2024-11-04T10:30:09Z
API Version:  flyte.lyft.com/v1alpha1
Execution Config:
  Environment Variables:  <nil>
  Interruptible:          <nil>
  Max Parallelism:        25
  Overwrite Cache:        false
  Recovery Execution:
  Task Plugin Impls:
  Task Resources:
    Limits:
      CPU:                2
      Ephemeral Storage:  0
      GPU:                1
      Memory:             1Gi
      Storage:            0
    Requests:
      CPU:                2
      Ephemeral Storage:  0
      GPU:                0
      Memory:             200Mi
      Storage:            0
Execution Id:
  Domain:   development
  Name:     fa6fed9e06bc74caea4f
  Project:  sre-data-prj
Inputs:
Kind:  FlyteWorkflow
Metadata:
  Creation Timestamp:  2024-11-04T10:30:09Z
  Generation:          7
  Resource Version:    9576863087
  UID:                 6cdb9b37-0d5b-4889-838d-18eb904094c4
Node - Defaults:
Raw Output Data Config:
Security Context:
  run_as:
Spec:
  Connections:
    n0:
      end-node
    Start - Node:
      n0
  Edges:
    Downstream:
      n0:
        end-node
      Start - Node:
        n0
    Upstream:
      End - Node:
        n0
      n0:
        start-node
  Id:  sre-data-prj:development:workflows.hello_world.hello_world_wf
  Nodes:
    End - Node:
      Id:  end-node
      Input Bindings:
        Binding:
          Promise:
            Node Id:  n0
            Var:      o0
        Var:          o0
      Kind:           end
      Resources:
    n0:
      Id:    n0
      Kind:  task
      Name:  say_hello
      Resources:
      Task:  resource_type:TASK  project:"sre-data-prj"  domain:"development"  name:"workflows.hello_world.say_hello"  version:"nZUSJ6vQx7lxnJ4AVgTyZg"
    Start - Node:
      Id:    start-node
      Kind:  start
      Resources:
  Output Bindings:
    Binding:
      Promise:
        Node Id:  n0
        Var:      o0
    Var:          o0
  Outputs:
    Variables:
      o0:
        Type:
          Simple:  STRING
Status:
  Data Dir:     s3://sredata-flyte-meta-dev/metadata/propeller/sre-data-prj-development-fa6fed9e06bc74caea4f
  Def Version:  1
  Error:
    Code:           ExecutionNotFound
    Kind:           SYSTEM
    Message:        Workflow execution not found in flyteadmin.
  Failed Attempts:  3
  Last Updated At:  2024-11-04T10:30:09Z
  Node Status:
    Start - Node:
  Phase:       5
  Started At:  2024-11-04T10:30:09Z
  Stopped At:  2024-11-04T10:30:09Z
Tasks:
  resource_type:TASK  project:"sre-data-prj"  domain:"development"  name:"workflows.hello_world.say_hello"  version:"nZUSJ6vQx7lxnJ4AVgTyZg":
    Container:
      Args:
        pyflyte-fast-execute
        --additional-distribution
        s3://sredata-flyte-meta-dev/sre-data-prj/development/JR3VN3BXRDVO2EFUMIUGKLRXZE======/script_mode.tar.gz
        --dest-dir
        .
        --
        pyflyte-execute
        --inputs
        {{.input}}
        --output-prefix
        {{.outputPrefix}}
        --raw-output-data-prefix
        {{.rawOutputDataPrefix}}
        --checkpoint-path
        {{.checkpointOutputPrefix}}
        --prev-checkpoint
        {{.prevCheckpointPrefix}}
        --resolver
        flytekit.core.python_auto_container.default_task_resolver
        --
        task-module
        workflows.hello_world
        task-name
        say_hello
      Image:  cr.flyte.org/flyteorg/flytekit:py3.10-1.13.4
      Resources:
        Limits:
          Name:   CPU
          Value:  2
          Name:   MEMORY
          Value:  200Mi
        Requests:
          Name:   CPU
          Value:  2
          Name:   MEMORY
          Value:  200Mi
    Id:
      Domain:         development
      Name:           workflows.hello_world.say_hello
      Project:        sre-data-prj
      Resource Type:  TASK
      Version:        nZUSJ6vQx7lxnJ4AVgTyZg
    Interface:
      Inputs:
      Outputs:
        Variables:
          o0:
            Type:
              Simple:  STRING
    Metadata:
      Retries:
      Runtime:
        Flavor:   python
        Type:     FLYTE_SDK
        Version:  1.13.4
    Type:         python-task
Workflow Meta:
  Event Version:  2
Events:           <none>
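
The most informative part of this output is the Error block under Status (Code, Kind, Message). A hedged Python sketch that pulls those fields out of kubectl describe text, so the same check can be repeated across several executions; it relies only on the indentation layout visible above, and the function name `extract_status_error` is hypothetical:

```python
def extract_status_error(describe_text):
    """Return the fields of the first 'Error:' block in `kubectl describe`
    output as a dict with lower-cased keys, or None if there is no block.

    Assumption: fields belonging to the block are indented deeper than
    the 'Error:' line itself, as in the output shown in this thread.
    """
    lines = describe_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() != "Error:":
            continue
        indent = len(line) - len(line.lstrip())
        fields = {}
        for sub in lines[i + 1:]:
            sub_indent = len(sub) - len(sub.lstrip())
            # Stop at a blank line or at the next sibling/parent field.
            if not sub.strip() or sub_indent <= indent:
                break
            key, _, value = sub.strip().partition(":")
            fields[key.strip().lower()] = value.strip()
        return fields
    return None


SAMPLE = """\
Status:
  Def Version:  1
  Error:
    Code:           ExecutionNotFound
    Kind:           SYSTEM
    Message:        Workflow execution not found in flyteadmin.
  Failed Attempts:  3
"""

if __name__ == "__main__":
    print(extract_status_error(SAMPLE))
```

For the execution described above, this yields the code ExecutionNotFound with kind SYSTEM, meaning the error comes from the platform side rather than from the task's own code.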

@sorushsaghari
Author

Do you have any idea what might be causing this? @eapolinario @nkwangleiGIT

@nkwangleiGIT

Is there any Pod in the sre-data-prj-development namespace? Check using kubectl get pod -n sre-data-prj-development.
If not, I think we still need to check the logs of flyte-binary or flytepropeller for any error messages.

@eapolinario
Contributor

@pvditt, can you give us some pointers here? @sorushsaghari is running the single binary, and we see the FlyteWorkflow custom resources being created but no Pods. What might cause this?

@pvditt pvditt self-assigned this Nov 7, 2024
@pvditt pvditt added the backlogged For internal use. Reserved for contributor team workflow. label Nov 7, 2024
@sorushsaghari
Author

Hi everyone, I managed to fix it. There was an error in the workflow manifest. I had two Flyte deployments in two different namespaces, which caused some conflicts. I couldn't work out why, but after deleting the other one it worked.
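
For anyone hitting the same symptom, it may help to check whether more than one Flyte installation exists in the cluster. The thread does not establish why the second deployment interfered; the ExecutionNotFound error in the describe output earlier would be consistent with an installation picking up FlyteWorkflow objects it has no record of, but that is an inference, not something confirmed here. A rough Python sketch that groups Flyte-looking Deployments by namespace from the text of `kubectl get deployments -A -o json`; matching on the substring "flyte" is a heuristic assumption, the function name is hypothetical, and the sample data is made up for illustration:

```python
import json
from collections import defaultdict


def flyte_deployments_by_namespace(kubectl_json, needle="flyte"):
    """Group Deployment names containing `needle` by namespace.

    `kubectl_json` is the text of `kubectl get deployments -A -o json`.
    Name matching is only a heuristic; adjust `needle` to your release name.
    """
    grouped = defaultdict(list)
    for item in json.loads(kubectl_json).get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        if needle in name:
            grouped[meta.get("namespace", "")].append(name)
    return dict(grouped)


# Illustrative data only; these namespaces are not taken from the thread.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "flyte-binary", "namespace": "flyte"}},
    {"metadata": {"name": "flyte-binary", "namespace": "flyte-old"}},
    {"metadata": {"name": "coredns", "namespace": "kube-system"}},
]})

if __name__ == "__main__":
    found = flyte_deployments_by_namespace(SAMPLE)
    if len(found) > 1:
        print("More than one namespace contains a Flyte deployment:", found)
```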

@pvditt
Contributor

pvditt commented Nov 12, 2024

@sorushsaghari thank you for the follow-up. Please let us know if you run into any other issues.

@pvditt pvditt closed this as completed Nov 12, 2024