-
👋 Hello @egleorl1, thank you for sharing your interesting work with YOLOv8 🚀! Your project involving keypoint tracking of a robotic arm and human pose estimation sounds fascinating and impactful. While we can't provide a step-by-step solution here, I'd be happy to point you in the right direction and summarize the tools and resources available to you.

If this is a ❓ Custom Workflow Question, please provide additional details where possible, such as datasets, logs, or any relevant scripts you've already tried, so we can better understand your setup. For code-heavy topics, sharing a minimum reproducible example (MRE) helps us assist you more effectively.

If you're trying to implement 3D pose estimation using multi-view geometry and YOLO, make sure your installation is up to date before experimenting: pip install -U ultralytics

For real-time community conversations and feedback, I recommend joining our Discord 🎧. You can also explore our Subreddit or Discourse Forum for longer-form discussions; sharing your setup and asking for integration tips there might bring insights from others tackling similar problems. You're also encouraged to test YOLO in verified cloud environments, which are preconfigured with dependencies to help accelerate experimentation. Our CI tests also validate YOLO's 🟢 stability across various operating systems (Ultralytics CI) to ensure seamless usage for your development.

Lastly, this is an automated response 🤖, but rest assured that one of our engineers will review your discussion soon to provide further insights. Thank you for being part of the Ultralytics community!
-
@egleorl1 thank you for your question regarding integrating YOLO pose estimation with multi-view geometry for 3D keypoint reconstruction. While YOLO models provide accurate 2D keypoint detection, transitioning to 3D requires triangulation using a synchronized multi-camera setup and the corresponding camera calibration data (intrinsics and extrinsics). You'll need to:

1. Calibrate each camera's intrinsics and the extrinsics between the two views (for example, with a checkerboard pattern).
2. Run your YOLO pose models on synchronized frames from both cameras to obtain 2D keypoints per view.
3. Match corresponding keypoints across the two views and triangulate them into 3D coordinates.
For implementation, refer to OpenCV's documentation on camera calibration and triangulation. While Ultralytics does not directly support 3D reconstruction, combining YOLO's 2D outputs with such tools is a well-established approach. Let me know if you need further clarification.
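To make the triangulation step concrete, here is a minimal NumPy sketch of the classic direct linear transform (DLT); OpenCV's cv2.triangulatePoints performs the equivalent computation. The names triangulate_point, P1, and P2 are illustrative, and the projection matrices must come from your own stereo calibration; the 2D points would come from each camera's YOLO pose output (e.g. results[0].keypoints.xy).

```python
import numpy as np

def triangulate_point(P1, P2, pt1, pt2):
    """Triangulate one 3D point from two calibrated views via DLT.

    P1, P2: 3x4 projection matrices (K @ [R | t]) from stereo calibration.
    pt1, pt2: matching (x, y) pixel coordinates of the same keypoint,
        as detected by a YOLO pose model in each synchronized frame.
    """
    # Each view contributes two linear constraints on the homogeneous point X
    A = np.array([
        pt1[0] * P1[2] - P1[0],
        pt1[1] * P1[2] - P1[1],
        pt2[0] * P2[2] - P2[0],
        pt2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector
    # associated with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize to (x, y, z)
```

Looping this over the N keypoints detected in each synchronized frame pair, and collecting the results per timestamp, yields the 3D time series for both the robot and the human.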
-
@glenn-jocher thank you very much for your answer!
-
Sounds great! 😊 Could you please clarify why the ROS system is necessary for integrating YOLO with depth images? I'm quite new to computer vision and still having trouble visualizing the entire process. The final output I'm hoping for is to use my custom YOLO pose model on video captured by the RealSense, which, as you mentioned, synchronizes RGB and depth information. Then, I would like to save the coordinates extracted by YOLO (e.g., keypoints_normalized = r.keypoints.xyn.tolist()) and integrate them with the depth map to obtain z values for each keypoint. Is there an example of code I can customize for this purpose?
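A rough sketch of the depth lookup described above, assuming the depth frame has already been captured and aligned to the color stream (with a RealSense this is typically done via pyrealsense2's rs.align) and converted to a NumPy array. The names keypoints_to_xyz and depth_scale are illustrative, not Ultralytics API; the keypoint list is the r.keypoints.xyn output mentioned above for a single detected person or object.

```python
import numpy as np

def keypoints_to_xyz(keypoints_normalized, depth_map, depth_scale=1.0):
    """Attach a z value from an aligned depth map to each normalized keypoint.

    keypoints_normalized: list of [x, y] pairs in [0, 1], e.g. one entry of
        r.keypoints.xyn.tolist() from a YOLO pose result.
    depth_map: HxW array of raw depth readings aligned to the color image.
    depth_scale: factor converting raw depth units to meters.
    """
    h, w = depth_map.shape
    xyz = []
    for xn, yn in keypoints_normalized:
        # Denormalize to pixel coordinates, clamped to the image bounds
        px = min(int(xn * w), w - 1)
        py = min(int(yn * h), h - 1)
        z = float(depth_map[py, px]) * depth_scale
        xyz.append((xn * w, yn * h, z))
    return xyz
```

In practice, sampling a small neighborhood around (py, px) and taking the median is more robust than a single pixel, since depth maps often contain holes and noise at object edges.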
-
Hi! I trained a model for keypoint tracking of a robotic arm using YOLOv8 for pose estimation. I combined my custom-trained model with another model for human pose estimation (since in my setup, a human collaborates with a robotic arm during an assembly process).
I would like to extract the time series data of the movements of both the robot and the human to evaluate their coordination. However, using YOLO, I can only extract 2D information (the xy coordinates of each keypoint). From the related issues I've read, a good way to add depth (z) information to each keypoint, transitioning from 2D to 3D, is multi-view geometry and triangulation.
I do indeed have access to two cameras and would like to ask for your help in understanding how to best integrate YOLO pose estimation with multi-view geometry to obtain xyz information for each keypoint. A workflow explanation and code examples would be greatly appreciated :)