<!DOCTYPE html>
<html>

  <head lang="en">
    <meta charset="UTF-8">
    <meta http-equiv="x-ua-compatible" content="ie=edge">

    <title>Physically Grounded VLMs</title>

    <meta name="description" content="Physically Grounded VLMs">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <!-- <base href="/"> -->
    <!--     <link rel="apple-touch-icon" href="apple-touch-icon.png"> -->
    <!-- <link rel="icon" type="image/png" href="img/seal_icon.png"> -->
    <!-- Place favicon.ico in the root directory -->

    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.css">
    <link rel="stylesheet" href="css/app.css">

    <link rel="stylesheet" href="css/bootstrap.min.css">

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/1.5.3/clipboard.min.js"></script>
    
    <script src="js/app.js"></script>
    <!-- Google tag (gtag.js) -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=G-52J0PM8XKV"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());

      gtag('config', 'G-52J0PM8XKV');
    </script>
    
    <style>
      .nav-pills {
          position: relative;
          display: inline;
      }
      .imtip {
          position: absolute;
          top: 0;
          left: 0;
      }
    </style>
  </head>

  <body>
    <div class="topnav">
      <a href="index.html">Home</a>
      <div class="rightnav">
	<a href="scene1.html">Scene 1: Countertop</a>
	<a href="scene2.html">Scene 2: Art Table</a>
	<a href="scene3.html">Scene 3: Floor</a>
	<a href="scene4.html">Scene 4: Kitchen A</a>
	<a href="scene5.html">Scene 5: Kitchen B</a>
	<a href="scene6.html">Scene 6: Salad Bar</a>
	<a href="scene7.html">Scene 7: Living Room</a>
	<a href="scene8.html" class="active">Scene 8: Shelf</a>
	<a href="RS1/index.html">Robot Scene 1</a>
	<a href="RS2/index.html">Robot Scene 2</a>
      </div>
    </div>
    <div class="container" id="main">
      

      <div class="row">
        <div class="col-md-8 col-md-offset-2">
          <br>
          <h1>
			Scene 8: Shelf
          </h1>
	  <div class="irow">
	    <div class="strict_column">
	      <img src="img/eval2_scene8.png">
	    </div>	    
	    <div class="strict_column">
			<p> A: box [binder] </p>
			<p> B: bottle [large plastic tub] </p>
			<p> C: bottle [plastic bottle] </p>
			<p> D: box [algorithms textbook] </p>
			<p> E: pitcher (container) [blue metal cup] </p>
			<p> F: water glass [small glass cup] </p>
			<p style="color: #ff00ff"> G: headphones [phone cable] </p>
			<p> H: dumbbell [power brick] </p>
			<p style="color: #9900ff"> I: adhesive tape [ruler] </p>
	    </div>

	  </div>
	  <br>
	  <p class="text-justify">
	    This page shows one scene from our real-scene planning evaluation, together with one example task. The object detections and labels come from OWL-ViT and are all that our planner has access to; more precise labels are given in square brackets. Detections relevant to the example task are drawn with color-coded bounding boxes. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in <span style="color: #ff9900">orange</span>, and correct and incorrect plans are marked in <span style="color: #38761d">green</span> and <span style="color: #ff0000">red</span>, respectively.
	  </p>
	  
	  <h2><b>Task 48: Bring me the most bendable object.</b></h2>

	  <p class="text-justify">
	    This task tests whether the VLM understands which objects are bendable (a synonym for deformable). The best choice in this scene is <span style="color: #ff00ff">G</span>, a cable for charging devices.
	  </p>

	  <h3>Generated Plans</h3>

	  <div class="irow">
	    <div class="tcolumn">
	      <h3> No VLM </h3>
	      <p> Thought: The most bendable object among the given objects is the adhesive tape. </p>
	      <p> Plan: </p>
	      <p style="color: #ff0000">
			1. Go to object I <br>
			2. Pick up object I <br>
			3. Bring to human object I <br>
			4. Done</p>
	    </div>
	    
	    <div class="tcolumn">
	      <h3> InstructBLIP </h3>
	      <p> Thought: I can ask which object is bendable out of all the objects, and I can find the one with the largest Yes value. </p>
	      <p> Question about [A, B, C, D, E, F, G, H, I]: Is this object bendable? </p>
	      <p style="color: #ff9900"> Answer <br>
			A: Yes (0.53), No (0.42), Unknown (0.03) <br>
			B: No (0.65), Yes (0.29), Unknown (0.03) <br>
			C: No (0.72), Yes (0.23), Unknown (0.03) <br>
			D: No (0.6), Yes (0.36), Unknown (0.02) <br>
			E: No (0.85), Yes (0.11), Unknown (0.02) <br>
			F: No (0.62), Yes (0.29), Unknown (0.05) <br>
			G: Yes (0.63), No (0.33), Unknown (0.02) <br>
			H: No (0.51), Yes (0.42), Unknown (0.04) <br>
			I: No (0.53), Yes (0.44), Unknown (0.01)
	      </p>
	      <p> Thought: The most bendable object is the one with the highest Yes value or the lowest No value (if none are Yes). The most bendable object is G (headphones) with the highest Yes value. </p>
	      <p> Plan: </p>
	      <p style="color: #38761d">
			1. Go to object G <br>
			2. Pick up object G <br>
			3. Bring to human object G <br>
			4. Done
	      </p>
	    </div>

	    <div class="tcolumn">
	      <h3> PG-InstructBLIP (ours) </h3>
	      <p> Thought: I can ask which object is bendable out of all the objects, and I can find the one with the largest Yes value. </p>
	      <p> Question about [A, B, C, D, E, F, G, H, I]: Is this object bendable? </p>
	      <p style="color: #ff9900"> Answer <br>
			A: Yes (0.6), No (0.38), Unknown (0.01) <br>
			B: No (0.7), Yes (0.26), Unknown (0.02) <br>
			C: No (0.62), Yes (0.35), Unknown (0.02) <br>
			D: No (0.52), Yes (0.45), Unknown (0.02) <br>
			E: No (0.95), Yes (0.03), Unknown (0.0) <br>
			F: No (0.86), Yes (0.11), Unknown (0.01) <br>
			G: Yes (0.98), No (0.02) <br>
			H: No (0.73), Yes (0.24), Unknown (0.02) <br>
			I: Yes (0.56), No (0.42), Unknown (0.01)
	      </p>
	      <p> Thought: The most bendable object is the one with the highest Yes value. The highest Yes value is for object G (headphones). </p>
	      <p> Plan: </p>
	      <p style="color: #38761d">
			1. Go to object G <br>
			2. Pick up object G <br>
			3. Bring to human object G <br>
			4. Done
	      </p>
	    </div>
	  </div>
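	  <p class="text-justify">
	    The selection rule that the VLM-based planners state in their thoughts above (pick the object with the largest Yes value) can be sketched in a few lines. This is only an illustration, not the planner's actual code: the function name <code>most_likely_yes</code> and the dictionary layout are assumptions, and the scores are copied from the PG-InstructBLIP column.
	  </p>
<pre>
```python
# Hypothetical helper, not part of the authors' planner: it only illustrates
# the "largest Yes value" rule stated in the planner's thoughts.
def most_likely_yes(scores):
    """Return the object label whose answer gives "Yes" the highest probability."""
    # An answer that lacks a "Yes" entry is treated as probability 0.0.
    return max(scores, key=lambda label: scores[label].get("Yes", 0.0))


# Answers to "Is this object bendable?" copied from the PG-InstructBLIP column.
pg_instructblip = {
    "A": {"Yes": 0.60, "No": 0.38, "Unknown": 0.01},
    "B": {"Yes": 0.26, "No": 0.70, "Unknown": 0.02},
    "C": {"Yes": 0.35, "No": 0.62, "Unknown": 0.02},
    "D": {"Yes": 0.45, "No": 0.52, "Unknown": 0.02},
    "E": {"Yes": 0.03, "No": 0.95, "Unknown": 0.00},
    "F": {"Yes": 0.11, "No": 0.86, "Unknown": 0.01},
    "G": {"Yes": 0.98, "No": 0.02},
    "H": {"Yes": 0.24, "No": 0.73, "Unknown": 0.02},
    "I": {"Yes": 0.56, "No": 0.42, "Unknown": 0.01},
}

print(most_likely_yes(pg_instructblip))  # prints G
```
</pre>
	  <p class="text-justify">
	    With these scores the sketch selects object G, which matches the plan shown for PG-InstructBLIP.
	  </p>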
	</div>
      </div>
    </div>
  </body>
</html>