Valgrind: use of uninitialized value #13047

Closed
abkein opened this issue Jan 22, 2025 · 4 comments · Fixed by #13056

Comments

@abkein

abkein commented Jan 22, 2025

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.6 from https://www.open-mpi.org/software/ompi/v4.1/

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • tar xvf openmpi-4.1.6.tar.gz
  • cd openmpi-4.1.6 && mkdir build && cd build
  • ./configure --prefix=/special/place/for/install --enable-debug --enable-debug-symbols --with-pmi
  • make -j 16
  • make install
  • export MPI_HOME=/special/place/for/install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

No

Please describe the system on which you are running

  • Operating system/version: SLES 12.3
  • Computer hardware: AMD EPYC 7351P 16-Core Processor
  • Network type: InfiniBand ports

Details of the problem

I use this code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n", processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

Compile

mpicc -g -o hello_world hello_world.c

Run

sbatch --partition=test --ntasks-per-node=1 --wrap "srun -u valgrind --track-origins=yes ./hello_world"

Output:

==31946== Memcheck, a memory error detector
==31946== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==31946== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==31946== Command: ./hello_world
==31946== 
==31946== Conditional jump or move depends on uninitialised value(s)
==31946==    at 0xFDA5CAF: init_one_device (btl_openib_component.c:1956)
==31946==    by 0xFDA7F24: btl_openib_component_init (btl_openib_component.c:2880)
==31946==    by 0x5B9BCFB: mca_btl_base_select (btl_base_select.c:110)
==31946==    by 0xF76F53A: mca_bml_r2_component_init (bml_r2_component.c:86)
==31946==    by 0x4F227A5: mca_bml_base_init (bml_base_init.c:74)
==31946==    by 0x4F8898C: ompi_mpi_init (ompi_mpi_init.c:613)
==31946==    by 0x4EE88C1: PMPI_Init (pinit.c:67)
==31946==    by 0x4008D2: main (hello_world.c:5)
==31946==  Uninitialised value was created by a stack allocation
==31946==    at 0xFDB671F: parse_file (btl_openib_ini.c:221)
==31946== 
Hello world from processor host18, rank 0 out of 1 processors
==31946== Conditional jump or move depends on uninitialised value(s)
==31946==    at 0xFD9D32D: mca_btl_openib_finalize_resources (btl_openib.c:1715)
==31946==    by 0xFD9D4F6: mca_btl_openib_finalize (btl_openib.c:1743)
==31946==    by 0x5B9B296: mca_btl_base_close (btl_base_frame.c:203)
==31946==    by 0x5B7EB9F: mca_base_framework_close (mca_base_framework.c:216)
==31946==    by 0x4F22926: mca_bml_base_close (bml_base_frame.c:130)
==31946==    by 0x5B7EB9F: mca_base_framework_close (mca_base_framework.c:216)
==31946==    by 0x4E9CA4A: ompi_mpi_finalize (ompi_mpi_finalize.c:449)
==31946==    by 0x4EDB0EC: PMPI_Finalize (pfinalize.c:54)
==31946==    by 0x400931: main (hello_world.c:20)
==31946==  Uninitialised value was created by a stack allocation
==31946==    at 0xFDB671F: parse_file (btl_openib_ini.c:221)
==31946== 
==31946== Conditional jump or move depends on uninitialised value(s)
==31946==    at 0x5B32AEF: opal_interval_tree_reader_get_token (opal_interval_tree.c:127)
==31946==    by 0x5B34088: opal_interval_tree_traverse (opal_interval_tree.c:734)
==31946==    by 0x5BFE5DB: mca_rcache_base_vma_tree_iterate (rcache_base_vma_tree.c:105)
==31946==    by 0x5BFE1B9: mca_rcache_base_vma_iterate (rcache_base_vma.c:153)
==31946==    by 0xF361305: mca_rcache_grdma_finalize (rcache_grdma_module.c:543)
==31946==    by 0x5BFDB6F: mca_rcache_base_module_destroy (rcache_base_create.c:113)
==31946==    by 0xFDA2A2C: device_destruct (btl_openib_component.c:993)
==31946==    by 0xFD9735E: opal_obj_run_destructors (opal_object.h:483)
==31946==    by 0xFD9D3F5: mca_btl_openib_finalize_resources (btl_openib.c:1716)
==31946==    by 0xFD9D4F6: mca_btl_openib_finalize (btl_openib.c:1743)
==31946==    by 0x5B9B296: mca_btl_base_close (btl_base_frame.c:203)
==31946==    by 0x5B7EB9F: mca_base_framework_close (mca_base_framework.c:216)
==31946==  Uninitialised value was created by a heap allocation
==31946==    at 0x4C29110: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==31946==    by 0x5BFDDEB: opal_obj_new (opal_object.h:507)
==31946==    by 0x5BFDC93: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0x5BFDFB3: mca_rcache_base_vma_module_alloc (rcache_base_vma.c:56)
==31946==    by 0xF36011E: mca_rcache_grdma_cache_contructor (rcache_grdma_module.c:88)
==31946==    by 0xF3615F6: opal_obj_run_constructors (opal_object.h:461)
==31946==    by 0xF361711: opal_obj_new (opal_object.h:515)
==31946==    by 0xF36156B: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0xF361D03: grdma_init (rcache_grdma_component.c:123)
==31946==    by 0x5BFDA4C: mca_rcache_base_module_create (rcache_base_create.c:87)
==31946==    by 0xFDA590E: init_one_device (btl_openib_component.c:1877)
==31946==    by 0xFDA7F24: btl_openib_component_init (btl_openib_component.c:2880)
==31946== 
==31946== Use of uninitialised value of size 8
==31946==    at 0x5B31CBC: opal_thread_compare_exchange_strong_32 (thread_usage.h:160)
==31946==    by 0x5B32B34: opal_interval_tree_reader_get_token (opal_interval_tree.c:134)
==31946==    by 0x5B34088: opal_interval_tree_traverse (opal_interval_tree.c:734)
==31946==    by 0x5BFE5DB: mca_rcache_base_vma_tree_iterate (rcache_base_vma_tree.c:105)
==31946==    by 0x5BFE1B9: mca_rcache_base_vma_iterate (rcache_base_vma.c:153)
==31946==    by 0xF361305: mca_rcache_grdma_finalize (rcache_grdma_module.c:543)
==31946==    by 0x5BFDB6F: mca_rcache_base_module_destroy (rcache_base_create.c:113)
==31946==    by 0xFDA2A2C: device_destruct (btl_openib_component.c:993)
==31946==    by 0xFD9735E: opal_obj_run_destructors (opal_object.h:483)
==31946==    by 0xFD9D3F5: mca_btl_openib_finalize_resources (btl_openib.c:1716)
==31946==    by 0xFD9D4F6: mca_btl_openib_finalize (btl_openib.c:1743)
==31946==    by 0x5B9B296: mca_btl_base_close (btl_base_frame.c:203)
==31946==  Uninitialised value was created by a heap allocation
==31946==    at 0x4C29110: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==31946==    by 0x5BFDDEB: opal_obj_new (opal_object.h:507)
==31946==    by 0x5BFDC93: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0x5BFDFB3: mca_rcache_base_vma_module_alloc (rcache_base_vma.c:56)
==31946==    by 0xF36011E: mca_rcache_grdma_cache_contructor (rcache_grdma_module.c:88)
==31946==    by 0xF3615F6: opal_obj_run_constructors (opal_object.h:461)
==31946==    by 0xF361711: opal_obj_new (opal_object.h:515)
==31946==    by 0xF36156B: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0xF361D03: grdma_init (rcache_grdma_component.c:123)
==31946==    by 0x5BFDA4C: mca_rcache_base_module_create (rcache_base_create.c:87)
==31946==    by 0xFDA590E: init_one_device (btl_openib_component.c:1877)
==31946==    by 0xFDA7F24: btl_openib_component_init (btl_openib_component.c:2880)
==31946== 
==31946== Use of uninitialised value of size 8
==31946==    at 0x5B31CCF: opal_thread_compare_exchange_strong_32 (thread_usage.h:160)
==31946==    by 0x5B32B34: opal_interval_tree_reader_get_token (opal_interval_tree.c:134)
==31946==    by 0x5B34088: opal_interval_tree_traverse (opal_interval_tree.c:734)
==31946==    by 0x5BFE5DB: mca_rcache_base_vma_tree_iterate (rcache_base_vma_tree.c:105)
==31946==    by 0x5BFE1B9: mca_rcache_base_vma_iterate (rcache_base_vma.c:153)
==31946==    by 0xF361305: mca_rcache_grdma_finalize (rcache_grdma_module.c:543)
==31946==    by 0x5BFDB6F: mca_rcache_base_module_destroy (rcache_base_create.c:113)
==31946==    by 0xFDA2A2C: device_destruct (btl_openib_component.c:993)
==31946==    by 0xFD9735E: opal_obj_run_destructors (opal_object.h:483)
==31946==    by 0xFD9D3F5: mca_btl_openib_finalize_resources (btl_openib.c:1716)
==31946==    by 0xFD9D4F6: mca_btl_openib_finalize (btl_openib.c:1743)
==31946==    by 0x5B9B296: mca_btl_base_close (btl_base_frame.c:203)
==31946==  Uninitialised value was created by a heap allocation
==31946==    at 0x4C29110: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==31946==    by 0x5BFDDEB: opal_obj_new (opal_object.h:507)
==31946==    by 0x5BFDC93: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0x5BFDFB3: mca_rcache_base_vma_module_alloc (rcache_base_vma.c:56)
==31946==    by 0xF36011E: mca_rcache_grdma_cache_contructor (rcache_grdma_module.c:88)
==31946==    by 0xF3615F6: opal_obj_run_constructors (opal_object.h:461)
==31946==    by 0xF361711: opal_obj_new (opal_object.h:515)
==31946==    by 0xF36156B: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0xF361D03: grdma_init (rcache_grdma_component.c:123)
==31946==    by 0x5BFDA4C: mca_rcache_base_module_create (rcache_base_create.c:87)
==31946==    by 0xFDA590E: init_one_device (btl_openib_component.c:1877)
==31946==    by 0xFDA7F24: btl_openib_component_init (btl_openib_component.c:2880)
==31946== 
==31946== Use of uninitialised value of size 8
==31946==    at 0x5B32B5D: opal_interval_tree_reader_return_token (opal_interval_tree.c:142)
==31946==    by 0x5B340D7: opal_interval_tree_traverse (opal_interval_tree.c:736)
==31946==    by 0x5BFE5DB: mca_rcache_base_vma_tree_iterate (rcache_base_vma_tree.c:105)
==31946==    by 0x5BFE1B9: mca_rcache_base_vma_iterate (rcache_base_vma.c:153)
==31946==    by 0xF361305: mca_rcache_grdma_finalize (rcache_grdma_module.c:543)
==31946==    by 0x5BFDB6F: mca_rcache_base_module_destroy (rcache_base_create.c:113)
==31946==    by 0xFDA2A2C: device_destruct (btl_openib_component.c:993)
==31946==    by 0xFD9735E: opal_obj_run_destructors (opal_object.h:483)
==31946==    by 0xFD9D3F5: mca_btl_openib_finalize_resources (btl_openib.c:1716)
==31946==    by 0xFD9D4F6: mca_btl_openib_finalize (btl_openib.c:1743)
==31946==    by 0x5B9B296: mca_btl_base_close (btl_base_frame.c:203)
==31946==    by 0x5B7EB9F: mca_base_framework_close (mca_base_framework.c:216)
==31946==  Uninitialised value was created by a heap allocation
==31946==    at 0x4C29110: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==31946==    by 0x5BFDDEB: opal_obj_new (opal_object.h:507)
==31946==    by 0x5BFDC93: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0x5BFDFB3: mca_rcache_base_vma_module_alloc (rcache_base_vma.c:56)
==31946==    by 0xF36011E: mca_rcache_grdma_cache_contructor (rcache_grdma_module.c:88)
==31946==    by 0xF3615F6: opal_obj_run_constructors (opal_object.h:461)
==31946==    by 0xF361711: opal_obj_new (opal_object.h:515)
==31946==    by 0xF36156B: opal_obj_new_debug (opal_object.h:263)
==31946==    by 0xF361D03: grdma_init (rcache_grdma_component.c:123)
==31946==    by 0x5BFDA4C: mca_rcache_base_module_create (rcache_base_create.c:87)
==31946==    by 0xFDA590E: init_one_device (btl_openib_component.c:1877)
==31946==    by 0xFDA7F24: btl_openib_component_init (btl_openib_component.c:2880)
==31946== 
==31946== 
==31946== HEAP SUMMARY:
==31946==     in use at exit: 568,113 bytes in 6,896 blocks
==31946==   total heap usage: 46,678 allocs, 39,782 frees, 846,341,962 bytes allocated
==31946== 
==31946== LEAK SUMMARY:
==31946==    definitely lost: 28,700 bytes in 66 blocks
==31946==    indirectly lost: 10,455 bytes in 23 blocks
==31946==      possibly lost: 1,768 bytes in 2 blocks
==31946==    still reachable: 527,190 bytes in 6,805 blocks
==31946==         suppressed: 0 bytes in 0 blocks
==31946== Rerun with --leak-check=full to see details of leaked memory
==31946== 
==31946== For counts of detected and suppressed errors, rerun with: -v
==31946== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 0 from 0)
@abkein abkein changed the title from "Valgrind: using uninitialized value" to "Valgrind: use of uninitialized value" Jan 22, 2025
@devreal
Contributor

devreal commented Jan 22, 2025

@abkein Thanks for the report. #13049 addresses the vma part of this. The first warnings come from the openib component that was removed in Open MPI 5.0 so I'm not sure it's worth digging into that. Could you please try the patch to see whether that fixes the vma reports?

@bosilca
Member

bosilca commented Jan 22, 2025

@abkein can you please try the following patch:

diff --git a/opal/class/opal_interval_tree.c b/opal/class/opal_interval_tree.c
index 110dccdacc..0df81cf089 100644
--- a/opal/class/opal_interval_tree.c
+++ b/opal/class/opal_interval_tree.c
@@ -81,6 +81,7 @@ static void opal_interval_tree_construct (opal_interval_tree_t *tree)
     tree->tree_size = 0;
     tree->lock = 0;
     tree->reader_count = 0;
+    tree->reader_id = 0;
     tree->epoch = 0;
 
     /* set all reader epochs to UINT_MAX. this value is used to simplfy
diff --git a/opal/mca/rcache/base/rcache_base_vma.h b/opal/mca/rcache/base/rcache_base_vma.h
index 261cad6719..bf066441c3 100644
--- a/opal/mca/rcache/base/rcache_base_vma.h
+++ b/opal/mca/rcache/base/rcache_base_vma.h
@@ -43,9 +43,6 @@ struct mca_rcache_base_registration_t;
 struct mca_rcache_base_vma_module_t {
     opal_object_t super;
     opal_interval_tree_t tree;
-    opal_list_t vma_list;
-    opal_lifo_t vma_gc_lifo;
-    size_t reg_cur_cache_size;
     opal_mutex_t vma_lock;
 };
 typedef struct mca_rcache_base_vma_module_t mca_rcache_base_vma_module_t;
diff --git a/opal/mca/rcache/base/rcache_base_vma_tree.c b/opal/mca/rcache/base/rcache_base_vma_tree.c
index 09362f4f2b..d261e03ce9 100644
--- a/opal/mca/rcache/base/rcache_base_vma_tree.c
+++ b/opal/mca/rcache/base/rcache_base_vma_tree.c
@@ -34,7 +34,6 @@
 int mca_rcache_base_vma_tree_init (mca_rcache_base_vma_module_t *vma_module)
 {
     OBJ_CONSTRUCT(&vma_module->tree, opal_interval_tree_t);
-    vma_module->reg_cur_cache_size = 0;
     return opal_interval_tree_init (&vma_module->tree);
 }
 
diff --git a/opal/mca/rcache/grdma/rcache_grdma_module.c b/opal/mca/rcache/grdma/rcache_grdma_module.c
index 7bf1748e47..e3058a1001 100644
--- a/opal/mca/rcache/grdma/rcache_grdma_module.c
+++ b/opal/mca/rcache/grdma/rcache_grdma_module.c
@@ -80,11 +80,10 @@ static int check_for_cuda_freed_memory(mca_rcache_base_module_t *rcache, void *a
 #endif /* OPAL_CUDA_GDR_SUPPORT */
 static void mca_rcache_grdma_cache_contructor (mca_rcache_grdma_cache_t *cache)
 {
-    memset ((void *)((uintptr_t)cache + sizeof (cache->super)), 0, sizeof (*cache) - sizeof (cache->super));
-
     OBJ_CONSTRUCT(&cache->lru_list, opal_list_t);
     OBJ_CONSTRUCT(&cache->gc_lifo, opal_lifo_t);
 
+    cache->cache_name = NULL;
     cache->vma_module = mca_rcache_base_vma_module_alloc ();
 }
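
For readers following along, here is a minimal sketch of the pattern the `reader_id` hunk addresses. It is an illustration only: `demo_tree_t`, `demo_tree_construct` and `demo_tree_get_token` are hypothetical names, not Open MPI code, and the reader logic is an assumption made up for the example. The point is that an object obtained from `malloc` holds indeterminate bytes, so any field the constructor skips will be flagged by Valgrind the first time something branches on it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tree object; NOT the real opal_interval_tree_t. */
typedef struct {
    size_t   tree_size;
    int32_t  lock;
    int32_t  reader_count;
    int32_t  reader_id;
    uint32_t epoch;
} demo_tree_t;

/* Constructor that assigns every field explicitly. */
static void demo_tree_construct(demo_tree_t *tree)
{
    assert(NULL != tree);
    tree->tree_size = 0;
    tree->lock = 0;
    tree->reader_count = 0;
    tree->reader_id = 0;   /* leaving this out keeps whatever malloc returned */
    tree->epoch = 0;
}

/* Hypothetical reader: it branches on reader_id, which is exactly the kind
 * of access Valgrind reports as "Conditional jump or move depends on
 * uninitialised value(s)" when the field was never written. */
static int demo_tree_get_token(const demo_tree_t *tree)
{
    return (tree->reader_id < 8) ? tree->reader_id : -1;
}
```

Allocating a `demo_tree_t` with `malloc`, constructing it, and calling `demo_tree_get_token` should run clean under Valgrind; deleting the `reader_id` assignment should bring back a report of the same kind as in the log above.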
 

@abkein
Author

abkein commented Jan 22, 2025

It turned out these were all false positives (probably). Compiling with --enable-memchecker --with-valgrind=/... removes all the error messages.

@bosilca, yes, the patch resolved the related errors when compiling without the --enable-memchecker --with-valgrind=/... flags.

@bosilca
Member

bosilca commented Jan 22, 2025

OK, so they were not false positives. The root cause is that the rcache modules are lazily initialized (on first use), so when they are never used they do not finalize correctly. I will create a PR with this.
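
A rough sketch of the general pattern described here, under stated assumptions: `demo_cache_t` and its functions are hypothetical and this is not the change that went into the PR. It only shows one common way to make a lazily initialized object safe to finalize when it was never used: the constructor puts every field into a known state, and finalize checks a flag before touching the deferred resources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_CACHE_CAPACITY 16

/* Hypothetical cache object; names are illustrative, NOT Open MPI's. */
typedef struct {
    bool   initialized;  /* false until the first insert */
    int   *entries;      /* allocated lazily */
    size_t count;
} demo_cache_t;

/* The constructor defers the expensive setup but still defines every field. */
static demo_cache_t *demo_cache_new(void)
{
    demo_cache_t *cache = malloc(sizeof(*cache));
    if (NULL == cache) {
        return NULL;
    }
    cache->initialized = false;
    cache->entries = NULL;
    cache->count = 0;
    return cache;
}

/* First use performs the real initialization. Returns 0 on success. */
static int demo_cache_insert(demo_cache_t *cache, int value)
{
    assert(NULL != cache);
    if (!cache->initialized) {
        cache->entries = calloc(DEMO_CACHE_CAPACITY, sizeof(int));
        if (NULL == cache->entries) {
            return -1;
        }
        cache->initialized = true;
    }
    if (cache->count >= DEMO_CACHE_CAPACITY) {
        return -1;
    }
    cache->entries[cache->count++] = value;
    return 0;
}

/* Finalize only walks state that was actually set up.
 * Returns the number of entries released. */
static size_t demo_cache_finalize(demo_cache_t *cache)
{
    size_t released = 0;
    assert(NULL != cache);
    if (cache->initialized) {
        released = cache->count;
        free(cache->entries);
    }
    free(cache);
    return released;
}
```

With this shape, creating a cache and finalizing it immediately reads only fields the constructor wrote, which is the case the hello-world run above exercises (the cache is created during init and destroyed during finalize without being used in between).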

@jsquyres jsquyres modified the milestones: v4.1.7, v4.1.8 Jan 23, 2025
bosilca added a commit to bosilca/ompi that referenced this issue Jan 27, 2025
Fixes open-mpi#13047

Signed-off-by: George Bosilca <[email protected]>
(cherry picked from commit ab13add)