Performing a linear regression produces an 'IndexError: tuple index out of range' result #66

Open
kburchfiel opened this issue Jan 31, 2025 · 3 comments

kburchfiel commented Jan 31, 2025

Hi there,

I am attempting to perform a weighted linear regression using version 0.4.34 of Samplics. Here is the entirety of my code (which should be reproducible on your end):

import pandas as pd
from samplics.regression import SurveyGLM

df_car_survey = pd.read_csv(
    'https://raw.githubusercontent.com/ifstudies/\
carsurveydata/refs/heads/main/car_survey.csv')

df_car_survey['Enjoy_Driving_Fast_Int'] = (
    df_car_survey['Enjoy_Driving_Fast'].map(
    {'Strongly Agree':5, 'Agree':4, 'Slightly Agree':3, 
     'Slightly Disagree':2, 'Disagree':1, 'Strongly Disagree':0}))

df_car_survey = pd.concat([df_car_survey, pd.get_dummies(
    df_car_survey['Car_Color'], dtype = 'int')],
         axis = 1)
print(df_car_survey.head())

# The following code was based on Samplics' GLM source code, available at
# https://github.com/samplics-org/samplics/blob/main/src/
# samplics/regression/glm.py

slr = SurveyGLM()
slr.estimate(y = df_car_survey['Enjoy_Driving_Fast_Int'],
             x = df_car_survey['Black'],
            samp_weight = df_car_survey['Weight'])

Here's the output of print(df_car_survey.head()) for reference:

  Car_Color    Weight Enjoy_Driving_Fast  Count  Response_Sort_Map  \
0       Red  1.975884     Strongly Agree      1                  0   
1       Red  0.943725     Strongly Agree      1                  0   
2       Red  1.342593     Strongly Agree      1                  0   
3       Red  1.704274     Strongly Agree      1                  0   
4       Red  0.348622     Strongly Agree      1                  0   

   Enjoy_Driving_Fast_Int  Black  Red  White  
0                       5      0    1      0  
1                       5      0    1      0  
2                       5      0    1      0  
3                       5      0    1      0  
4                       5      0    1      0 

When I try to run slr.estimate(), I receive the following error and traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[1], line 23
     18 # The following code was based on Samplics' GLM source code, available at
     19 # https://github.com/samplics-org/samplics/blob/main/src/
     20 # samplics/regression/glm.py
     22 slr = SurveyGLM()
---> 23 slr.estimate(y = df_car_survey['Enjoy_Driving_Fast_Int'],
     24              x = df_car_survey['Black'],
     25             samp_weight = df_car_survey['Weight'])

File ~/miniforge3/envs/ifs/lib/python3.13/site-packages/samplics/regression/glm.py:100, in SurveyGLM.estimate(self, y, x, samp_weight, stratum, psu, fpc, remove_nan)
     97 glm_model = sm.GLM(endog=_y, exog=_x, var_weights=_samp_weight)
     98 glm_results = glm_model.fit()
--> 100 g = self._calculate_g(
    101     samp_weight=_samp_weight,
    102     resid=glm_results.resid_response,
    103     x=_x,
    104     stratum=_stratum,
    105     psu=_psu,
    106     fpc=self.fpc,
    107     glm_scale=glm_results.scale,
    108 )
    110 d = glm_results.cov_params()
    112 self.beta = glm_results.params

File ~/miniforge3/envs/ifs/lib/python3.13/site-packages/samplics/regression/glm.py:55, in SurveyGLM._calculate_g(self, samp_weight, resid, x, stratum, psu, fpc, glm_scale)
     53     psu = np.arange(e.shape[0])
     54 if stratum.shape in ((), (0,)):
---> 55     e_h, n_h = self._residuals(e=e, psu=psu, nb_vars=x.shape[1])
     56     return fpc * (n_h / (n_h - 1)) * e_h
     57 else:

IndexError: tuple index out of range

It appears that the code is attempting to access the second element of df_car_survey['Black'].shape (the x.shape[1] lookup on line 55 of glm.py). However, this shape equals (1059,), so there is no second element.
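The shape mismatch can be reproduced without Samplics. The sketch below is illustrative only: it uses NumPy arrays as stand-ins for the pandas column, and the helper `n_columns` is a hypothetical name that mimics the `x.shape[1]` lookup from the traceback. In pandas, selecting with double brackets (`df_car_survey[['Black']]`) yields a single-column DataFrame of shape (1059, 1) rather than a 1-D Series. Whether SurveyGLM.estimate() then runs correctly with such an input is an untested assumption.

```python
import numpy as np

n_rows = 1059  # number of rows reported in the issue

# Stand-in for df_car_survey['Black']: a pandas Series is 1-D, like this array.
x_1d = np.zeros(n_rows)
# Stand-in for a single-column 2-D input (e.g. df_car_survey[['Black']]).
x_2d = x_1d.reshape(-1, 1)


def n_columns(x):
    """Hypothetical helper that mimics the `x.shape[1]` lookup in the traceback."""
    return x.shape[1]


try:
    n_columns(x_1d)
    error_message = None
except IndexError as exc:
    # A 1-D shape tuple has only one element, so index 1 does not exist.
    error_message = str(exc)

n_cols_2d = n_columns(x_2d)

print(x_1d.shape, error_message)
print(x_2d.shape, n_cols_2d)
```

The 1-D input fails at the same lookup as in the traceback, while the 2-D input reports one column.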

Thanks in advance for your assistance! Also, I imagine your time is very limited, but adding a documentation page on linear regressions would be a huge help.

@MamadouSDiallo
Contributor

Regression and SAE are not ready for use. I should add a "not implemented" tag or something similar until they're ready.

@kburchfiel
Author

Understood! Thank you for the heads up. I imagine your time is quite limited, but if you could let me know when the regression package is ready (perhaps by commenting within this thread), that would be great.

@kburchfiel
Author

kburchfiel commented Feb 4, 2025

This update will be especially exciting because I'm not aware of any other Python library that can easily report p-values and test statistics for logistic regressions of data with sample weights. (Statsmodels has a 'freq_weights' parameter, but frequency weights are a different concept from the sample weights that Samplics uses.) I can use R's survey and srvyr packages via rpy2 in the meantime, but being able to do all of my weighted survey analyses directly in Python would be great!
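A rough illustration of why frequency weights are not interchangeable with sample weights: frequency weights are treated as replicate counts, so the implied sample size is the sum of the weights. The sketch below is a hand-written weighted least squares routine in NumPy on simulated data. It is not the Statsmodels or Samplics implementation, and `wls_freq` is a hypothetical helper. Multiplying every weight by 10 leaves the coefficients unchanged but shrinks the naive standard errors, which would be inappropriate for survey weights that only express relative representation.

```python
import numpy as np


def wls_freq(X, y, w):
    """Weighted least squares that treats w as frequency (replicate-count) weights."""
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))
    resid = y - X @ beta
    dof = w.sum() - X.shape[1]  # frequency weights imply n = sum(w)
    sigma2 = (w * resid**2).sum() / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(XtWX)))
    return beta, se


# Simulated data (not the car survey): a binary predictor and arbitrary weights.
rng = np.random.default_rng(0)
n = 200
x = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.5 * x + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)

beta_1, se_1 = wls_freq(X, y, w)
beta_10, se_10 = wls_freq(X, y, 10 * w)  # same relative weights, scaled by 10

print("coefficients equal:", np.allclose(beta_1, beta_10))
print("SE ratio (scaled / original):", se_10 / se_1)
```

Design-based variance estimation, as in R's survey package, is meant to avoid this dependence on the overall scale of the weights.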
