PR to record the prevalence of disease/condition, still births, neonatal death, maternal mortality #1455

Open
wants to merge 688 commits into
base: master

RachelMurray-Watson
Collaborator

@RachelMurray-Watson RachelMurray-Watson commented Aug 15, 2024

The point prevalence is recorded for a number of modules, and for conditions within modules: Alri, BladderCancer, BreastCancer, CardioMetabolicDisorders (chronic_ischemic_hd, chronic_kidney_disease, chronic_lower_back_pain, diabetes, hypertension), COPD, Depression, Diarrhoea, Epilepsy, Hiv, Labor (Intrapartum stillbirth), Malaria, Measles, NewbornOutcomes, OesophagealCancer, OtherAdultCancer, PostnatalSupervisor, PregnancySupervisor (Antenatal stillbirth), ProstateCancer, RTI, Schisto, TB, and Demography (maternal_deaths, newborn_deaths).

Additional questions:
- Is it okay to calculate the prevalence of diarrhoea? It is not really a disease in its own right, more of a symptom.
- Some modules (e.g. RTI) may be more accurately described by incidence than by prevalence. Is calculating incidence useful/okay, or should these be skipped?

Other notes:
- COPD is defined as ch_lung_function > 3.
- Events in the CardioMetabolic module (ever_heart_attack and ever_stroke) are not included, as these would give a cumulative incidence among living people who have had such events rather than a prevalence.

@RachelMurray-Watson RachelMurray-Watson marked this pull request as ready for review August 22, 2024 09:19
(df['cause_of_death'] == 'TB')) &
(df['date_of_death'] >= (self.sim.date - DateOffset(months=1)))
])
direct_deaths_non_hiv = len(df.loc[
Collaborator

These are non-HIV deaths?

Collaborator Author

Sorry! Yes well spotted, have changed that now

self._years_written_to_log += [year]
def write_to_log_prevalence_monthly(self):
Collaborator

Monthly prevalence logs seem reasonable, but bear in mind that some conditions, e.g. malaria can develop and resolve within 1 week. In these cases, we could think about using the clinical counter - which counts episodes of disease. It may not make too much difference but perhaps running a daily then monthly logger and checking if the prevalences vary considerably would be useful.

Collaborator Author

I have made this change below

# Check that the format of the internal storage is as expected.
self.check_multi_index()

log_df_line_by_line(
Collaborator

Somewhere it would be useful to store a definition of the way prevalence is reported for each disease module, e.g. the ALRI logger will record prevalence of pneumonia/other ALRI, the malaria logger should record only clinical/severe cases, COPD is stage 3 and above, etc.

Collaborator Author

Okay, I have added that!

# Create a DataFrame with one row and assign the population size
prevalence_from_each_disease_module = pd.DataFrame({'population': [population_size]})
for disease_module_name in self.module.recognised_modules_names:
if disease_module_name in ['NewbornOutcomes', 'PostnatalSupervisor', 'Mockitis', 'DiseaseThatCausesA',
Collaborator

Are these placeholders left in?

Collaborator Author

Mockitis is now removed from the list, as it is being included in the test as a dummy check of prevalences. The others, however, do not return any prevalences/conditions, so to avoid a log with lots of columns of zeroes, I have them skipped over here.

@@ -755,6 +755,16 @@ def report_daly_values(self):

return health_values.loc[df.is_alive] # returns the series


def report_prevalence(self):
Collaborator

I think this should record clinical and severe cases only. Usually we are interested in the prevalence of symptomatic malaria. Parasite prevalence, which would be ma_is_infected=True is an alternative measure.

Collaborator Author

Okay great, good to know

Collaborator

Thanks for changing this

Collaborator

Just to note also, if reporting point prevalence each month, this could potentially miss some cases of malaria which occur and then resolve within the month. One other option could be to use the property ma_date_symptoms to find all malaria cases who have had onset of symptoms within the last time period. Whichever way you prefer is ok.

Collaborator Author

Oh interesting! I suppose most of the other modules are also point prevalence, so maybe for consistency with that, we keep it this way. But if having more of a period prevalence is more useful to you/in general, happy to change it!
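
The difference between the two measures discussed in this thread can be sketched on a toy population table. This is only an illustration under stated assumptions: `ma_date_symptoms` is the property named above, while `is_symptomatic`, the helper function names and all the data are hypothetical and not taken from the Malaria module.

```python
import pandas as pd


def point_prevalence(df: pd.DataFrame) -> float:
    """Share of living people who are symptomatic at the moment of logging."""
    alive = df[df["is_alive"]]
    return float(alive["is_symptomatic"].mean())


def period_prevalence(df: pd.DataFrame, now: pd.Timestamp, window: pd.DateOffset) -> float:
    """Share of living people who were symptomatic at any time in the window.

    Counts people whose symptoms began inside the window, plus people who are
    still symptomatic now (their episode may have started earlier).
    """
    alive = df[df["is_alive"]]
    # Comparisons against NaT (never symptomatic) evaluate to False.
    recent_onset = alive["ma_date_symptoms"] > (now - window)
    return float((recent_onset | alive["is_symptomatic"]).mean())


# Toy data (invented): person 1 fell ill and recovered within the month.
now = pd.Timestamp("2010-02-01")
df = pd.DataFrame({
    "is_alive": [True, True, True, True, False],
    "is_symptomatic": [True, False, False, False, False],  # hypothetical column
    "ma_date_symptoms": pd.to_datetime(
        ["2010-01-28", "2010-01-05", None, "2009-11-01", "2010-01-20"]
    ),
})

point = point_prevalence(df)
period = period_prevalence(df, now, pd.DateOffset(months=1))
```

In this toy table the monthly point measure sees one of four living people, while the period measure also picks up the person whose episode resolved before the logging date.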

@@ -2314,6 +2314,15 @@ def report_daly_values(self):
disability_series_for_alive_persons = df.loc[df.is_alive, "rt_disability"]
return disability_series_for_alive_persons

def report_prevalence(self):
Collaborator

I think this property rt_road_traffic_inc relates to having been involved in an RTI. The property rt_inj_severity can be none, i.e. no injuries arising. For this, perhaps we should log rt_inj_severity != 'none'?

Collaborator Author

Okay, I think I originally had

df = self.sim.population.props
total_prev = len(
    df[(df['is_alive']) & (df['rt_inj_severity'] != 'none')]
) / len(df[df['is_alive']])
return total_prev

but I think Margherita implied that this would include a consideration of recovery time. Do you think it's still okay to use?

Collaborator

Could possibly select cases using property rt_date_inj to make sure injury occurred in most recent time period!?

Collaborator Author

Oh good idea! I've made that change now
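
A minimal sketch of the selection suggested in this thread, combining `rt_inj_severity != 'none'` with a recency filter on `rt_date_inj`. The property names come from the comments above; the function name, the severity labels other than 'none', the one-month window and the data are assumptions for illustration, and the actual change in the PR may differ.

```python
import pandas as pd


def rti_recent_injury_share(df: pd.DataFrame, now: pd.Timestamp, window: pd.DateOffset) -> float:
    """Share of living people with an injury that occurred within the window."""
    alive = df["is_alive"]
    injured = df["rt_inj_severity"] != "none"
    # NaT (never injured) compares as False, so those rows drop out.
    recent = df["rt_date_inj"] > (now - window)
    return float((alive & injured & recent).sum() / alive.sum())


# Toy data (invented).
now = pd.Timestamp("2010-02-01")
df = pd.DataFrame({
    "is_alive": [True, True, True, True, False],
    "rt_inj_severity": ["none", "mild", "severe", "mild", "severe"],
    "rt_date_inj": pd.to_datetime(
        [None, "2010-01-15", "2009-06-01", "2010-01-25", "2010-01-10"]
    ),
})

share = rti_recent_injury_share(df, now, pd.DateOffset(months=1))
```

Because only injuries from the most recent period are counted, the result behaves more like an incidence proportion than a point prevalence, which matches the question raised in the PR description.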

def report_prevalence(self):
# This returns dataframe that reports on the prevalence of schisto for all individuals
df = self.sim.population.props
is_infected = (df[self.cols_of_infection_status] == 'Non-infected').any()
Collaborator

I think we want to log infection status in ['Low-infection', 'High-infection'].

Collaborator Author

Ah sorry I see now it should have been !=, but I have changed to your way for clarity, thank you!
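
The fix agreed here can be sketched as follows. The status labels are the ones quoted in this thread; the column names, the function name and the data are hypothetical, and the sketch assumes one infection-status column per species.

```python
import pandas as pd

INFECTED_STATES = ["Low-infection", "High-infection"]


def schisto_prevalence(df: pd.DataFrame, cols_of_infection_status: list) -> float:
    """Share of living people infected with at least one species."""
    alive = df[df["is_alive"]]
    # axis=1 reduces across the species columns, giving one flag per person.
    infected_any_species = alive[cols_of_infection_status].isin(INFECTED_STATES).any(axis=1)
    return float(infected_any_species.mean())


# Toy data (invented); column names are placeholders.
cols = ["species_a_infection_status", "species_b_infection_status"]
df = pd.DataFrame({
    "is_alive": [True, True, True, True, False],
    "species_a_infection_status": [
        "Non-infected", "Low-infection", "Non-infected", "High-infection", "High-infection"
    ],
    "species_b_infection_status": [
        "Non-infected", "Non-infected", "Non-infected", "Non-infected", "High-infection"
    ],
})

prevalence = schisto_prevalence(df, cols)

# The expression from the diff, for comparison: without axis=1 it reduces down
# the rows, yielding one flag per column, and it tests for the opposite state.
original_expression = (df[cols] == "Non-infected").any()
```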

@@ -1009,6 +1009,16 @@ def report_daly_values(self):

return health_values.loc[df.is_alive]

def report_prevalence(self):
Collaborator

I think here we should log only active cases, this way we can compare with WHO reports / GBD etc. Also the way that we assign latent cases is not identical to other models, we don't have infections -> latent -> active so we would under-estimate the latent infections. Best to stick to symptomatic active cases only.

Collaborator Author

Okay, that's great to know, thank you!

Collaborator

Thanks for making the change

assert (df.dtypes == orig.dtypes).all()


def find_closest_recording(prevalence, target_date, log_value, column_name, multiply_by_pop):
Collaborator

I'm not sure of the usefulness of finding the closest reported prevalence value. Perhaps more useful tests would be to: check that all registered modules are logging prevalence every month, and the inverse (no prevalence reported if a module is not registered); set the incidence to 0 for one disease and check that the logger reports nothing above 0; and assert that prevalence values for 2010 are within a reasonable range, e.g. test for extreme or unlikely values.

Collaborator Author

We had a long discussion about the test yesterday on our call Part II. We decided to include a dummy disease with which to compare prevalences (this will be in the new test file), and to check whether the prevalence it reports matches what has been reported in its own logging file.

I suppose by doing it the way that I was doing it, I was trying to see if the calculations themselves were working, as well as the general mechanics of logging. But do you think such a test is unnecessary? And that by showing e.g. with a dummy module and/or what you have suggested above, it would suffice?

sim.schedule_event(Healthburden_WriteToLog(self), last_day_of_the_year)
sim.schedule_event(Get_Current_DALYS(self), sim.date + DateOffset(months=1))
if self.parameters['test']:
sim.schedule_event(Get_Current_Prevalence(self), sim.date + DateOffset(months=1))
Collaborator

could we make a choice here for the frequency of logging - if we are interested in very rapidly developing/resolving conditions we could set to daily logger, for broader analyses we could set to annual!?

Collaborator Author

I have included a parameter now that allows us to set the time of logging as either daily, monthly, or yearly
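
One way such a parameter could be turned into an event frequency is a small lookup table. This is a hedged sketch, not the code in the PR: the accepted strings ('day', 'month', 'year'), the table and the helper name are assumptions; only `pd.DateOffset` is a real pandas API.

```python
import pandas as pd

# Hypothetical mapping from the parameter value to a pandas offset.
FREQUENCY_TO_OFFSET = {
    "day": pd.DateOffset(days=1),
    "month": pd.DateOffset(months=1),
    "year": pd.DateOffset(years=1),
}


def offset_for(logging_frequency: str) -> pd.DateOffset:
    """Return the offset for a frequency string, rejecting unknown values."""
    try:
        return FREQUENCY_TO_OFFSET[logging_frequency]
    except KeyError:
        allowed = ", ".join(sorted(FREQUENCY_TO_OFFSET))
        raise ValueError(
            f"Unknown logging frequency {logging_frequency!r}; expected one of: {allowed}"
        ) from None


start = pd.Timestamp("2010-01-01")
next_monthly = start + offset_for("month")
```

Failing loudly on an unrecognised value avoids silently falling back to a default frequency that the analyst did not ask for.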

Collaborator

@tbhallett tbhallett left a comment

just adding here the comment made in-person: I like the way this is being done overall.... and we should add this new method to the Module base class so that it's formally part of the definition of a disease module (in the same way that report_dalys is.)

@RachelMurray-Watson
Collaborator Author

so that it's formally part of the definition of a disease module

Grand! Have that done

@@ -58,7 +59,8 @@ def __init__(self, name=None, resourcefilepath=None):
'Age_Limit_For_YLL': Parameter(
Types.REAL, 'The age up to which deaths are recorded as having induced a lost of life years'),
'gbd_causes_of_disability': Parameter(
Types.LIST, 'List of the strings of causes of disability defined in the GBD data')
Types.LIST, 'List of the strings of causes of disability defined in the GBD data'),
'logging_frequency_prevalence': Parameter(Types.BOOL, 'Set to the frequency at which we want to make calculations of the prevalence logger')
Collaborator

Types.BOOL looks wrong, as it accepts a string?

Collaborator Author

Sorry, yes, that was a hangover from an earlier iteration. Corrected now.

@@ -99,6 +101,7 @@ def initialise_simulation(self, sim):
self.years_life_lost_stacked_time = pd.DataFrame(index=self.multi_index_for_age_and_wealth_and_time)
self.years_life_lost_stacked_age_and_time = pd.DataFrame(index=self.multi_index_for_age_and_wealth_and_time)
self.years_lived_with_disability = pd.DataFrame(index=self.multi_index_for_age_and_wealth_and_time)
self.prevalence_of_diseases = pd.DataFrame(index=year_index)
Collaborator

If we want it to take different frequencies (e.g. month, year, etc.) then we'd need a different index.

).groupby(level=1).sum() \
.assign(year=date_of_death.year) \
.set_index(['year'], append=True)['person_years'] \
.pipe(_format_for_multi_index)
Collaborator

these changes (and those above) look like they're just formatting changes. So we'll roll these back before merging.

Comment on lines 126 to 127
sim.schedule_event(Get_Current_Prevalence(self), sim.date + DateOffset(days=0))
sim.schedule_event(Healthburden_WriteToLog_Prevalences(self), sim.date + DateOffset(days=0))
Collaborator

if these two events happen at the same frequency and we want to guarantee the order they happen in, I think they should be ONE event.
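
The point about ordering can be illustrated without the simulation framework. In the sketch below, a single event object computes the snapshot and writes it in one `apply` call, so the write can never run before, or at a different frequency from, the calculation. The class name, the callables and the numbers are hypothetical stand-ins and not the framework's event API.

```python
from typing import Callable, Dict, List


class PrevalenceSnapshotEvent:
    """Hypothetical stand-in for one regular event that both computes and logs."""

    def __init__(
        self,
        compute: Callable[[], Dict[str, float]],
        write: Callable[[Dict[str, float]], None],
    ) -> None:
        self._compute = compute
        self._write = write

    def apply(self) -> None:
        snapshot = self._compute()  # step 1: calculate current prevalence
        self._write(snapshot)       # step 2: log it, always after step 1


# Toy usage (invented numbers): the log list stands in for the logger.
log: List[Dict[str, float]] = []
state = {"infected": 10, "alive": 100}

event = PrevalenceSnapshotEvent(
    compute=lambda: {"Mockitis": state["infected"] / state["alive"]},
    write=log.append,
)

event.apply()
state["infected"] = 25
event.apply()
```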

Comment on lines 546 to 552
# 5) Log the prevalence of each disease
log_df_line_by_line(
key='prevalence_of_diseases',
description='Prevalence of each disease.',
df=self.prevalence_of_diseases,
force_cols=self.recognised_modules_names,
)
Collaborator

is this unintentionally left-in? The function is defined below (embedded in write-to-log-prevalence)

disease_module_name)

# Add the prevalence data as a new column to the DataFrame
prevalence_from_each_disease_module[column_name] = prevalence_from_disease_module.iloc[:, 0]
Collaborator

In most cases, the module is returning a single number; sometimes a set of numbers. Could this be defined by a dict instead, for simplicity?

Collaborator Author

I think to use the log_df_line_by_line, it needs to be a dataframe, but I may have misinterpreted

Collaborator Author

EDIT: are all now dictionaries

end_date = Date(2012, 1, 1)

popsize = 1000
seed = 42
Collaborator

seed can be a 'magical' kwarg to a function beginning with test, and pytest will populate it for you (same as tmpdir).

prevalence_mockitis_log["TotalInf"][j])

if target_date <= max_date_in_prevalence:
find_closest_recording(prevalence, target_date, regular_log_value, 'Mockitis', True)
Collaborator

Do we need this extra step of finding the closest recording? For this dummy we can set the logging frequency to be the same, so that we can do a straightforward comparison, can't we?

Collaborator Author

I think mockitis logs every 6 months, so rather than setting a more detailed frequency for logging, I have kept the "closest match" date here.

Collaborator Author

Edited: new dummydisease in mockitis file. Set the logging frequency to be the same, so no need for this function

population_size = len(self.sim.population.props[self.sim.population.props['is_alive']])

# Create a DataFrame with one row and assign the population size
prevalence_from_each_disease_module = pd.DataFrame({'population': [population_size]})
Collaborator

I'm wondering why this is dataframe

Collaborator Author

I initiated it as a dataframe that I could then populate. I think for the logging line-by-line function it needs to be a dataframe (like the DALYs), but I may have misinterpreted.

Collaborator Author

EDIT: Have now changed so that the prevalence is now collected as dictionaries
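
A rough sketch of what dictionary-based collection could look like, assuming each module's `report_prevalence` returns a dict mapping a condition name to a prevalence. The stand-in classes, the collector function and the values are hypothetical; the real modules and the final data layout in the PR may differ.

```python
from typing import Dict, Mapping


class FakeMalaria:
    """Stand-in module reporting a single value (invented number)."""

    def report_prevalence(self) -> Dict[str, float]:
        return {"Malaria": 0.12}


class FakeCardioMetabolic:
    """Stand-in module reporting several conditions (invented numbers)."""

    def report_prevalence(self) -> Dict[str, float]:
        return {"diabetes": 0.05, "hypertension": 0.2}


def collect_prevalence(modules: Mapping[str, object], population_size: int) -> Dict[str, float]:
    """Merge every module's reported prevalences into one flat record."""
    record: Dict[str, float] = {"population": population_size}
    for name, module in modules.items():
        reported = module.report_prevalence()
        clash = set(reported) & set(record)
        # Guard against two modules reporting under the same key.
        assert not clash, f"{name} reports keys already present: {sorted(clash)}"
        record.update(reported)
    return record


record = collect_prevalence(
    {"Malaria": FakeMalaria(), "CardioMetabolicDisorders": FakeCardioMetabolic()},
    population_size=1000,
)
```

A flat dict handles the single-value and multi-value cases the same way, which is the simplification the reviewer asked about.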

"""

def __init__(self, module):
super().__init__(module, frequency=DateOffset(months=1))
Collaborator

I think it's very murky if we allow this frequency to be different to the logging frequency. That's why I think we should combine these two events into one.

@RachelMurray-Watson RachelMurray-Watson self-assigned this Sep 5, 2024
@RachelMurray-Watson
Collaborator Author

just adding here the comment made in-person: I like the way this is being done overall.... and we should add this new method to the Module base class so that it's formally part of the definition of a disease module (in the same way that report_dalys is.)

(Based on conversation yesterday) - changed so that it is no longer in base class (as not all modules are disease modules), but there is an assertion checking to see that if something uses the healthburden module, it must have the report_prevalence function
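
The kind of check described here might look like the following. It is a sketch only: the function name, the stand-in classes and the error message are assumptions, and where the assertion lives in the real code is not shown in this thread.

```python
from typing import Mapping


def assert_modules_report_prevalence(modules: Mapping[str, object]) -> None:
    """Fail early if any module lacks a callable report_prevalence method."""
    missing = [
        name
        for name, module in modules.items()
        if not callable(getattr(module, "report_prevalence", None))
    ]
    assert not missing, (
        "Modules used with the health burden logger must define "
        f"report_prevalence(); missing in: {missing}"
    )


class WithMethod:
    """Stand-in disease module that satisfies the check."""

    def report_prevalence(self):
        return {"Mockitis": 0.1}


class WithoutMethod:
    """Stand-in module that does not define report_prevalence."""


assert_modules_report_prevalence({"Mockitis": WithMethod()})
```

Checking with `getattr` keeps the requirement out of the base class, so modules that are not disease modules are unaffected unless they are actually passed to the check.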

…terest, as previously was only looking at module classes and was therefore missing things like polling events
…e of below-1 mortality. Life table calculations assume that 50% of the age group is survived; from this analysis, looks like >60% of children die before 6 months, violating that assumption. Major driver could be encephalopathy?
…s a big decrease in demand on the healthcare system
…rld and added in the HTM scale up.

Working on lifestyle examinations
# Conflicts:
#	src/scripts/get_properties/properties_graph.py