PR to record the prevalence of disease/condition, still births, neonatal death, maternal mortality #1455
src/tlo/methods/demography.py
Outdated
(df['cause_of_death'] == 'TB')) &
(df['date_of_death'] >= (self.sim.date - DateOffset(months=1)))
])
direct_deaths_non_hiv = len(df.loc[
These are non-HIV deaths?
Sorry! Yes, well spotted. I have changed that now.
src/tlo/methods/healthburden.py
Outdated
self._years_written_to_log += [year]
def write_to_log_prevalence_monthly(self):
Monthly prevalence logs seem reasonable, but bear in mind that some conditions, e.g. malaria, can develop and resolve within 1 week. In these cases, we could think about using the clinical counter, which counts episodes of disease. It may not make too much difference, but it would be useful to run a daily and then a monthly logger and check whether the prevalences vary considerably.
I have made this change below
src/tlo/methods/healthburden.py
Outdated
# Check that the format of the internal storage is as expected.
self.check_multi_index()

log_df_line_by_line(
Somewhere it would be useful to store a definition of the way prevalence is reported for each disease module, e.g. the ALRI logger will record prevalence of pneumonia/other ALRI, the malaria logger should record only clinical/severe cases, COPD is stage 3 and above, etc.
Okay, I have added that!
src/tlo/methods/healthburden.py
Outdated
# Create a DataFrame with one row and assign the population size
prevalence_from_each_disease_module = pd.DataFrame({'population': [population_size]})
for disease_module_name in self.module.recognised_modules_names:
if disease_module_name in ['NewbornOutcomes', 'PostnatalSupervisor', 'Mockitis', 'DiseaseThatCausesA',
Are these placeholders left in?
Mockitis is now removed from the list, as it is being included in the test as a dummy check of prevalences. The others, however, do not return any prevalences/conditions, so to avoid a log with lots of columns of zeroes, they are skipped over here.
@@ -755,6 +755,16 @@ def report_daly_values(self):

return health_values.loc[df.is_alive] # returns the series


def report_prevalence(self):
I think this should record clinical and severe cases only. Usually we are interested in the prevalence of symptomatic malaria. Parasite prevalence, which would be ma_is_infected=True is an alternative measure.
Okay great, good to know
Thanks for changing this
Just to note also, if reporting point prevalence each month, this could potentially miss some cases of malaria which occur and then resolve within the month. One other option could be to use the property ma_date_symptoms to find all malaria cases who have had onset of symptoms within the last time period. Whichever way you prefer is ok.
Oh interesting! I suppose most of the other modules are also point prevalence, so maybe for consistency with that, we keep it this way. But if having more of a period prevalence is more useful to you/in general, happy to change it!
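To make the point-versus-period distinction in this thread concrete, here is a minimal sketch on a toy DataFrame. Only `ma_date_symptoms` and `is_alive` come from the discussion; the `has_clinical_malaria` flag and all the data are hypothetical, and the period measure shown is just one possible reading of the suggestion, not the module's actual code.

```python
import pandas as pd

# Toy population. `has_clinical_malaria` is a hypothetical stand-in for
# whatever property marks a current clinical/severe case in the module.
df = pd.DataFrame({
    "is_alive": [True, True, True, True, False],
    "has_clinical_malaria": [True, False, False, False, True],
    "ma_date_symptoms": pd.to_datetime(
        ["2010-01-28", "2010-01-10", None, "2009-11-02", "2010-01-20"]
    ),
})

now = pd.Timestamp("2010-02-01")
window_start = now - pd.DateOffset(months=1)

alive = df["is_alive"]
n_alive = alive.sum()

# Point prevalence: alive and symptomatic on the logging date.
point_prev = (alive & df["has_clinical_malaria"]).sum() / n_alive

# Period-style measure: alive and either currently symptomatic or with
# symptom onset inside the last logging window. Missing dates compare False.
recent_onset = df["ma_date_symptoms"].between(window_start, now)
period_prev = (alive & (recent_onset | df["has_clinical_malaria"])).sum() / n_alive

print(point_prev, period_prev)
```

In this toy data the second person had an onset on 10 January and has already recovered, so they are counted by the period-style measure but missed by the monthly point measure.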
@@ -2314,6 +2314,15 @@ def report_daly_values(self):
disability_series_for_alive_persons = df.loc[df.is_alive, "rt_disability"]
return disability_series_for_alive_persons

def report_prevalence(self):
I think this property rt_road_traffic_inc relates to having been involved in an RTI. The property rt_inj_severity can be 'none', i.e. no injuries arising. For this, perhaps we should log rt_inj_severity != 'none'?
Okay, I think I originally had:
df = self.sim.population.props
total_prev = len(
    df[(df['is_alive']) & (df['rt_inj_severity'] != 'none')]
) / len(df[df['is_alive']])
return total_prev
but I think Margherita implied that this would include a consideration of recovery time. Do you think it's still okay to use?
Could possibly select cases using property rt_date_inj to make sure injury occurred in most recent time period!?
Oh good idea! I've made that change now
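For reference, the combined filter discussed in this thread (injured, and injured within the most recent period) might look roughly like the sketch below. The property names `rt_inj_severity` and `rt_date_inj` are taken from the comments above; the function name, the one-month default window and the toy data are assumptions for illustration only.

```python
import pandas as pd

def rti_recent_injury_proportion(df, now, window=pd.DateOffset(months=1)):
    """Proportion of alive people with an injury dated within the last window.

    Hypothetical helper, not the method in the PR.
    """
    alive = df["is_alive"]
    injured = df["rt_inj_severity"] != "none"
    # Missing injury dates (NaT) compare False, so they are excluded.
    recent = df["rt_date_inj"] >= (now - window)
    return (alive & injured & recent).sum() / alive.sum()

# Toy population: one uninjured, one recently injured, one injured long ago,
# and one recently injured person who has died.
df = pd.DataFrame({
    "is_alive": [True, True, True, False],
    "rt_inj_severity": ["none", "mild", "severe", "severe"],
    "rt_date_inj": pd.to_datetime([None, "2010-01-20", "2009-06-01", "2010-01-25"]),
})

prop = rti_recent_injury_proportion(df, pd.Timestamp("2010-02-01"))
print(prop)
```

Restricting to recent injuries makes the figure behave more like an incidence over the logging window than a prevalence, which is the trade-off raised in the PR description.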
src/tlo/methods/schisto.py
Outdated
def report_prevalence(self):
# This returns dataframe that reports on the prevalence of schisto for all individuals
df = self.sim.population.props
is_infected = (df[self.cols_of_infection_status] == 'Non-infected').any()
I think we want to log infection status is either ['Low-infection', 'High-infection']
Ah sorry I see now it should have been !=, but I have changed to your way for clarity, thank you!
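A small sketch of the explicit membership check suggested here, on toy data. The two status column names are hypothetical placeholders for whatever `cols_of_infection_status` contains; the status labels are the ones quoted in the review. Note that `.any(axis=1)` is needed to get one flag per person; a bare `.any()` reduces over rows and returns one value per column.

```python
import pandas as pd

# Hypothetical column names, one per species, standing in for
# self.cols_of_infection_status.
cols_of_infection_status = ["status_species_a", "status_species_b"]

df = pd.DataFrame({
    "is_alive": [True, True, True, True, False],
    "status_species_a": [
        "Non-infected", "Low-infection", "Non-infected", "High-infection", "High-infection",
    ],
    "status_species_b": [
        "Non-infected", "Non-infected", "High-infection", "High-infection", "Non-infected",
    ],
})

# A person counts as infected if any species column holds an infected status.
is_infected = (
    df[cols_of_infection_status]
    .isin(["Low-infection", "High-infection"])
    .any(axis=1)
)

prevalence = (df["is_alive"] & is_infected).sum() / df["is_alive"].sum()
print(prevalence)
```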
@@ -1009,6 +1009,16 @@ def report_daly_values(self):

return health_values.loc[df.is_alive]


def report_prevalence(self):
I think here we should log only active cases, this way we can compare with WHO reports / GBD etc. Also the way that we assign latent cases is not identical to other models, we don't have infections -> latent -> active so we would under-estimate the latent infections. Best to stick to symptomatic active cases only.
Okay, that's great to know, thank you!
Thanks for making the change
assert (df.dtypes == orig.dtypes).all()


def find_closest_recording(prevalence, target_date, log_value, column_name, multiply_by_pop):
I'm not sure of the usefulness of finding the closest reported prevalence value. Would a more useful test be to check that all registered modules are logging prevalence every month, and the inverse (no prevalence reported if a module is not registered)? Other options: set the incidence to 0 for one disease and check that the logger reports nothing above 0, or assert that prevalence values for 2010 are within a reasonable range, i.e. test for extreme or unlikely values.
We had a long discussion about the test yesterday on our call Part II. We decided to include a dummy disease with which to compare prevalences (this will be in the new test file), and to see if the prevalence it reports matches what has been reported in its own logging file.
I suppose by doing it the way that I was doing it, I was trying to see if the calculations themselves were working, as well as the general mechanics of logging. But do you think such a test is unnecessary? And that by showing e.g. with a dummy module and/or what you have suggested above, it would suffice?
src/tlo/methods/healthburden.py
Outdated
sim.schedule_event(Healthburden_WriteToLog(self), last_day_of_the_year)
sim.schedule_event(Get_Current_DALYS(self), sim.date + DateOffset(months=1))
if self.parameters['test']:
sim.schedule_event(Get_Current_Prevalence(self), sim.date + DateOffset(months=1))
Could we make the frequency of logging a choice here? If we are interested in very rapidly developing/resolving conditions we could set a daily logger; for broader analyses we could set it to annual.
I have included a parameter now that allows us to set the time of logging as either daily, monthly, or yearly
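One way such a frequency parameter could be turned into a scheduling offset is sketched below. The accepted strings (`"day"`, `"month"`, `"year"`) and the helper name are assumptions for illustration; the thread only says the parameter allows daily, monthly or yearly logging.

```python
import pandas as pd

# Hypothetical mapping from the parameter's string value to a date offset.
FREQUENCY_TO_OFFSET = {
    "day": pd.DateOffset(days=1),
    "month": pd.DateOffset(months=1),
    "year": pd.DateOffset(years=1),
}

def offset_for(logging_frequency: str) -> pd.DateOffset:
    """Return the offset for a frequency string, rejecting unknown values."""
    try:
        return FREQUENCY_TO_OFFSET[logging_frequency]
    except KeyError:
        raise ValueError(
            f"logging_frequency must be one of {sorted(FREQUENCY_TO_OFFSET)}, "
            f"got {logging_frequency!r}"
        ) from None

start = pd.Timestamp("2010-01-01")
next_daily = start + offset_for("day")
next_monthly = start + offset_for("month")
next_yearly = start + offset_for("year")
print(next_daily, next_monthly, next_yearly)
```

Validating the string up front means a typo in the parameter fails at initialisation rather than silently falling back to a default frequency.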
Just adding here the comment made in person: I like the way this is being done overall, and we should add this new method to the Module base class so that it's formally part of the definition of a disease module (in the same way that report_dalys is).
Grand! Have that done.
src/tlo/methods/healthburden.py
Outdated
@@ -58,7 +59,8 @@ def __init__(self, name=None, resourcefilepath=None):
'Age_Limit_For_YLL': Parameter(
Types.REAL, 'The age up to which deaths are recorded as having induced a lost of life years'),
'gbd_causes_of_disability': Parameter(
Types.LIST, 'List of the strings of causes of disability defined in the GBD data')
Types.LIST, 'List of the strings of causes of disability defined in the GBD data'),
'logging_frequency_prevalence': Parameter(Types.BOOL, 'Set to the frequency at which we want to make calculations of the prevalence logger')
Types.BOOL looks wrong, as it accepts a string?
Sorry, yes, that was a hangover from an earlier iteration. Corrected now.
src/tlo/methods/healthburden.py
Outdated
@@ -99,6 +101,7 @@ def initialise_simulation(self, sim):
self.years_life_lost_stacked_time = pd.DataFrame(index=self.multi_index_for_age_and_wealth_and_time)
self.years_life_lost_stacked_age_and_time = pd.DataFrame(index=self.multi_index_for_age_and_wealth_and_time)
self.years_lived_with_disability = pd.DataFrame(index=self.multi_index_for_age_and_wealth_and_time)
self.prevalence_of_diseases = pd.DataFrame(index=year_index)
If we want it to take different frequencies (e.g. month, year, etc.) then we'd need a different index.
).groupby(level=1).sum() \
.assign(year=date_of_death.year) \
.set_index(['year'], append=True)['person_years'] \
.pipe(_format_for_multi_index)
these changes (and those above) look like they're just formatting changes. So we'll roll these back before merging.
src/tlo/methods/healthburden.py
Outdated
sim.schedule_event(Get_Current_Prevalence(self), sim.date + DateOffset(days=0))
sim.schedule_event(Healthburden_WriteToLog_Prevalences(self), sim.date + DateOffset(days=0))
if these two events happen at the same frequency and we want to guarantee the order they happen in, I think they should be ONE event.
src/tlo/methods/healthburden.py
Outdated
# 5) Log the prevalence of each disease
log_df_line_by_line(
key='prevalence_of_diseases',
description='Prevalence of each disease.',
df=self.prevalence_of_diseases,
force_cols=self.recognised_modules_names,
)
Is this unintentionally left in? The function is defined below (embedded in write-to-log-prevalence).
src/tlo/methods/healthburden.py
Outdated
disease_module_name)

# Add the prevalence data as a new column to the DataFrame
prevalence_from_each_disease_module[column_name] = prevalence_from_disease_module.iloc[:, 0]
In most cases, the module is returning a single number; sometimes a set of numbers. Could this be defined by a dict instead, for simplicity?
I think to use the log_df_line_by_line, it needs to be a dataframe, but I may have misinterpreted
EDIT: these are all now dictionaries.
end_date = Date(2012, 1, 1)

popsize = 1000
seed = 42
seed can be a 'magical' kwarg to a function beginning with test, and pytest will populate it for you (same as tmpdir).
prevalence_mockitis_log["TotalInf"][j])

if target_date <= max_date_in_prevalence:
find_closest_recording(prevalence, target_date, regular_log_value, 'Mockitis', True)
Do we need this extra step of finding the closest recording? For this dummy we can set the logging frequency to be the same, so that we can do a straightforward comparison, can't we?
I think mockitis logs every 6 months, so rather than setting a more detailed frequency for logging, I have kept the "closest match" date here.
Edited: new dummydisease in mockitis file. Set the logging frequency to be the same, so no need for this function
src/tlo/methods/healthburden.py
Outdated
population_size = len(self.sim.population.props[self.sim.population.props['is_alive']])

# Create a DataFrame with one row and assign the population size
prevalence_from_each_disease_module = pd.DataFrame({'population': [population_size]})
I'm wondering why this is a dataframe.
I initialised it as a dataframe that I could then populate. I think for the logging line-by-line function it needs to be a dataframe (like the DALYs), but I may have misinterpreted.
EDIT: Have now changed so that the prevalence is now collected as dictionaries
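As an illustration of the dictionary-based approach described here, the sketch below merges per-module dictionaries into a single flat record. The function name, the callable-based interface and the numeric values are all hypothetical toy choices; only the idea that each module returns a dict (a single number for most, several for some) comes from the thread.

```python
from typing import Callable, Dict

def collect_prevalence(
    reporters: Dict[str, Callable[[], Dict[str, float]]],
    population_size: int,
) -> Dict[str, float]:
    """Merge each module's prevalence dict into one flat record for logging."""
    record: Dict[str, float] = {"population": population_size}
    for module_name, report in reporters.items():
        for condition, value in report().items():
            # Guard against two modules reporting under the same key.
            if condition in record:
                raise KeyError(f"{module_name} reports duplicate key {condition!r}")
            record[condition] = value
    return record

# Toy reporters standing in for module.report_prevalence; values are made up.
reporters = {
    "Malaria": lambda: {"Malaria": 0.12},
    "CardioMetabolicDisorders": lambda: {"diabetes": 0.03, "hypertension": 0.2},
}

record = collect_prevalence(reporters, population_size=1000)
print(record)
```

A flat dict like this handles single-value and multi-value modules uniformly, which is the simplification the reviewer was asking about.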
src/tlo/methods/healthburden.py
Outdated
""" | ||

def __init__(self, module):
super().__init__(module, frequency=DateOffset(months=1))
I think it's very murky if we allow this frequency to be different to the logging frequency. That's why I think we should combine these two events into one.
(Based on conversation yesterday) - changed so that it is no longer in the base class (as not all modules are disease modules), but there is an assertion checking that anything that uses the healthburden module has the report_prevalence function.
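The kind of assertion described in the last comment could be sketched as follows. The class and function names are hypothetical; the only detail taken from the thread is that modules using the healthburden module must provide a `report_prevalence` function.

```python
class ModuleWithPrevalence:
    """Hypothetical disease module that implements the reporting hook."""

    def report_prevalence(self):
        return {"ModuleWithPrevalence": 0.0}


class ModuleWithoutPrevalence:
    """Hypothetical module that forgot to implement the hook."""


def check_modules_report_prevalence(modules):
    """Assert that every module exposes a callable report_prevalence."""
    missing = [
        type(m).__name__
        for m in modules
        if not callable(getattr(m, "report_prevalence", None))
    ]
    assert not missing, f"Modules missing report_prevalence: {missing}"


# Passes silently when every module implements the method.
check_modules_report_prevalence([ModuleWithPrevalence()])
```

Checking at registration time, rather than putting the method on the base class, keeps the requirement off modules that are not disease modules.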
…l individuals. Based solely on "gi_has_diarrhoea", not dehydration, pathogen, etc.
…culated as they are for the "proportion_of_something_in_a_groupby_ready_for_logging", but across everyone and not over age/sex
…culated as they are for the "proportion_of_something_in_a_groupby_ready_for_logging", but across everyone and not over age/sex
… returned as all 0s.
…terest, as previously was only looking at module classes and was therefore missing things like polling events
…e of below-1 mortality. Life table calculations assume that 50% of the age group is survived; from this analysis, looks like >60% of children die before 6 months, violating that assumption. Major driver could be encephalopathy?
…g. Unsure it did much
…s a big decrease in demand on the healthcare system
…rld and added in the HTM scale up. Working on lifestyle examinations
isort
# Conflicts: # src/scripts/get_properties/properties_graph.py
…into rmw/log_prevalence_all_disease
The point prevalence is recorded for a number of modules and conditions within modules: Alri, BladderCancer, BreastCancer, CardioMetabolicDisorders (chronic_ischemic_hd, chronic_kidney_disease, chronic_lower_back_pain, diabetes, hypertension), COPD, Depression, Diarrhoea, Epilepsy, Hiv, Labor (Intrapartum stillbirth), Malaria, Measles, NewbornOutcomes, OesophagealCancer, OtherAdultCancer, PostnatalSupervisor, PregnancySupervisor (Antenatal stillbirth), ProstateCancer, RTI, Schisto, TB, Demography (maternal_deaths, newborn_deaths).
Additional questions:
- Is it okay to calculate the prevalence of diarrhoea? It is not really a disease in its own right, more of a symptom.
- Some modules (e.g. RTI) may be more accurately described by calculating incidence rather than prevalence. Is that useful/okay? Or should it be skipped?
Other notes:
- COPD is defined as ch_lung_function > 3.
- Have not included events in the CardioMetabolic module (ever_heart_attack and ever_stroke), as this would be a cumulative incidence among living people who have had such events.