A Lifestyle Related Disease Prediction Framework Based on Missing Value Imputation and Stacking Ensemble Method
Summary:
Numerous epidemiological studies in industrialized countries have concluded that many chronic non-communicable diseases are caused by lifestyle-related factors; these conditions are called lifestyle-related diseases (LRDs). They include obesity, high blood pressure, coronary heart disease and other cardiovascular diseases, stroke and other cerebrovascular diseases, diabetes, and several malignant tumors. All of these conditions pose a major threat to people's lives and health and are difficult to treat with current medical technology.
In this context, the prevention of lifestyle-related diseases is extremely important. Disease prediction facilitates early detection, which improves the chances of positive health outcomes. This study therefore proposes a lifestyle-related disease prediction framework based on missing value imputation and a stacking ensemble method. The application of information technology in the medical field produces large amounts of medical data. However, because participants withdraw early or refuse to respond, medical data contain many missing values. We propose an imputation method for imbalanced, mixed-type data, called SncALWRFI, which is based on the SMOTE-NC oversampling technique and the ALWRF method. Bayesian optimization and cross-validation are used to search for optimal parameters. In the missing value imputation experiments, SncALWRFI achieves the best imputation accuracy and imputes effectively on public datasets that are imbalanced and of mixed type.
Prediction performance is easily degraded by noise, so a strategy for mitigating it is needed. Noise may come from real patients and cannot be removed directly. Ensemble approaches, however, are an effective way to reduce variance, bias, and noise.
Therefore, to improve prediction performance for lifestyle-related diseases, we employ stacking ensemble technology in our study. To maximize the diversity and the accuracy of the ensemble models simultaneously, we propose a Multi-objective Iterative Model Selection (MoItMS) algorithm. Data were obtained from the National Health and Nutrition Examination Survey from 2007 to 2018. Our study used an imbalanced data set of 11,341 patients, of whom 67.16% were non-hypertensive and 32.84% were hypertensive. The results indicate a sensitivity of 51.41%, a specificity of 70.48%, an accuracy of 76.62%, and an AUC (area under the ROC curve) of 0.84, outperforming 12 individual and ensemble models. The proposed ensemble model can be used in applications that help population health management programs identify patients at high risk of developing hypertension.
The architecture we propose for LRD prediction has three main components: the missing value module, the feature selection module, and the disease prediction module. Because data sets on lifestyle-related diseases contain many missing values, the missing value module handles them with a combination of deletion and imputation. Because different lifestyle-related diseases have different relevant features, the feature selection module uses machine learning-based feature selection to find the key features for each disease. Finally, we apply the proposed prediction framework to a scenario from a Chinese hospital. The experimental findings indicate that the proposed framework can also improve LRD prevention performance.
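For readers less familiar with the evaluation measures named in this summary (sensitivity, specificity, accuracy, and AUC), the following is a minimal sketch of how they are conventionally computed for a binary classifier. It is not the study's evaluation code: the names `ConfusionCounts` and `auc_from_scores`, the counts, and the scores are hypothetical and chosen only for illustration, with "hypertensive" assumed to be the positive class.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container).

    The positive class is assumed to be 'hypertensive'.
    """
    tp: int  # positives predicted positive
    fn: int  # positives predicted negative
    tn: int  # negatives predicted negative
    fp: int  # negatives predicted positive


def sensitivity(c: ConfusionCounts) -> float:
    """True positive rate: share of actual positives that are detected."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """True negative rate: share of actual negatives that are cleared."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all cases that are classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a randomly chosen positive receives a
    higher score than a randomly chosen negative (ties count as one half).

    This pairwise form is quadratic in the sample size; it is meant for
    illustration, not for large data sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative, made-up counts and scores (not taken from the study).
example = ConfusionCounts(tp=30, fn=10, tn=50, fp=10)
example_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
example_labels = [1, 0, 1, 0, 0]

print(f"sensitivity = {sensitivity(example):.4f}")
print(f"specificity = {specificity(example):.4f}")
print(f"accuracy    = {accuracy(example):.4f}")
print(f"AUC         = {auc_from_scores(example_scores, example_labels):.4f}")
```

On imbalanced data such as the hypertension data set described here, accuracy alone can be misleading, which is one reason sensitivity, specificity, and AUC are usually reported alongside it.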
Keywords: Lifestyle related diseases, Prediction, Machine Learning, Missing values, Stacking ensemble