Improving Prediction Accuracy Based On Optimized Random Forest Model with Weighted Sampling for Regression Trees

S. Bharathidason; C. Jothi Venkataeswaran

doi:https://doi.org/10.14445/22312803/IJCTT-V21P105

Research Article | Open Access

Volume 21 | Number 1 | Year 2015 | Article Id. IJCTT-V21P105

Citation :

S. Bharathidason, C. Jothi Venkataeswaran, "Improving Prediction Accuracy Based On Optimized Random Forest Model with Weighted Sampling for Regression Trees," International Journal of Computer Trends and Technology (IJCTT), vol. 21, no. 1, pp. 23-28, 2015. Crossref, https://doi.org/10.14445/22312803/IJCTT-V21P105

Abstract

Random Forest (RF) is a supervised ensemble machine learning technique used for regression and classification problems. Random forest algorithms typically use simple random sampling of observations when building their decision trees. Under simple random sampling, noisy and outlier observations can be selected during tree construction, which leads to poor ensemble predictions. Handling noise and outliers appropriately is an important issue in data mining. This paper aims to optimize sample selection through probability proportional to size sampling (weighted sampling), in which noisy and outlier data points are down-weighted, to improve prediction performance by minimizing the model's error rate. Experimental results show that weighted sampling further reduces the prediction error of the random forest.
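
The weighted sampling idea described above can be illustrated with a minimal sketch. The abstract does not state how the weights are derived, so the rule used here (full weight within a few median absolute deviations of the median target value, decaying weight beyond that) is purely an assumption for illustration, as are the function names `outlier_weights` and `weighted_bootstrap` and the toy data. Each resulting index sample would then be used to grow one regression tree in place of a simple random bootstrap sample.

```python
import random
import statistics


def outlier_weights(targets, c=3.0):
    """Hypothetical weighting rule (not taken from the paper):
    weight 1.0 within c MADs of the median, c / z beyond that."""
    med = statistics.median(targets)
    deviations = [abs(y - med) for y in targets]
    # Fall back to 1.0 if the MAD is zero to avoid division by zero.
    mad = statistics.median(deviations) or 1.0
    weights = []
    for dev in deviations:
        z = dev / mad
        weights.append(1.0 if z <= c else c / z)
    return weights


def weighted_bootstrap(n_rows, weights, rng):
    """Draw n_rows indices with replacement, with selection
    probability proportional to each row's weight."""
    return rng.choices(range(n_rows), weights=weights, k=n_rows)


# Toy target values; the 95.0 entry plays the role of an outlier.
targets = [21.0, 22.5, 20.8, 23.1, 21.7, 22.0, 95.0, 21.2]
weights = outlier_weights(targets)

rng = random.Random(0)
# One index sample per tree; five trees here for illustration.
samples = [weighted_bootstrap(len(targets), weights, rng) for _ in range(5)]
```

With this rule the outlying observation is not removed, only made much less likely to be drawn, so it can still appear in some trees while having less influence on the ensemble average.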

Keywords

Random Forest, Weighted sampling, Decision trees, Noisy data, Outlier.
