Bibliography

Adams, N. M. and Hand, D. J.. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32:1139–1147, 1999.
Aha, D.. Generalizing from case studies: A case study. In Proceedings of the 9th International Workshop on Machine Learning (ICML '92), pp. 1–10. Morgan Kaufmann, San Mateo, CA, 1992.
Alaiz-Rodríguez, R. and Japkowicz, N.. Assessing the impact of changing environments on classifier performance. In Proceedings of the 21st Canadian Conference in Artificial Intelligence (AI 2008), Springer, New York, 2008.
Alaiz-Rodríguez, R., Japkowicz, N., and Tischer, P.. Visualizing classifier performance on different domains. In Proceedings of the 2008 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '08), pp. 3–10. IEEE Computer Society, Washington, D.C., 2008.
Ali, S. and Smith, K. A.. Kernel width selection for SVM classification: A meta-learning approach. International Journal of Data Warehousing and Mining, 1:78–97, 2006.
Alpaydın, E.. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11:1885–1892, 1999.
Andersson, A., Davidsson, P., and Linden, J.. Measure-based classifier performance evaluation. Pattern Recognition Letters, 20:1165–1173, 1999.
Armstrong, J. S.. Significance tests harm progress in forecasting. International Journal of Forecasting, 23:321–327, 2007.
Asuncion, A. and Newman, D. J.. UCI machine learning repository. University of California, Irvine, School of Information and Computer Science, 2007. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bailey, T. L. and Elkan, C.. Estimating the accuracy of learned concepts. In Proceedings of the 1993 International Joint Conference on Artificial Intelligence, pp. 895–900. Morgan Kaufmann, San Mateo, CA, 1993.
Bay, S. D., Kibler, D., Pazzani, M. J., and Smyth, P.. The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explorations, 2(2):81–85, December 2000.
Bellinger, C., Lalonde, J., Floyd, M. W., Mallur, V., Elkanzi, E., Ghazi, D., He, J., Mouttham, A., Scaiano, M., Wehbe, E., and Japkowicz, N.. An evaluation of the value added by informative metrics. In Proceedings of the Fourth Workshop on Evaluation Methods for Machine Learning, 2009.
Bennett, E. M., Alpert, R., and Goldstein, A. C.. Communications through limited response questioning. Public Opinion Quarterly, 18:303–308, 1954.
Berry, K. J. and Mielke, P. W. Jr.. A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48:921–933, 1988.
Blum, A., Kalai, A., and Langford, J.. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 203–208. Association for Computing Machinery, New York, 1999. doi: http://doi.acm.org/10.1145/307400.307439.
Bouckaert, R. R.. Choosing between two learning algorithms based on calibrated tests. In Fawcett, T. and Mishra, N., editors, Proceedings of the 20th International Conference on Machine Learning. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
Bouckaert, R. R.. Estimating replicability of classifier learning experiments. In Brodley, C., editor, Proceedings of the 21st International Conference on Machine Learning. American Association for Artificial Intelligence, Menlo Park, CA, 2004.
Bousquet, O., Boucheron, S., and Lugosi, G.. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pp. 169–207. Vol. 3176 of Springer Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, 2004.
Bradford, J. P., Kunz, C., Kohavi, R., Brunk, C., and Brodley, C. E.. Pruning decision trees with misclassification costs. In Proceedings of the European Conference on Machine Learning, pp. 131–136. Springer, Berlin, 1998.
Bradley, P.. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J.. Classification and Regression Trees. Chapman & Hall, CRC, 1984.
Brodley, C. E.. Addressing the selective superiority problem: Automatic algorithm/model class selection. In Proceedings of the 10th International Conference on Machine Learning, pp. 17–24, Morgan Kaufmann, San Mateo, CA, 1993.
Buja, A., Stuetzle, W., and Shen, Y.. Loss functions for binary class probability estimation: Structure and applications. 2005. http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf.
Busemeyer, J. R. and Wang, Y. M.. Model comparisons and model selections based on generalization test methodology. Journal of Mathematical Psychology, 44:171–189, 2000.
Byrt, T., Bishop, J., and Carlin, J. B.. Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46:423–429, 1993.
Caruana, R. and Niculescu-Mizil, A.. Data mining in metric space: An empirical analysis of supervised learning performance criteria. In Proceedings of KDD. Association for Computing Machinery, New York, 2004.
Chernick, M. R.. Bootstrap Methods: A Guide for Practitioners and Researchers. 2nd ed. Wiley, New York, 2007.
Chow, S. L.. Précis of statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences, 21:169–239, 1998.
Ciraco, M., Rogalewski, M., and Weiss, G.. Improving classifier utility by altering the misclassification cost ratio. In Proceedings of the 1st International Workshop on Utility-Based Data Mining (UBDM '05), pp. 46–52. Association for Computing Machinery, New York, 2005.
Cohen, J.. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960.
Cohen, J.. The earth is round (p < .05). American Psychologist, 49:997–1003, 1994.
Cohen, J.. The earth is round (p < .05). In Harlow, L. L. and Mulaik, S. A., editors, What If There Were No Significance Tests? Lawrence Erlbaum, Mahwah, NJ, 1997.
Cohen, P. R.. Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, MA, 1995.
Conover, W. J.. Practical Nonparametric Statistics. 3rd ed. Wiley, New York, 1999.
Cortes, C. and Mohri, M.. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, Vol. 16. MIT Press, Cambridge, MA, 2004.
Cortes, C. and Mohri, M.. Confidence intervals for the area under the ROC curve. In Advances in Neural Information Processing Systems, Vol. 17. MIT Press, Cambridge, MA, 2005.
Davis, J. and Goadrich, M.. The relationship between precision-recall and ROC curves. In Proceedings of the International Conference on Machine Learning, pp. 233–240. Association for Computing Machinery, New York, 2006.
Deeks, J. J. and Altman, D. G.. Diagnostic tests 4: Likelihood ratios. British Medical Journal, 329:168–169, 2004.
Demartini, G. and Mizzaro, S.. A classification of IR effectiveness metrics. In Proceedings of the European Conference on Information Retrieval, pp. 488–491. Vol. 3936 of Springer Lecture Notes. Springer, Berlin, 2006.
Demšar, J.. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
Demšar, J.. On the appropriateness of statistical tests in machine learning. In Proceedings of the ICML'08 Third Workshop on Evaluation Methods for Machine Learning. Association for Computing Machinery, New York, 2008.
Dice, L. R.. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.
Dietterich, T. G.. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
Domingos, P.. A unified bias-variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning, pp. 231–238. Morgan Kaufmann, San Mateo, CA, 2000.
Drummond, C.. Machine learning as an experimental science (revised). In Proceedings of the AAAI'06 Workshop on Evaluation Methods for Machine Learning I. American Association for Artificial Intelligence, Menlo Park, CA, 2006.
Drummond, C.. Finding a balance between anarchy and orthodoxy. In Proceedings of the ICML'08 Third Workshop on Evaluation Methods for Machine Learning. Association for Computing Machinery, New York, 2008.
Drummond, C. and Japkowicz, N.. Warning: Statistical benchmarking is addictive. Kicking the habit in machine learning. Journal of Experimental and Theoretical Artificial Intelligence, 22(1):67–80, 2010.
Efron, B.. Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78:316–331, 1983.
Efron, B. and Tibshirani, R. J.. An Introduction to the Bootstrap, Chapman and Hall, New York, 1993.
Elazmeh, W., Japkowicz, N., and Matwin, S.. A framework for measuring classification difference with imbalance. In Proceedings of the 2006 European Conference on Machine Learning (ECML/PKDD 2006). Springer, Berlin, 2006.
Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K.. Adacost: Misclassification cost-sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning, pp. 97–105. Morgan Kaufmann, San Mateo, CA, 1999.
Fawcett, T.. ROC graphs: Notes and practical considerations for data mining researchers. Technical Note HPL 2003–4, Hewlett-Packard Laboratories, 2004.
Fawcett, T.. An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874, 2006.
Fawcett, T. and Niculescu-Mizil, A.. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, 2007. doi: http://dx.doi.org/10.1007/s10994-007-5011-0.
Ferri, C., Flach, P. A., and Hernández-Orallo, J.. Improving the AUC of probabilistic estimation trees. In Proceedings of the 14th European Conference on Machine Learning, pp. 121–132. Springer, Berlin, 2003.
Ferri, C., Hernández-Orallo, J., and Modroiu, R.. An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30:27–38, 2009.
Fisher, R. A.. Statistical Methods and Scientific Inference. 2nd ed. Hafner, New York, 1959.
Fisher, R. A.. The Design of Experiments. 2nd ed. Hafner, New York, 1960.
Flach, P. A.. The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In Proceedings of the 20th International Conference on Machine Learning, pp. 194–201. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
Flach, P. A. and Wu, S.. Repairing concavities in ROC curves. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05), pp. 702–707. Professional Book Center, 2005.
Fleiss, J. L.. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378–382, 1971.
Forman, G.. A method for discovering the insignificance of one's best classifier and the unlearnability of a classification task. In Proceedings of the First International Workshop on Data Mining Lessons Learned (DMLL-2002), 2002.
Forster, M. R.. Key concepts in model selection: Performance and generalizability. Journal of Mathematical Psychology, 44:205–231, 2000.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y.. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
Friedman, M.. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.
Friedman, M.. A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11:86–92, 1940.
Fürnkranz, J. and Flach, P. A.. ROC 'n' rule learning – Towards a better understanding of covering algorithms. Machine Learning, 58:39–77, 2005.
Ganti, V., Gehrke, J., Ramakrishnan, R., and Loh, W. Y.. A framework for measuring differences in data characteristics. Journal of Computer and System Sciences, 64:542–578, 2002.
Gardner, M. and Altman, D. G.. Confidence intervals rather than p values: Estimation rather than hypothesis testing. British Medical Journal, 292:746–750, 1986.
Gaudette, L. and Japkowicz, N.. Evaluation methods for ordinal classification. In Proceedings of the 2009 Canadian Conference on Artificial Intelligence. Springer, New York, 2009.
Geng, L. and Hamilton, H.. Choosing the right lens: Finding what is interesting in data mining. In Guillet, F. and Hamilton, H. J., editors, Quality Measures in Data Mining, pp. 3–24. Vol. 43 of Springer Studies in Computational Intelligence Series, Springer, Berlin, 2007.
Gigerenzer, G.. Mindless statistics. Journal of Socio-Economics, 33:587–606, 2004.
Gill, J. and Meir, K.. The insignificance of null hypothesis significance testing. Political Research Quarterly, pp. 647–674, 1999.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S.. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
Goodman, S. N.. A comment on replication, p-values and evidence. Statistics in Medicine, 11:875–879, 2007.
Gosset, W. S. (pen name: Student). The probable error of a mean. Biometrika, 6:1–25, 1908.
Gwet, K.. Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Statistical Methods for Inter-Rater Reliability Assessment Series, 1:1–6, 2002a.
Gwet, K.. Inter-rater reliability: Dependency on trait prevalence and marginal homogeneity. Statistical Methods for Inter-Rater Reliability Assessment Series, 2:1–9, 2002b.
Hand, D. J.. Classifier technology and the illusion of progress. Statistical Science, 21:1–15, 2006.
Hand, D. J.. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77:103–123, 2009.
Hand, D. J. and Till, R. J.. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45:171–186, 2001.
Hanley, J. A. and McNeil, B. J.. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.
Harlow, L. L. and Mulaik, S. A., editors. What If There Were No Significance Tests? Lawrence Erlbaum, Mahwah, NJ, 1997.
Hastie, T., Tibshirani, R., and Friedman, J.. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York, 2001.
He, J., Tan, A. H., Tan, C. L., and Sung, S. Y.. On quantitative evaluation of clustering systems. In Wu, W. and Xiong, H., editors, Information Retrieval and Clustering. Kluwer Academic, Dordrecht, The Netherlands, 2002.
He, X. and Frey, E. C.. The meaning and use of the volume under a three-class ROC surface (VUS). IEEE Transactions on Medical Imaging, 27:577–588, 2008.
Herbrich, R.. Learning Kernel Classifiers. MIT Press, Cambridge, MA, 2002.
Hill, T. and Lewicki, P.. Statistics: Methods and Applications. StatSoft, Tulsa, OK, 2007.
Hinton, P.. Statistics Explained. Routledge, London, 1995.
Holm, S.. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979.
Holte, R. C.. Very simple classification rules perform well on most commonly used data sets. Machine Learning, 11:63–91, 1993.
Hommel, G.. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75:383–386, 1988.
Hope, L. R. and Korb, K. B.. A Bayesian metric for evaluating machine learning algorithms. In Australian Conference on Artificial Intelligence, pp. 991–997. Vol. 3399 of Springer Lecture Notes in Computer Science. Springer, New York, 2004.
Howell, D. C.. Statistical Methods for Psychology. 5th ed. Duxbury Press, Thomson Learning, 2002.
Howell, D. C.. Resampling Statistics: Randomization and the Bootstrap. On-Line Notes, 2007. URL: http://www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html.
Huang, J. and Ling, C. X.. Constructing new and better evaluation measures for machine learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI '07), pp. 859–864, 2007.
Huang, J., Ling, C. X., Zhang, H., and Matwin, S.. Proper model selection with significance test. In Proceedings of the European Conference on Machine Learning (ECML-2008), pp. 536–547. Springer, Berlin, 2008.
Hubbard, R. and Lindsay, R. M.. Why p values are not a useful measure of evidence in statistical significance testing. Theory and Psychology, 18:69–88, 2008.
Ioannidis, J. P. A.. Why most published research findings are false. Public Library of Science Medicine, 2(8):e124, 2005.
Jaccard, P.. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.
Jain, A. K., Dubes, R. C., and Chen, C.. Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:628–633, 1987.
Japkowicz, N.. Classifier evaluation: A need for better education and restructuring. In Proceedings of the ICML'08 Third Workshop on Evaluation Methods for Machine Learning, July 2008.
Japkowicz, N., Sanghi, P., and Tischer, P.. A projection-based framework for classifier performance evaluation. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD '08) – Part I, pp. 548–563. Springer-Verlag, Berlin, 2008.
Jensen, D. and Cohen, P.. Multiple comparisons in induction algorithms. Machine Learning, 38:309–338, 2000.
Jin, H. and Lu, Y.. Permutation test for non-inferiority of the linear to the optimal combination of multiple tests. Statistics and Probability Letters, 79:664–669, 2009.
Kendall, M.. A new measure of rank correlation. Biometrika, 30:81–89, 1938.
Kibler, D. F. and Langley, P.. Machine learning as an experimental science. In Proceedings of the Third European Working Session on Learning (EWSL), pp. 81–92. Pitman, New York, 1988.
Klement, W.. Evaluating machine learning methods: Scored receiver operating characteristics (sROC) curves. Ph.D. thesis, SITE, University of Ottawa, Canada, May 2010.
Kohavi, R.. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), pp. 1137–1143. Morgan Kaufmann, San Mateo, CA, 1995.
Kononenko, I. and Bratko, I.. Information-based evaluation criterion for classifier's performance. Machine Learning, 6:67–80, 1991.
Kononenko, I. and Kukar, M.. Machine Learning and Data Mining: Introduction to Principles and Algorithms. Horwood, Chichester, UK, 2007.
Kraemer, H. C.. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika, 44:461–472, 1979.
Kruskal, W. H. and Wallis, W. A.. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47:583–621, 1952.
Kubat, M., Holte, R. C., and Matwin, S.. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30:195–215, 1998.
Kukar, M., Kononenko, I., and Ljubljana, S.. Reliable classifications with machine learning. In Proceedings of 13th European Conference on Machine Learning (ECML 2002), pp. 219–231. Springer, Berlin, 2002.
Kukar, M. Z. and Kononenko, I.. Cost-sensitive learning with neural networks. In Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), pp. 445–449. Wiley, New York, 1998.
Kuncheva, L. I., Whitaker, C. J., Shipp, C. A., and Duin, R. P. W.. Limits on the majority vote accuracy in classifier fusion. Pattern Analysis and Applications, 6:22–31, 2003.
Kurtz, A. K.. A research test of the Rorschach test. Personnel Psychology, 1:41–53, 1948.
Lachiche, N. and Flach, P.. Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. In Proceedings of the 20th International Conference on Machine Learning, pp. 416–423. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
LaLoudouana, D. and Tarare, M. B.. Data set selection. In Proceedings of the Neural Information Processing System Workshop. MIT Press, Cambridge, MA, 2003.
Landgrebe, T., Paclík, P., Tax, D. J. M., Verzakov, S., and Duin, R. P. W.. Cost-based classifier evaluation for imbalanced problems. In Proceedings of the 10th International Workshop on Structural and Syntactic Pattern Recognition and 5th International Workshop on Statistical Techniques in Pattern Recognition, pp. 762–770. Vol. 3138 of Springer Lecture Notes in Computer Science. Springer-Verlag, Berlin, 2004.
Langford, J.. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.
Lavesson, N. and Davidsson, P.. Towards application-specific evaluation metrics. In Proceedings of the Third Workshop on Evaluation Methods for Machine Learning (ICML'2008). 2008a.
Lavesson, N. and Davidsson, P.. Generic methods for multi-criteria evaluation. In Proceedings of the Eighth SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, Philadelphia, 2008b.
Laviolette, F., Marchand, M., and Shah, M.. Margin-sparsity trade-off for the set covering machine. In Proceedings of the 16th European Conference on Machine Learning (ECML 2005), pp. 206–217. Vol. 3720 of Springer Lecture Notes in Artificial Intelligence. Springer, Berlin, 2005.
Laviolette, F., Marchand, M., Shah, M., and Shanian, S.. Learning the set covering machine by bound minimization and margin-sparsity trade-off. Machine Learning, 78(1-2):275–301, 2010.
Lavrač, N., Flach, P., and Zupan, B.. Rule evaluation measures: A unifying view. In Dzeroski, S. and Flach, P., editors, Ninth International Workshop on Inductive Logic Programming (ILP '99), pp. 174–185. Vol. 1634 of Springer Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1999.
Lebanon, G. and Lafferty, J. D.. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02), pp. 363–370. Morgan Kaufmann, San Mateo, CA, 2002.
Li, M. and Vitanyi, P.. An Introduction to Kolmogorov Complexity and Its Applications. 2nd ed. Springer-Verlag, New York, 1997.
Lindley, D. V. and Scott, W. F.. New Cambridge Statistical Tables. 2nd ed. Cambridge University Press, New York, 1984.
Ling, C. X., Huang, J., and Zhang, H.. AUC: A statistically consistent and more discriminating measure than accuracy. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI '03), pp. 519–526. Morgan Kaufmann, San Mateo, CA, 2003.
Liu, X. Y. and Zhou, Z. H.. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18:63–77, 2006.
Macskassy, S. A., Provost, F., and Rosset, S.. Pointwise ROC confidence bounds: An empirical evaluation. In Proceedings of the Workshop on ROC Analysis in Machine Learning (ROCML-2005) at ICML '05. 2005.
Marchand, M. and Shah, M.. PAC-Bayes learning of conjunctions and classification of gene expression data. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems, Vol. 17, pp. 881–888. MIT Press, Cambridge, MA, 2005.
Marchand, M. and Shawe-Taylor, J.. The set covering machine. Journal of Machine Learning Research, 3:723–746, 2002.
Margineantu, D. D. and Dietterich, T. G.. Bootstrap methods for the cost-sensitive evaluation of classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 583–590. Morgan Kaufmann, San Mateo, CA, 2000.
Marrocco, C., Duin, R. P. W., and Tortorella, F.. Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition, 41:1961–1974, 2008.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M.. The DET curve in assessment of detection task performance. Eurospeech, 4:1895–1898, 1997.
Meehl, P. E.. Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34:103–115, 1967.
Melnik, O., Vardi, Y., and Zhang, C.. Mixed group ranks: Preference and confidence in classifier combination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:973–981, 2004.
Micheals, R. J. and Boult, T. E.. Efficient evaluation of classification and recognition systems. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 50–57. IEEE Computer Society, Washington, D.C., 2001.
Mitchell, T.. Machine Learning. McGraw-Hill, New York, 1997.
Murua, A.. Upper bounds for error rates of linear combinations of classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:591–602, 2002. doi: http://dx.doi.org/10.1109/34.1000235.
Nadeau, C. and Bengio, Y.. Inference for the generalization error. Machine Learning, 52:239–281, 2003.
Nakhaeizadeh, G. and Schnabl, A.. Development of multi-criteria metrics for evaluation of data mining algorithms. In Proceedings of KDD-97, pp. 37–42. American Association for Artificial Intelligence, Menlo Park, CA, 1997.
Nakhaeizadeh, G. and Schnabl, A.. Towards the personalization of algorithms evaluation in data mining. In Proceedings of KDD-98, pp. 289–293. American Association for Artificial Intelligence, Menlo Park, CA, 1998.
Narasimhamurthy, A. M.. Theoretical bounds of majority voting performance for a binary classification problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1988–1995, 2005. doi: http://dx.doi.org/10.1109/TPAMI.2005.249.
Narasimhamurthy, A. M. and Kuncheva, L. I.. A framework for generating data to simulate changing environments. In Proceedings of the 2007 Conference on Artificial Intelligence and Applications, pp. 415–420. ACTA Press, 2007.
Neville, J. and Jensen, D.. A bias/variance decomposition for models using collective inference. Machine Learning, 73:87–106, 2008. doi: http://dx.doi.org/10.1007/s10994-008-5066-6.
O'Brien, D. B., Gupta, M. R., and Gray, R. M.. Cost-sensitive multi-class classification from probability estimates. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 712–719. Association for Computing Machinery, New York, 2008.
Provost, F. and Domingos, P.. Tree induction for probability-based ranking. Machine Learning, 52:199–215, 2003. doi: http://dx.doi.org/10.1023/A:1024099825458.
Provost, F., Fawcett, T., and Kohavi, R.. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Mateo, CA, 1998.
Quenouille, M.. Approximate tests of correlation in time series. Journal of the Royal Statistical Society Series B, 11:18–84, 1949.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. URL: http://www.R-project.org.
Reich, Y. and Barai, S. V.. Evaluating machine learning models for engineering problems. Artificial Intelligence in Engineering, 13:257–272, 1999.
Rendell, L. and Cho, H.. Empirical learning as a function of concept character. Machine Learning, 5:267–298, 1990.
ROCR. Germany, 2007. URL: http://rocr.bioinf.mpi-sb.mpg.de/.
Rosset, S.. Model selection via the AUC. In Proceedings of the 21st International Conference on Machine Learning. Association for Computing Machinery, New York, 2004.
Rosset, S., Perlich, C., and Zadrozny, B.. Ranking-based evaluation of regression models. Knowledge and Information Systems, 12(3):331–353, 2007.
Sahiner, B., Chan, H., and Hadjiiski, L.. Classifier performance estimation under the constraint of a finite sample size: Resampling schemes applied to neural network classifiers. Neural Networks, 21:476–483, 2008.
Saitta, L. and Neri, F.. Learning in the “real world.” 1998 special issue: applications of machine learning and the knowledge discovery process. Machine Learning, 30:133–163, 1998.
Salzberg, S. L.. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–327, 1997.
Santos-Rodríguez, R., Guerrero-Curieses, A., Alaiz-Rodríguez, R., and Cid-Sueiro, J.. Cost-sensitive learning based on Bregman divergences. Machine Learning, 76:271–285, 2009. doi: http://dx.doi.org/10.1007/s10994-009-5132-8.
Schmidt, F. L.. Statistical significance testing and cumulative knowledge in psychology. Psychological Methods, 1:115–129, 1996.
Schouten, H. J. A.. Measuring pairwise interobserver agreement when all subjects are judged by the same observers. Statistica Neerlandica, 36:45–61, 1982.
Scott, W. A.. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19:321–325, 1955.
Shah, M.. Sample Compression, Margins and Generalization: Extensions to the Set Covering Machine. Ph.D. thesis, SITE, University of Ottawa, Canada, May 2006.
Shah, M.. Sample compression bounds for decision trees. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 799–806. Association for Computing Machinery, New York, 2007. doi: http://doi.acm.org/10.1145/1273496.1273597.
Shah, M.. Risk bounds for classifier evaluation: Possibilities and challenges. In Proceedings of the 3rd Workshop on Evaluation Methods for Machine Learning at ICML-2008. 2008.
Shah, M. and Shanian, S.. Hold-out risk bounds for classifier performance evaluation. In Proceedings of the 4th Workshop on Evaluation Methods for Machine Learning at ICML '09. 2009.
Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T.. ROCR: Visualizing classifier performance in R. Bioinformatics, 21:3940–3941, 2005.
Smith-Miles, K. A.. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Computing Surveys, 41(1):article 6, 2008.
Soares, C.. Is the UCI repository useful for data mining? In Pires, F. M. and Abreu, S., editors, Proceedings of the 11th Portuguese Conference on Artificial Intelligence (EPIA '03), pp. 209–223. Vol. 2902 of Springer Lecture Notes in Artificial Intelligence. Springer, Berlin, 2003.
Soares, C., Costa, J., and Brazdil, P.. A simple and intuitive measure for multicriteria evaluation of classification algorithms. In Proceedings of the ECML 2000 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pp. 87–96. Springer, Berlin, 2000.
Sonnenburg, S., Braun, M. L., Ong, C. S., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller, K., Pereira, F., Rasmussen, C. E., Rätsch, G., Schölkopf, B., Smola, A., Vincent, P., Weston, J., and Williamson, R.. The need for open source software in machine learning. Journal of Machine Learning Research, 8:2443–2466, 2007.
Spearman, C.. The proof and measurement of association between two things. American Journal of Psychology, 15:72–101, 1904.
StatSoft Inc. Electronic Statistics Textbook. URL: http://www.statsoft.com/textbook/stathome.html.
Stocki, T., Japkowicz, N., Ungar, K., Hoffman, I., Yi, J., Li, G., and Siebes, A., editors. Proceedings of the Data Mining Contest, Eighth International Conference on Data Mining. IEEE Computer Society, Washington, D.C., 2008.
Vanderlooy, S. and Hüllermeier, E.. A critical analysis of variants of the AUC. Machine Learning, 72(3):247–262, 2008. doi: http://dx.doi.org/10.1007/s10994-008-5070-x.
Vapnik, V. and Chapelle, O.. Bounds on error expectation for support vector machines. Neural Computation, 12:2013–2036, 2000.
Webb, G. I.. Discovering significant patterns. Machine Learning, 68:1–33, 2007. doi: http://dx.doi.org/10.1007/s10994-007-5006-x.
Weiss, S. M. and Kapouleas, I.. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI '89), pp. 781–787, Morgan Kaufmann, San Mateo, CA, 1989.
Weiss, S. M. and Kulikowski, C. A.. Computer Systems That Learn. Morgan Kaufmann, San Mateo, CA, 1991.
Wilcoxon, F.. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.
Witten, I. H. and Frank, E.. Weka 3: Data Mining Software in Java. 2005a. http://www.cs.waikato.ac.nz/ml/weka/.
Witten, I. H. and Frank, E.. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Mateo, CA, 2005b.
Wolpert, D. H.. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341–1390, 1996.
Wolpert, D. H. and Macready, W. G.. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1:67–82, 1997.
Wu, S., Flach, P. A., and Ferri, C.. An improved model selection heuristic for AUC. In Proceedings of the 18th European Conference on Machine Learning, Vol. 4701, pp. 478–487. Springer, Berlin, 2007.
Yan, L., Dodier, R., Mozer, M. C., and Wolniewicz, R.. Optimizing classifier performance via the Wilcoxon–Mann–Whitney statistic. In The Proceedings of the International Conference on Machine Learning (ICML), pp. 848–855. American Association for Artificial Intelligence, Menlo Park, CA, 2003.
Yousef, W. A., Wagner, R. F., and Loew, M. H.. Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier. Pattern Recognition Letters, 26:2600–2610, 2005.
Yousef, W. A., Wagner, R. F., and Loew, M. H.. Assessing classifiers from two independent data sets using ROC analysis: A nonparametric approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:1809–1817, 2006.
Yu, C. H.. Resampling methods: Concepts, applications, and justification. Practical Assessment, Research and Evaluation, 8(19), 2003.
Zadrozny, B. and Elkan, C.. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 694–699. Association for Computing Machinery, New York, 2002. doi: http://doi.acm.org/10.1145/775047.775151.
Zadrozny, B., Langford, J., and Abe, N.. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the 3rd IEEE International Conference on Data Mining, p. 435. IEEE Computer Society, Washington, D.C., 2003.