Cited by 179
Publisher: Cambridge University Press
Online publication date: March 2012
Print publication year: 2012
Online ISBN: 9781139035613

Book description

Machine learning is an interdisciplinary field of science and engineering that studies mathematical theories and practical applications of systems that learn. This book introduces the theory, methods, and applications of density ratio estimation, a newly emerging paradigm in the machine learning community. Various machine learning problems, such as non-stationarity adaptation, outlier detection, dimensionality reduction, independent component analysis, clustering, classification, and conditional density estimation, can be systematically solved via the estimation of probability density ratios. The authors offer a comprehensive introduction to various density ratio estimators, including methods based on density estimation, moment matching, probabilistic classification, density fitting, and density ratio fitting, and describe how these can be applied to machine learning problems. The book also provides mathematical theory for density ratio estimation, including parametric and non-parametric convergence analyses and a numerical stability analysis, completing the first definitive treatment of the entire framework of density ratio estimation in machine learning.
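As a minimal illustrative sketch (not an excerpt from the book), the probabilistic-classification approach mentioned above can be realized by training a classifier to separate samples drawn from the numerator density p from samples drawn from the denominator density q, and converting its posterior into a ratio estimate via Bayes' rule: r(x) = (n_q / n_p) * P(y=1 | x) / P(y=0 | x). The Gaussian toy data, the sample sizes, and the use of scikit-learn's LogisticRegression are assumptions made purely for illustration.

# Illustrative sketch (assumptions noted above): density ratio estimation
# via probabilistic classification, one of the estimator families the book covers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_p = rng.normal(loc=0.0, scale=1.0, size=(500, 1))   # samples from the numerator density p
x_q = rng.normal(loc=0.5, scale=1.2, size=(500, 1))   # samples from the denominator density q

X = np.vstack([x_p, x_q])
y = np.concatenate([np.ones(len(x_p)), np.zeros(len(x_q))])  # label 1 for p, 0 for q

clf = LogisticRegression().fit(X, y)

def density_ratio(x):
    """Estimate r(x) = p(x)/q(x) from the classifier's posterior via Bayes' rule."""
    post = clf.predict_proba(x)[:, 1]          # P(y=1 | x)
    prior_correction = len(x_q) / len(x_p)     # n_q / n_p
    return prior_correction * post / (1.0 - post)

print(density_ratio(np.array([[0.0], [2.0]])))  # ratio estimates at two test points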

Reviews

'There is no doubt that this book will change the way people think about machine learning and stimulate many new directions for research.'

Thomas G. Dietterich - from the Foreword


Bibliography
Agakov, F., and Barber, D. 2006. Kernelized Infomax Clustering. Pages 17—24 of: Weiss, Y., Schölkopf, B., and Platt, J. (eds), Advances in Neural Information Processing Systems 18.Cambridge, MA: MIT Press.
Aggarwal, C. C., and Yu, P. S. (eds). 2008. Privacy-Preserving Data Mining: Models and Algorithms.New York: Springer.
Akaike, H. 1970. Statistical Predictor Identification. Annals of the Institute of Statistical Mathematics, 22, 203–217.
Akaike, H. 1974. A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, AC-19(6), 716–723.
Akaike, H. 1980. Likelihood and the Bayes Procedure. Pages 141—166 of: Bernardo, J. M., DeGroot, M. H., Lindley, D. V., and Smith, A. F. M. (eds), Bayesian Statistics.Valencia, Spain: Valencia University Press.
Akiyama, T., Hachiya, H., and Sugiyama, M. 2010. Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning. Neural Networks, 23(5), 639—648.
Ali, S. M., and Silvey, S. D. 1966. A General Class of Coefficients of Divergence of One Distribution from Another. Journal of the Royal Statistical Society, Series B, 28(1), 131—142.
Amari, S. 1967. Theory of Adaptive Pattern Classifiers. IEEE Transactions on Electronic Computers, EC-16(3), 299–307.
Amari, S. 1998. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251–276.
Amari, S. 2000. Estimating Functions of Independent Component Analysis for Temporally Correlated Signals. Neural Computation, 12(9), 2083–2107.
Amari, S., and Nagaoka, H. 2000. Methods of Information Geometry.Providence, RI: Oxford University Press.
Amari, S., Fujita, N., and Shinomoto, S. 1992. Four Types of Learning Curves. Neural Computation, 4(4), 605–618.
Amari, S., Cichocki, A., and Yang, H. H. 1996. A New Learning Algorithm for Blind Signal Separation. Pages 757—763 of: Touretzky, D. S., Mozer, M., C., and Hasselmo, M. E. (eds), Advances in Neural Information Processing Systems 8.Cambridge, MA: MIT Press.
Anderson, N., Hall, P., and Titterington, D. 1994. Two-Sample Test Statistics for Measuring Discrepancies between Two Multivariate Probability Density Functions Using Kernel-based Density Estimates. Journal of Multivariate Analysis, 50, 41–54.
Ando, R. K., and Zhang, T. 2005. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6, 1817—1853.
Antoniak, C. 1974. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. The Annals of Statistics, 2(6), 1152–1174.
Aronszajn, N. 1950. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68, 337–404.
Bach, F., and Harchaoui, Z. 2008. DIFFRAC: A Discriminative and Flexible Framework for Clustering. Pages 49—56 of: Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds), Advances in Neural Information Processing Systems 20.Cambridge, MA: MIT Press.
Bach, F., and Jordan, M. I. 2002. Kernel Independent Component Analysis. Journal of Machine Learning Research, 3, 1–8.
Bach, F., and Jordan, M. I. 2006. Learning Spectral Clustering, with Application to Speech Separation. Journal of Machine Learning Research, 7, 1963—2001.
Bachman, G., and Narici, L. 2000. Functional Analysis.Mineola, NY: Dover Publications.
Bakker, B., and Heskes, T. 2003. Task Clustering and Gating for Bayesian Multitask Learning. Journal of Machine Learning Research, 4, 83—99.
Bartlett, P., Bousquet, O., and Mendelson, S. 2005. Local Rademacher Complexities. The Annals of Statistics, 33, 1487–1537.
Basseville, M., and Nikiforov, V. 1993. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. 1998. Robust and Efficient Estimation by Minimising a Density Power Divergence. Biometrika, 85(3), 549–559.
Baxter, J. 1997. A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling. Machine Learning, 28, 7–39.
Baxter, J. 2000. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12, 149–198.
Belkin, M., and Niyogi, P. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6), 1373—1396.
Bellman, R. 1961. Adaptive Control Processes: A Guided Tour.Princeton, NJ: Princeton University Press.
Ben-David, S., and Schuller, R. 2003. Exploiting Task Relatedness for Multiple Task Learning. Pages 567-580 of: Proceedings of the Sixteenth Annual Conference on Learning Theory (COLT2003).
Ben-David, S., Gehrke, J., and Schuller, R. 2002. A Theoretical Framework for Learning from a Pool of Disparate Data Sources. Pages 443–49 of: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2002).
Bensaid, N., and Fabre, J. P. 2007. Optimal Asymptotic Quadratic Error of Kernel Estimators of Radon-Nikodym Derivatives for Strong Mixing Data. Journal of Nonparametric Statistics, 19(2), 77–88.
Bertsekas, D., Nedic, A., and Ozdaglar, A. 2003. Convex Analysis and Optimization.Belmont, MA: Athena Scientific.
Best, M. J. 1982. An Algorithm for the Solution of the Parametric Quadratic Programming Problem. Tech. rept. 82–24. Faculty of Mathematics, University of Waterloo.
Biau, G., and Györfi, L. 2005. On the Asymptotic Properties of a Nonparametric l1-test Statistic of Homogeneity. IEEE Transactions on Information Theory, 51(11), 3965—3973.
Bickel, P. 1969. A Distribution Free Version of the Smirnov Two Sample Test in the p-variate Case. The Annals of Mathematical Statistics, 40(1), 1–23.
Bickel, S., Brückner, M., and Scheffer, T. 2007. Discriminative Learning for Differing Training and Test Distributions. Pages 81—88 of: Proceedings of the 24th International Conference on Machine Learning (ICML2007).
Bickel, S., Bogojeska, J., Lengauer, T., and Scheffer, T. 2008. Multi-Task Learning for HIV Therapy Screening. Pages 56-63 of: McCallum, A., and Roweis, S. (eds), Proceedings of 25th Annual International Conference on Machine Learning (ICML2008).
Bishop, C. M. 1995. Neural Networks for Pattern Recognition.Oxford, UK: Clarendon Press.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning.New York: Springer.
Blanchard, G., Kawanabe, M., Sugiyama, M., Spokoiny, V., and Müller, K.-R. 2006. In Search of Non-Gaussian Components of a High-dimensional Distribution. Journal of Machine Learning Research, 7(Feb.), 247–282.
Blei, D.M., and Jordan, M.I. 2006. Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis, 1(1), 121–144.
Bolton, R. J., and Hand, D. J. 2002. Statistical Fraud Detection: A Review. Statistical Science, 17(3), 235–255.
Bonilla, E., Chai, K. M., and Williams, C. 2008. Multi-Task Gaussian Process Prediction. Pages 153–160 of: Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds), Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press.
Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. 2006. Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy. Bioinformatics, 22(14), e49-e57.
Bousquet, O. 2002. A Bennett Concentration Inequality and its Application to Suprema of Empirical Process. Note aux Compte Rendus de l'Académie des Sciences de Paris, 334, 495–500.
Boyd, S., and Vandenberghe, L. 2004. Convex Optimization.Cambridge, UK: Cambridge University Press.
Bradley, A. P. 1997. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, 30(7), 1145—1159.
Bregman, L. M. 1967. The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming. USSR Computational Mathematics and Mathematical Physics, 7, 200—217.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. 2000. LOF: Identifying Density-Based Local Outliers. Pages 93-104 of: Chen, W., Naughton, J. F., and Bernstein, P. A. (eds), Proceedings of the ACM SIGMOD International Conference on Management of Data.
Brodsky, B., and Darkhovsky, B. 1993. Nonparametric Methods in Change-Point Problems. Dordrecht, the Netherlands: Kluwer Academic Publishers.
Broniatowski, M., and Keziou, A. 2009. Parametric Estimation and Tests through Divergences and the Duality Technique. Journal of Multivariate Analysis, 100, 16–26.
Buhmann, J. M. 1995. Data Clustering and Learning. Pages 278-281 of: Arbib, M. A. (ed), The Handbook of Brain Theory and Neural Networks.Cambridge, MA: MIT Press.
Bura, E., and Cook, R. D. 2001. Extending Sliced Inverse Regression. Journal of the American Statistical Association, 96(455), 996–1003.
Caponnetto, A., and de Vito, E. 2007. Optimal Rates for Regularized Least-Squares Algorithm. Foundations of Computational Mathematics, 7(3), 331—368.
Cardoso, J.-F. 1999. High-Order Contrasts for Independent Component Analysis. Neural Computation, 11(1), 157–192.
Cardoso, J.-F., and Souloumiac, A. 1993. Blind Beamforming for Non-Gaussian Signals. Radar and Signal Processing, IEE Proceedings-F, 140(6), 362–370.
Caruana, R., Pratt, L., and Thrun, S. 1997. Multitask Learning. Machine Learning, 28, 41–75.
Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games.Cambridge, UK: Cambridge University Press.
Chan, J., Bailey, J., and Leckie, C. 2008. Discovering Correlated Spatio-Temporal Changes in Evolving Graphs. Knowledge and Information Systems, 16(1), 53–96.
Chang, C. C., and Lin, C. J. 2001. LIBSVM: A Library for Support Vector Machines. Tech. rept. Department of Computer Science, National Taiwan University. http://www.csie.ntu.edu.tw/∼cjlin/libsvm/.
Chapelle, O., Schölkopf, B., and Zien, A. (eds). 2006. Semi-Supervised Learning.Cambridge, MA: MIT Press.
Chawla, N. V., Japkowicz, N., and Kotcz, A. 2004. Editorial: Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter, 6(1), 1–6.
Chen, S.-M., Hsu, Y.-S., and Liaw, J.-T. 2009. On Kernel Estimators of Density Ratio. Statistics, 43(5), 463–79.
Chen, S. S., Donoho, D. L., and Saunders, M. A. 1998. Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing, 20(1), 33—61.
Cheng, K. F., and Chu, C. K. 2004. Semiparametric Density Estimation under a Two-sample Density Ratio Model. Bernoulli, 10(4), 583–604.
Chiaromonte, F., and Cook, R. D. 2002. Sufficient Dimension Reduction and Graphics in Regression. Annals of the Institute of Statistical Mathematics, 54(4), 768—795.
Cichocki, A., and Amari, S. 2003. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. New York: Wiley.
Cohn, D. A., Ghahramani, Z., and Jordan, M. I. 1996. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, 4, 129—145.
Collobert, R., and Bengio, S. 2001. SVMTorch: Support Vector Machines for Large-Scale Regression Problems. Journal of Machine Learning Research, 1, 143–160.
Comon, P. 1994. Independent Component Analysis, A New Concept? Signal Processing, 36(3), 287–314.
Cook, R. D. 1998a. Principal Hessian Directions Revisited. Journal of the American Statistical Association, 93(441), 84–100.
Cook, R. D. 1998b. Regression Graphics: Ideas for Studying Regressions through Graphics.New York: Wiley.
Cook, R. D. 2000. SAVE: A Method for Dimension Reduction and Graphics in Regression. Communications in Statistics-Theory and Methods, 29(9), 2109–2121.
Cook, R. D., and Forzani, L. 2009. Likelihood-Based Sufficient Dimension Reduction. Journal of the American Statistical Association, 104(485), 197–208.
Cook, R. D., and Ni, L. 2005. Sufficient Dimension Reduction via Inverse Regression. Journal of the American Statistical Association, 100(470), 410–428.
Cortes, C., and Vapnik, V. 1995. Support-Vector Networks. Machine Learning, 20, 273–297.
Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. 2nd edn. Hoboken, NJ: Wiley.
Cramér, H. 1946. Mathematical Methods of Statistics.Princeton, NJ: Princeton University Press.
Craven, P., and Wahba, G. 1979. Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation. Numerische Mathematik, 31, 377–403.
Csiszár, I. 1967. Information-Type Measures of Difference of Probability Distributions and Indirect Observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
Cwik, J., and Mielniczuk, J. 1989. Estimating Density Ratio with Application to Discriminant Analysis. Communications in Statistics: Theory and Methods, 18(8), 3057—3069.
Darbellay, G. A., and Vajda, I. 1999. Estimation of the Information by an Adaptive Partitioning of the Observation Space. IEEE Transactions on Information Theory, 45(4), 1315–1321.
Davis, J., Kulis, B., Jain, P., Sra, S., and Dhillon, I. 2007. Information-Theoretic Metric Learning. Pages 209—216 of: Ghahramani, Z. (ed), Proceedings of the 24th Annual International Conference on Machine Learning (ICML2007).
Demmel, J. W.1997. Applied Numerical Linear Algebra.Philadelphia, PA: Society for Industrial and Applied Mathematics.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, series B, 39(1), 1—38.
Dhillon, I. S., Guan, Y., and Kulis, B. 2004. Kernel K-Means, Spectral Clustering and Normalized Cuts. Pages 551-556 of: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York: ACM Press.
Donoho, D. L., and Grimes, C. E. 2003. Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data. Pages 5591–5596 of: Proceedings of the National Academy of Sciences.
Duda, R. O., Hart, P. E., and Stork, D. G. 2001. Pattern Classification. 2nd edn. New York: Wiley.
Duffy, N., and Collins, M. 2002. Convolution Kernels for Natural Language. Pages 625-632 of: Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds), Advances in Neural Information Processing Systems 14.Cambridge, MA: MIT Press.
Durand, J., and Sabatier, R. 1997. Additive Splines for Partial Least Squares Regression. Journal of the American Statistical Association, 92(440), 1546–1554.
Edelman, A. 1988. Eigenvalues and Condition Numbers of Random Matrices. SIAM Journal on Matrix Analysis and Applications, 9(4), 543—560.
Edelman, A., and Sutton, B. D. 2005. Tails of Condition Number Distributions. SIAM Journal on Matrix Analysis and Applications, 27(2), 547–560.
Edelman, A., Arias, T. A., and Smith, S. T. 1998. The Geometry of Algorithms with Orthogonality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), 303—353.
Efron, B. 1975. The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis. Journal of the American Statistical Association, 70(352), 892–898.
Efron, B., and Tibshirani, R. J. 1993. An Introduction to the Bootstrap.New York: Chapman & Hall/CRC.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. 2004. Least Angle Regression. The Annals of Statistics, 32(2), 407–499.
Elkan, C. 2011. Privacy-Preserving Data Mining via Importance Weighting. Pages 15–21 of: Dimitrakakis, C., Gkoulalas-Divanis, A., Mitrokotsa, A., Verykios, V. S., and Saygin, Y. (eds), Privacy and Security Issues in Data Mining and Machine Learning. Berlin: Springer.
Evgeniou, T., and Pontil, M. 2004. Regularized Multi-Task Learning. Pages 109–117 of: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2004).
Faivishevsky, L., and Goldberger, J. 2009. ICA based on a Smooth Estimation of the Differential Entropy. Pages 433-440 of: Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems 21.Cambridge, MA: MIT Press.
Faivishevsky, L., and Goldberger, J. 2010 (Jun. 21—25). A Nonparametric Information Theoretic Clustering Algorithm. Pages 351—358 of: Joachims, A. T., and Fürnkranz, J. (eds), Proceedings of 27th International Conference on Machine Learning (ICML2010).
Fan, H., Zaïane, O. R., Foss, A., and Wu, J. 2009. Resolution-Based Outlier Factor: Detecting the Top-n Most Outlying Data Points in Engineering Data. Knowledge and Information Systems, 19(1), 31–51.
Fan, J., Yao, Q., and Tong, H. 1996. Estimation of Conditional Densities and Sensitivity Measures in Nonlinear Dynamical Systems. Biometrika, 83(1), 189–206.
Fan, R. -E., Chen, P.-H., and Lin, C.-J. 2005. Working Set Selection Using Second Order Information for Training SVM. Journal of Machine Learning Research, 6, 1889—1918.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9, 1871–1874.
Fedorov, V. V. 1972. Theory of Optimal Experiments. New York: Academic Press.
Fernandez, E. A. 2005. The dprep Package. Tech. rept. University of Puerto Rico.
Feuerverger, A. 1993. A Consistent Test for Bivariate Dependence. International Statistical Review, 61(3), 419–433.
Fisher, R. A. 1936. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2), 179–188.
Fishman, G. S. 1996. Monte Carlo: Concepts, Algorithms, and Applications.Berlin, Germany: Springer-Verlag.
Fokianos, K., Kedem, B., Qin, J., and Short, D. A. 2001. A Semiparametric Approach to the One-Way Layout. Technometrics, 43, 56—64.
Franc, V., and Sonnenburg, S. 2009. Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization. Journal of Machine Learning Research, 10, 2157–2192.
Fraser, A. M., and Swinney, H. L. 1986. Independent Coordinates for Strange Attractors from Mutual Information. Physical Review A, 33(2), 1134–1140.
Friedman, J., and Rafsky, L. 1979. Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests. The Annals of Statistics, 7(4), 697–717.
Friedman, J. H. 1987. Exploratory Projection Pursuit. Journal of the American Statistical Association, 82(397), 249–266.
Friedman, J. H., and Tukey, J. W. 1974. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers, C-23(9), 881–890.
Fujimaki, R., Yairi, T., and Machida, K. 2005. An Approach to Spacecraft Anomaly Detection Problem Using Kernel Feature Space. Pages 401–410 of: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2005).
Fujisawa, H., and Eguchi, S. 2008. Robust Parameter Estimation with a Small Bias against Heavy Contamination. Journal of Multivariate Analysis, 99(9), 2053–2081.
Fukumizu, K. 2000. Statistical Active Learning in Multilayer Perceptrons. IEEE Transactions on Neural Networks, 11(1), 17–26.
Fukumizu, K., Bach, F. R., and Jordan, M. I. 2004. Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research, 5(Jan), 73–99.
Fukumizu, K., Bach, F. R., and Jordan, M. I. 2009. Kernel Dimension Reduction in Regression. The Annals of Statistics, 37(4), 1871–1905.
Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition. 2nd edn. Boston, MA: Academic Press, Inc.
Fung, G. M., and Mangasarian, O. L. 2005. Multicategory Proximal Support Vector Machine Classifiers. Machine Learning, 59(1-2), 77–97.
Gao, J., Cheng, H., and Tan, P.-N. 2006a. A Novel Framework for Incorporating Labeled Examples into Anomaly Detection. Pages 593–597 of: Proceedings of the 2006 SIAM International Conference on Data Mining.
Gao, J., Cheng, H., and Tan, P.-N. 2006b. Semi-Supervised Outlier Detection. Pages 635–636 of: Proceedings of the 2006 ACM Symposium on Applied Computing.
Gärtner, T. 2003. A Survey of Kernels for Structured Data. SIGKDD Explorations, 5(1), S268–S275.
Gärtner, T., Flach, P., and Wrobel, S. 2003. On Graph Kernels: Hardness Results and Efficient Alternatives. Pages 129-143 of: Schölkopf, B., and Warmuth, M. (eds), Proceedings of the Sixteenth Annual Conference on Computational Learning Theory.
Ghosal, S., and van der Vaart, A. W. 2001. Entropies and Rates of Convergence for Maximum Likelihood and Bayes Estimation for Mixtures of Normal Densities. Annals of Statistics, 29, 1233–1263.
Globerson, A., and Roweis, S. 2006. Metric Learning by Collapsing Classes. Pages 451-58 of: Weiss, Y., Schölkopf, B., and Platt, J. (eds), Advances in Neural Information Processing Systems 18.Cambridge, MA: MIT Press.
Godambe, V. P. 1960. An Optimum Property of Regular Maximum Likelihood Estimation. Annals of Mathematical Statistics, 31, 1208–1211.
Goldberger, J., Roweis, S., Hinton, G., and Salakhutdinov, R. 2005. Neighbourhood Components Analysis. Pages 513—520 of: Saul, L. K., Weiss, Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems 17.Cambridge, MA: MIT Press.
Golub, G. H., and Loan, C. F. Van. 1996. Matrix Computations.Baltimore, MD: Johns Hopkins University Press.
Gomes, R., Krause, A., and Perona, P. 2010. Discriminative Clustering by Regularized Information Maximization. Pages 766–774 of: Lafferty, J., Williams, C. K. I., Zemel, R., Shawe-Taylor, J., and Culotta, A. (eds), Advances in Neural Information Processing Systems 23. Cambridge, MA: MIT Press.
Goutis, C., and Fearn, T. 1996. Partial Least Squares Regression on Smooth Factors. Journal of the American Statistical Association, 91(434), 627–632.
Graham, D. B., and Allinson, N. M. 1998. Characterizing Virtual Eigensignatures for General Purpose Face Recognition. Pages 446-56 of: Computer and Systems Sciences. NATO ASI Series F, vol. 163. Berlin, Germany: Springer.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. 2005. Measuring Statistical Dependence with Hilbert-Schmidt Norms. Pages 63-77 of: Jain, S., Simon, H. U., and Tomita, E. (eds), Algorithmic Learning Theory. Lecture Notes in Artificial Intelligence. Berlin, Germany: Springer-Verlag.
Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. 2007. A Kernel Method for the Two-Sample Problem. Pages 513–520 of: Schölkopf, B., Platt, J., and Hoffman, T. (eds), Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press.
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. 2008. A Kernel Statistical Test of Independence. Pages 585–592 of: Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds), Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press.
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. 2009. Covariate Shift by Kernel Mean Matching. Chap. 8, pages 131—160 of: Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. (eds), Dataset Shift in Machine Learning.Cambridge, MA: MIT Press.
Guralnik, V., and Srivastava, J. 1999. Event Detection from Time Series Data. Pages 33-2 of: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD1999).
Gustafsson, F. 2000. Adaptive Filtering and Change Detection.Chichester, UK: Wiley.
Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157–1182.
Hachiya, H., Akiyama, T., Sugiyama, M., and Peters, J. 2009. Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning. Neural Networks, 22(10), 1399–1410.
Hachiya, H., Sugiyama, M., and Ueda, N. 2011a. Importance-Weighted Least-Squares Probabilistic Classifier for Covariate Shift Adaptation with Application to Human Activity Recognition. Neurocomputing. To appear.
Hachiya, H., Peters, J., and Sugiyama, M. 2011b. Reward Weighted Regression with Sample Reuse. Neural Computation, 23(11), 2798–2832.
Hall, P., and Tajvidi, N. 2002. Permutation Tests for Equality of Distributions in High-Dimensional Settings. Biometrika, 89(2), 359–374.
Härdle, W., Müller, M., Sperlich, S., and Werwatz, A. 2004. Nonparametric and Semiparametric Models.Berlin, Germany: Springer.
Hartigan, J. A. 1975. Clustering Algorithms.New York: Wiley.
Hastie, T., and Tibshirani, R. 1996a. Discriminant Adaptive Nearest Neighbor Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607–615.
Hastie, T., and Tibshirani, R. 1996b. Discriminant Analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58(1), 155–176.
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.New York: Springer.
Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. 2004. The Entire Regularization Path for the Support Vector Machine. Journal of Machine Learning Research, 5, 1391—1415.
He, X., and Niyogi, P. 2004. Locality Preserving Projections. Pages 153-160 of: Thrun, S., Saul, L., and Schölkopf, B. (eds), Advances in Neural Information Processing Systems 16.Cambridge, MA: MIT Press.
Heckman, J. J. 1979. Sample Selection Bias as a Specification Error. Econometrica, 47(1), 153–161.
Henkel, R. E. 1976. Tests of Significance.Beverly Hills, CA: Sage.
Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. 2011. Statistical Outlier Detection Using Direct Density Ratio Estimation. Knowledge and Information Systems, 26(2), 309–336.
Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507.
Hodge, V., and Austin, J. 2004. A Survey of Outlier Detection Methodologies. Artificial Intelligence Review, 22(2), 85–126.
Hoerl, A. E., and Kennard, R. W. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(3), 55—67.
Horn, R., and Johnson, C. 1985. Matrix Analysis. Cambridge, UK: Cambridge University Press.
Hotelling, H. 1936. Relations between Two Sets of Variates. Biometrika, 28(3-4), 321–377.
Hotelling, H. 1951. A Generalized T Test and Measure of Multivariate Dispersion. Pages 23-41 of: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability.Berkeley: University of California Press.
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. 2009. Nonlinear Causal Discovery with Additive Noise Models. Pages 689-696 of: Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems 21.Cambridge, MA: MIT Press.
Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., and Schölkopf, B. 2007. Correcting Sample Selection Bias by Unlabeled Data. Pages 601–608 of: Schölkopf, B., Platt, J., and Hoffman, T. (eds), Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press.
Huber, P. J. 1985. Projection Pursuit. The Annals of Statistics, 13(2), 435–75.
Hulle, M. M. Van. 2005. Edgeworth Approximation of Multivariate Differential Entropy. Neural Computation, 17(9), 1903–1910.
Hulle, M. M. Van. 2008. Sequential Fixed-Point ICA Based on Mutual Information Minimization. Neural Computation, 20(5), 1344–1365.
Hyvärinen, A. 1999. Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks, 10(3), 626.
Hyvärinen, A., Karhunen, J., and Oja, E. 2001. Independent Component Analysis. New York: Wiley.
Idé, T., and Kashima, H. 2004. Eigenspace-Based Anomaly Detection in Computer Systems. Pages 440–449 of: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2004).
Ishiguro, M., Sakamoto, Y., and Kitagawa, G. 1997. Bootstrapping Log Likelihood and EIC, an Extension of AIC. Annals of the Institute of Statistical Mathematics, 49, 411—434.
Jacoba, P., and Oliveirab, P. E. 1997. Kernel Estimators of General Radon-Nikodym Derivatives. Statistics, 30, 25–46.
Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data.Englewood Cliffs, NJ: Prentice Hall.
Jaynes, E. T. 1957. Information Theory and Statistical Mechanics. Physical Review, 106(4), 620–630.
Jebara, T. 2004. Kernelized Sorting, Permutation and Alignment for Minimum Volume PCA. Pages 609—623 of: 17th Annual Conference on Learning Theory (COLT2004).
Jiang, X., and Zhu, X. 2009. vEye: Behavioral Footprinting for Self-Propagating Worm Detection and Profiling. Knowledge and Information Systems, 18(2), 231–262.
Joachims, T. 1999. Making Large-Scale SVM Learning Practical. Pages 169-184 of: Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds), Advances in Kernel Methods—Support Vector Learning.Cambridge, MA: MIT Press.
Joachims, T. 2006. Training Linear SVMs in Linear Time. Pages 217-226 of: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2006).
Jolliffe, I. T. 1986. Principal Component Analysis.New York: Springer-Verlag.
Jones, M. C., Hjort, N.L., Harris, I. R., and Basu, A. 2001. A Comparison of Related Density-based Minimum Divergence Estimators. Biometrika, 88, 865–873.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. 1999. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2), 183.
Jutten, C., and Herault, J. 1991. Blind Separation of Sources, Part I: An Adaptive algorithm Based on Neuromimetic Architecture. Signal Processing, 24(1), 1–10.
Kanamori, T. 2007. Pool-Based Active Learning with Optimal Sampling Distribution and its Information Geometrical Interpretation. Neurocomputing, 71(1—3), 353—362.
Kanamori, T., and Shimodaira, H. 2003. Active Learning Algorithm Using the Maximum Weighted Log-Likelihood Estimator. Journal of Statistical Planning and Inference, 116(1), 149–162.
Kanamori, T., Hido, S., and Sugiyama, M. 2009. A Least-squares Approach to Direct Importance Estimation. Journal of Machine Learning Research, 10(Jul.), 1391—1445.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2010. Theoretical Analysis of Density Ratio Estimation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E93-A(4), 787–798.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2011a. f-Divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models. IEEE Transactions on Information Theory. To appear.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2011b. Statistical Analysis of Kernel-Based Least-Squares Density-Ratio Estimation. Machine Learning. To appear.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2011c. Kernel-Based Least-Squares Density-Ratio Estimation II. Condition Number Analysis. Machine Learning. submitted.
Kankainen, A. 1995. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. Ph.D. thesis, University of Jyväskylä, Jyväskylä, Finland.
Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. 2004. kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9), 1–20.
Kashima, H., and Koyanagi, T. 2002. Kernels for Semi-Structured Data. Pages 291–298 of: Proceedings of the Nineteenth International Conference on Machine Learning.
Kashima, H., Tsuda, K., and Inokuchi, A. 2003. Marginalized Kernels between Labeled Graphs. Pages 321–328 of: Proceedings of the Twentieth International Conference on Machine Learning.
Kato, T., Kashima, H., Sugiyama, M., and Asai, K. 2010. Conic Programming for Multi-Task Learning. IEEE Transactions on Knowledge and Data Engineering, 22(7), 957—968.
Kawahara, Y., and Sugiyama, M. 2011. Sequential Change-Point Detection Based on Direct Density-Ratio Estimation. Statistical Analysis and Data Mining. To appear.
Kawanabe, M., Sugiyama, M., Blanchard, G., and Müller, K.-R. 2007. A New Algorithm of Non-Gaussian Component Analysis with Radial Kernel Functions. Annals of the Institute of Statistical Mathematics, 59(1), 57–75.
Ke, Y., Sukthankar, R., and Hebert, M. 2007. Event Detection in Crowded Videos. Pages 1-8 of: Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV2007).
Keziou, A. 2003a. Dual Representation of ϕ-Divergences and Applications. Comptes Rendus Mathématique, 336(10), 857–862.
Keziou, A. 2003b. Utilisation Des Divergences Entre Mesures en Statistique Inferentielle. Ph.D. thesis, UPMC University. in French.
Keziou, A., and Leoni-Aubin, S. 2005. Test of Homogeneity in Semiparametric Two-sample Density Ratio Models. Comptes Rendus Mathématique, 340(12), 905—910.
Keziou, A., and Leoni-Aubin, S. 2008. On Empirical Likelihood for Semiparametric Two-Sample Density Ratio Models. Journal of Statistical Planning and Inference, 138(4), 915–928.
Khan, S., Bandyopadhyay, S., Ganguly, A., and Saigal, S. 2007. Relative Performance of Mutual Information Estimation Methods for Quantifying the Dependence among Short and Noisy Data. Physical Review E, 76, 026209.
Kifer, D., Ben-David, S, and Gehrke, J. 2004. Detecting Change in Data Streams. Pages 180-191 of: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB2004).
Kimeldorf, G. S., and Wahba, G. 1971. Some Results on Tchebycheffian Spline Functions. Journal of Mathematical Analysis and Applications, 33(1), 82–95.
Kimura, M., and Sugiyama, M. 2011. Dependence-Maximization Clustering with Least-Squares Mutual Information. Journal of Advanced Computational Intelligence and Intelligent Informatics, 15(7), 800–805.
Koh, K., Kim, S.-J., and Boyd, S. P. 2007. An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression. Journal of Machine Learning Research, 8, 1519–1555.
Kohonen, T. 1988. Learning Vector Quantization. Neural Networks, 1 (Supplementary 1), 303.
Kohonen, T. 1995. Self-Organizing Maps.Berlin, Germany: Springer.
Koltchinskii, V. 2006. Local Rademacher Complexities and Oracle Inequalities in Risk Minimization. The Annals of Statistics, 34, 2593–2656.
Kondor, R. I., and Lafferty, J. 2002. Diffusion Kernels on Graphs and Other Discrete Input Spaces. Pages 315–322 of: Proceedings of the Nineteenth International Conference on Machine Learning.
Konishi, S., and Kitagawa, G. 1996. Generalized Information Criteria in Model Selection. Biometrika, 83(4), 875–890.
Korostelëv, A. P., and Tsybakov, A. B. 1993. Minimax Theory of Image Reconstruction.New York: Springer.
Kraskov, A., Stögbauer, H., and Grassberger, P. 2004. Estimating Mutual Information. Physical Review E, 69(6), 066138.
Kullback, S. 1959. Information Theory and Statistics.New York: Wiley.
Kullback, S., and Leibler, R. A. 1951. On Information and Sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Kurihara, N., Sugiyama, M., Ogawa, H., Kitagawa, K., and Suzuki, K. 2010. Iteratively-Reweighted Local Model Fitting Method for Adaptive and Accurate Single-Shot Surface Profiling. Applied Optics, 49(22), 4270–4277.
Lafferty, J., McCallum, A., and Pereira, F. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Pages 282—289 of: Proceedings of the 18th International Conference on Machine Learning.
Lagoudakis, M. G., and Parr, R. 2003. Least-Squares Policy Iteration. Journal of Machine Learning Research, 4, 1107—1149.
Lapedriza, À., Masip, D., and Vitrià, J. 2007. A Hierarchical Approach for Multi-task Logistic Regression. Pages 258–265 of: Martí, J., Benedí, J. M., Mendonça, A. M., and Serrat, J. (eds), Proceedings of the 3rd Iberian Conference on Pattern Recognition and Image Analysis, Part II. Lecture Notes in Computer Science, vol. 4478. Berlin, Germany: Springer-Verlag.
Larsen, J., and Hansen, L. K. 1996. Linear Unlearning for Cross-Validation. Advances in Computational Mathematics, 5, 269–280.
Latecki, L. J., Lazarevic, A., and Pokrajac, D. 2007. Outlier Detection with Kernel Density Functions. Pages 61–75 of: Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition.
Lee, T.-W., Girolami, M., and Sejnowski, T. J. 1999. Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources. Neural Computation, 11(2), 417–441.
Lehmann, E. L. 1986. Testing Statistical Hypotheses. 2nd edn. New York: Wiley.
Lehmann, E. L., and Casella, G. 1998. Theory of Point Estimation. 2nd edn. New York: Springer.
Li, K. 1991. Sliced Inverse Regression for Dimension Reduction. Journal of the American Statistical Association, 86(414), 316–342.
Li, K. 1992. On Principal Hessian Directions for Data Visualization and Dimension Reduction: Another Application of Stein's Lemma. Journal of the American Statistical Association, 87(420), 1025–1039.
Li, K. C., Lue, H. H., and Chen, C. H. 2000. Interactive Tree-structured Regression via Principal Hessian Directions. Journal of the American Statistical Association, 95(450), 547—560.
Li, L., and Lu, W. 2008. Sufficient Dimension Reduction with Missing Predictors. Journal of the American Statistical Association, 103(482), 822–831.
Li, Q. 1996. Nonparametric Testing of Closeness between Two Unknown Distribution Functions. Econometric Reviews, 15(3), 261—274.
Li, Y., Liu, Y., and Zhu, J. 2007. Quantile Regression in Reproducing Kernel Hilbert Spaces. Journal of the American Statistical Association, 102(477), 255–268.
Li, Y., Kambara, H., Koike, Y., and Sugiyama, M. 2010. Application of Covariate Shift Adaptation Techniques in Brain Computer Interfaces. IEEE Transactions on Biomedical Engineering, 57(6), 1318–1324.
Lin, Y. 2002. Support Vector Machines and the Bayes Rule in Classification. Data Mining and Knowledge Discovery, 6(3), 259–275.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. 2002. Text Classification Using String Kernels. Journal of Machine Learning Research, 2, 419–44.
Luenberger, D., and Ye, Y. 2008. Linear and Nonlinear Programming.Reading, MA: Springer.
Luntz, A., and Brailovsky, V. 1969. On Estimation of Characters Obtained in Statistical Procedure of Recognition. Technicheskaya Kibernetica, 3. in Russian.
MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms.Cambridge, UK: Cambridge University Press.
MacQueen, J. B. 1967. Some Methods for Classification and Analysis of Multivariate Observations. Pages 281-297 of: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Berkeley: University of California Press.
Mallows, C. L. 1973. Some Comments on CP. Technometrics, 15(4), 661—675.
Manevitz, L. M., and Yousef, M. 2002. One-Class SVMs for Document Classification. Journal of Machine Learning Research, 2, 139–154.
Meila, M., and Heckerman, D. 2001. An Experimental Comparison of Model-Based Clustering Methods. Machine Learning, 42(1/2), 9.
Mendelson, S. 2002. Improving the Sample Complexity Using Global Data. IEEE Transactions on Information Theory, 48(7), 1977–1991.
Mercer, J. 1909. Functions of Positive and Negative Type and Their Connection with the Theory of Integral Equations. Philosophical Transactions of the Royal Society of London, A-209, 415–46.
Micchelli, C. A., and Pontil, M. 2005. Kernels for Multi-Task Learning. Pages 921-928 of: Saul, L. K., Weiss, Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems 17.Cambridge, MA: MIT Press.
Minka, T. P. 2007. A Comparison of Numerical Optimizers for Logistic Regression. Tech. rept. Microsoft Research.
Moré, J. J., and Sorensen, D. C. 1984. Newton's Method. In: Golub, G. H. (ed), Studies in Numerical Analysis.Washington, DC: Mathematical Association of America.
Mori, S., Sugiyama, M., Ogawa, H., Kitagawa, K., and Irie, K. 2011. Automatic Parameter Optimization of the Local Model Fitting Method for Single-shot Surface Profiling. Applied Optics, 50(21), 3773–3780.
Müller, A. 1997. Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability, 29, 429—443.
Murad, U., and Pinkas, G. 1999. Unsupervised Profiling for Identifying Superimposed Fraud. Pages 251-261 of: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD1999).
Murata, N., Yoshizawa, S., and Amari, S. 1994. Network Information Criterion — Determining the Number of Hidden Units for an Artificial Neural Network Model. IEEE Transactions on Neural Networks, 5(6), 865–872.
Ng, A. Y., Jordan, M. I., and Weiss, Y. 2002. On Spectral Clustering: Analysis and an Algorithm. Pages 849–856 of: Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds), Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. 2010. Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
Nishimori, Y., and Akaho, S. 2005. Learning Algorithms Utilizing Quasi-geodesic Flows on the Stiefel Manifold. Neurocomputing, 67, 106–135.
Oja, E. 1982. A Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology, 15(3), 267–273.
Oja, E. 1989. Neural Networks, Principal Components and Subspaces. International Journal of Neural Systems, 1, 61–68.
Patriksson, M. 1999. Nonlinear Programming and Variational Inequality Problems. Dordrecht, the Netherlands: Kluwer Academic.
Pearl, J. 2000. Causality: Models, Reasoning and Inference. New York: Cambridge University Press.
Pearson, K. 1900. On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling. Philosophical Magazine Series 5, 50(302), 157–175.
Pérez-Cruz, F. 2008. Kullback-Leibler Divergence Estimation of Continuous Distributions. Pages 1666—1670 of: Proceedings of IEEE International Symposium on Information Theory.
Platt, J. 1999. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. Pages 169-184 of: Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds), Advances in Kernel Methods—Support Vector Learning.Cambridge, MA: MIT Press.
Platt, J. 2000. Probabilities for SV Machines. In: Smola, A. J., Bartlett, P. L., Schölkopf, B., and Schuurmans, D. (eds), Advances in Large Margin Classifiers.Cambridge, MA: MIT Press.
Plumbley, M. D. 2005. Geometrical Methods for Non-Negative ICA: Manifolds, Lie Groups and Toral Subalgebras. Neurocomputing, 67(Aug.), 161–197.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1992. Numerical Recipes in C. 2nd edn. Cambridge, UK: Cambridge University Press.
Pukelsheim, F. 1993. Optimal Design of Experiments.New York: Wiley.
Qin, J. 1998. Inferences for Case-control and Semiparametric Two-sample Density Ratio Models. Biometrika, 85(3), 619–630.
Qing, W., Kulkarni, S. R., and Verdu, S. 2006. A Nearest-Neighbor Approach to Estimating Divergence between Continuous Random Vectors. Pages 242-246 of: Proceedings of IEEE International Symposium on Information Theory.
Quadrianto, N., Smola, A. J., Song, L., and Tuytelaars, T. 2010. Kernelized Sorting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1809—1821.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. (eds). 2009. Dataset Shift in Machine Learning.Cambridge, MA: MIT Press.
R Development Core Team. 2009. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.r-project.org.
Rao, C. 1945. Information and the Accuracy Attainable in the Estimation of Statistical Parameters. Bulletin of the Calcutta Mathematics Society, 37, 81–89.
Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning.Cambridge, MA: MIT Press.
Rätsch, G., Onoda, T., and Müller, K.-R. 2001. Soft Margins for AdaBoost. Machine Learning, 42(3), 287–320.
Reiss, P. T., and Ogden, R. T. 2007. Functional Principal Component Regression and Functional Partial Least Squares. Journal of the American Statistical Association, 102(479), 984—996.
Rifkin, R., Yeo, G., and Poggio, T. 2003. Regularized Least-Squares Classification. Pages 131—154 of: Suykens, J. A. K., Horvath, G., Basu, S., Micchelli, C., and Vandewalle, J. (eds), Advances in Learning Theory: Methods, Models and Applications. NATO Science Series III: Computer & Systems Sciences, vol. 190. Amsterdam, the Netherlands: IOS Press.
Rissanen, J. 1978. Modeling by Shortest Data Description. Automatica, 14(5), 465–471.
Rissanen, J. 1987. Stochastic Complexity. Journal of the Royal Statistical Society, Series B, 49(3), 223–239.
Rockafellar, R. T. 1970. Convex Analysis.Princeton, NJ: Princeton University Press.
Rosenblatt, M. 1956. Remarks on Some Nonparametric Estimates of a Density Function. Annals of Mathematical Statistics, 27, 832–837.
Roweis, S., and Saul, L. 2000. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), 2323–2326.
Sankar, A., Spielman, D. A., and Teng, S.-H. 2006. Smoothed Analysis of the Condition Numbers and Growth Factors of Matrices. SIAM Journal on Matrix Analysis and Applications, 28(2), 446–476.
Saul, L. K., and Roweis, S. T. 2003. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research, 4(Jun), 119—155.
Schapire, R., Freund, Y., Bartlett, P., and Lee, W. Sun. 1998. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Annals of Statistics, 26, 1651–1686.
Scheinberg, K. 2006. An Efficient Implementation of an Active Set Method for SVMs. Journal of Machine Learning Research, 7, 2237–2257.
Schmidt, M. 2005. minFunc. http://people.cs.ubc.ca/∼schmidtm/Software/minFunc.html.
Schölkopf, B., and Smola, A. J. 2002. Learning with Kernels.Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., and Müller, K.-R. 1998. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5), 1299—1319.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7), 1443–1471.
Schwarz, G. 1978. Estimating the Dimension of a Model. The Annals of Statistics, 6, 461–464.
Shi, J., and Malik, J. 2000. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Shibata, R. 1981. An Optimal Selection of Regression Variables. Biometrika, 68(1), 45–54.
Shibata, R. 1989. Statistical Aspects of Model Selection. Pages 215-240 of: Willems, J. C. (ed), From Data to Model.New York: Springer-Verlag.
Shimodaira, H. 2000. Improving Predictive Inference under Covariate Shift by Weighting the Log-Likelihood Function. Journal of Statistical Planning and Inference, 90(2), 227–244.
Silva, J., and Narayanan, S. 2007. Universal Consistency of Data-Driven Partitions for Divergence Estimation. Pages 2021–2025 of: Proceedings of IEEE International Symposium on Information Theory.
Simm, J., Sugiyama, M., and Kato, T. 2011. Computationally Efficient Multi-task Learning with Least-Squares Probabilistic Classifiers. IPSJ Transactions on Computer Vision and Applications, 3, 1–8.
Smola, A., Song, L., and Teo, C. H. 2009. Relative Novelty Detection. Pages 536—543 of: van Dyk, D., and Welling, M. (eds), Proceedings of Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS2009). JMLR Workshop and Conference Proceedings, vol. 5.
Song, L., Smola, A., Gretton, A., and Borgwardt, K. 2007a. A Dependence Maximization View of Clustering. Pages 815-822 of: Ghahramani, Z. (ed), Proceedings of the 24th Annual International Conference on Machine Learning (ICML2007).
Song, L., Smola, A., Gretton, A., Borgwardt, K. M., and Bedo, J. 2007b. Supervised Feature Selection via Dependence Estimation. Pages 823—830 of: Ghahramani, Z. (ed), Proceedings of the 24th Annual International Conference on Machine Learning (ICML2007).
Spielman, D. A., and Teng, S.-H. 2004. Smoothed Analysis of Algorithms: Why the Simplex Algorithm Usually Takes Polynomial Time. Journal of the ACM, 51(3), 385–463.
Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G., and Schölkopf, B. 2009. Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions. Pages 1750—1758 of: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A. (eds), Advances in Neural Information Processing Systems 22.Cambridge, MA: MIT Press.
Steinwart, I. 2001. On the Influence of the Kernel on the Consistency of Support Vector Machines. Journal of Machine Learning Research, 2, 67–93.
Steinwart, I., Hush, D., and Scovel, C. 2009. Optimal Rates for Regularized Least Squares Regression. Pages 79-93 of: Proceedings of the Annual Conference on Learning Theory.
Stone, M. 1974. Cross-validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, Series B, 36, 111–147.
Storkey, A., and Sugiyama, M. 2007. Mixture Regression for Covariate Shift. Pages 1337-1344 of: Schölkopf, B., Platt, J. C., and Hoffmann, T. (eds), Advances in Neural Information Processing Systems 19.Cambridge, MA: MIT Press.
Student. 1908. The Probable Error of a Mean. Biometrika, 6, 1–25.
Sugiyama, M. 2006. Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error. Journal of Machine Learning Research, 7(Jan.), 141–166.
Sugiyama, M. 2007. Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis. Journal of Machine Learning Research, 8(May), 1027—1061.
Sugiyama, M. 2009. On Computational Issues of Semi-supervised Local Fisher Discriminant Analysis. IEICE Transactions on Information and Systems, E92-D(5), 1204–1208.
Sugiyama, M. 2010. Superfast-Trainable Multi-Class Probabilistic Classifier by Least-Squares Posterior Fitting. IEICE Transactions on Information and Systems, E93-D(10), 2690—2701.
Sugiyama, M., and Kawanabe, M. 2012. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. Cambridge, MA: MIT Press. To appear.
Sugiyama, M., and Müller, K.-R. 2002. The Subspace Information Criterion for Infinite Dimensional Hypothesis Spaces. Journal of Machine Learning Research, 3(Nov.), 323–359.
Sugiyama, M., and Müller, K.-R. 2005. Input-Dependent Estimation of Generalization Error under Covariate Shift. Statistics & Decisions, 23(4), 249–279.
Sugiyama, M., and Nakajima, S. 2009. Pool-based Active Learning in Approximate Linear Regression. Machine Learning, 75(3), 249—274.
Sugiyama, M., and Ogawa, H. 2000. Incremental Active Learning for Optimal Generalization. Neural Computation, 12(12), 2909–2940.
Sugiyama, M., and Ogawa, H. 2001a. Active Learning for Optimal Generalization in Trigonometric Polynomial Models. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E84-A(9), 2319–2329.
Sugiyama, M., and Ogawa, H. 2001b. Subspace Information Criterion for Model Selection. Neural Computation, 13(8), 1863–1889.
Sugiyama, M., and Ogawa, H. 2003. Active Learning with Model Selection—Simultaneous Optimization of Sample Points and Models for Trigonometric Polynomial Models. IEICE Transactions on Information and Systems, E86-D(12), 2753–2763.
Sugiyama, M., and Rubens, N. 2008. A Batch Ensemble Approach to Active Learning with Model Selection. Neural Networks, 21(9), 1278–1286.
Sugiyama, M., and Suzuki, T. 2011. Least-Squares Independence Test. IEICE Transactions on Information and Systems, E94-D(6), 1333–1336.
Sugiyama, M., Kawanabe, M., and Müller, K.-R. 2004. Trading Variance Reduction with Unbiasedness: The Regularized Subspace Information Criterion for Robust Model Selection in Kernel Regression. Neural Computation, 16(5), 1077–1104.
Sugiyama, M., Ogawa, H., Kitagawa, K., and Suzuki, K. 2006. Single-shot Surface Profiling by Local Model Fitting. Applied Optics, 45(31), 7999–8005.
Sugiyama, M., Krauledat, M., and Müller, K.-R. 2007. Covariate Shift Adaptation by Importance Weighted Cross Validation. Journal of Machine Learning Research, 8(May), 985–1005.
Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. 2008. Direct Importance Estimation for Covariate Shift Adaptation. Annals of the Institute of Statistical Mathematics, 60(4), 699–746.
Sugiyama, M., Kanamori, T., Suzuki, T., Hido, S., Sese, J., Takeuchi, I., and Wang, L. 2009. A Density-ratio Framework for Statistical Data Processing. IPSJ Transactions on Computer Vision and Applications, 1, 183–208.
Sugiyama, M., Kawanabe, M., and Chui, P. L. 2010a. Dimensionality Reduction for Density Ratio Estimation in High-dimensional Spaces. Neural Networks, 23(1), 44–59.
Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., and Okanohara, D. 2010b. Least-squares Conditional Density Estimation. IEICE Transactions on Information and Systems, E93-D(3), 583–594.
Sugiyama, M., Idé, T., Nakajima, S., and Sese, J. 2010c. Semi-supervised Local Fisher Discriminant Analysis for Dimensionality Reduction. Machine Learning, 78(1-2), 35–61.
Sugiyama, M., Suzuki, T., and Kanamori, T. 2011a. Density Ratio Matching under the Bregman Divergence: A Unified Framework of Density Ratio Estimation. Annals of the Institute of Statistical Mathematics. To appear.
Sugiyama, M., Yamada, M., von Bünau, P., Suzuki, T., Kanamori, T., and Kawanabe, M. 2011b. Direct Density-ratio Estimation with Dimensionality Reduction via Least-squares Hetero-distributional Subspace Search. Neural Networks, 24(2), 183–198.
Sugiyama, M., Suzuki, T., Itoh, Y., Kanamori, T., and Kimura, M. 2011c. Least-Squares Two-Sample Test. Neural Networks, 24(7), 735–751.
Sugiyama, M., Yamada, M., Kimura, M., and Hachiya, H. 2011d. On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution. In: Proceedings of 28th International Conference on Machine Learning (ICML2011),65–72.
Sutton, R. S., and Barto, G. A. 1998. Reinforcement Learning: An Introduction.Cambridge, MA: MIT Press.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. 2002. Least Squares Support Vector Machines. Singapore: World Scientific Pub. Co.
Suzuki, T., and Sugiyama, M. 2010. Sufficient Dimension Reduction via Squared-loss Mutual Information Estimation. Pages 804–811 of: Teh, Y. W., and Titterington, M. (eds), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS2010). JMLR Workshop and Conference Proceedings, vol. 9.
Suzuki, T., and Sugiyama, M. 2011. Least-Squares Independent Component Analysis. Neural Computation, 23(1), 284–301.
Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. 2008. Approximating Mutual Information by Maximum Likelihood Density Ratio Estimation. Pages 5–20 of: Saeys, Y., Liu, H., Inza, I., Wehenkel, L., and Van de Peer, Y. (eds), Proceedings of ECML-PKDD2008 Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery 2008 (FSDM2008). JMLR Workshop and Conference Proceedings, vol. 4.
Suzuki, T., Sugiyama, M., and Tanaka, T. 2009a. Mutual Information Approximation via Maximum Likelihood Estimation of Density Ratio. Pages 463—467 of: Proceedings of 2009 IEEE International Symposium on Information Theory (ISIT2009).
Suzuki, T., Sugiyama, M., Kanamori, T., and Sese, J. 2009b. Mutual Information Estimation Reveals Global Associations between Stimuli and Biological Processes. BMC Bioinformatics, 10(1), S52.
Suzuki, T., Sugiyama, M., and Tanaka, T. 2011. Mutual Information Approximation via Maximum Likelihood Estimation of Density Ratio. In preparation.
Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. 2006. Nonparametric Quantile Estimation. Journal of Machine Learning Research, 7, 1231–1264.
Takeuchi, I., Nomura, K., and Kanamori, T. 2009. Nonparametric Conditional Density Estimation Using Piecewise-linear Solution Path of Kernel Quantile Regression. Neural Computation, 21(2), 533–559.
Takeuchi, K. 1976. Distribution of Information Statistics and Validity Criteria of Models. Mathematical Science, 153, 12–18 (in Japanese).
Takimoto, M., Matsugu, M., and Sugiyama, M. 2009. Visual Inspection of Precision Instruments by Least-Squares Outlier Detection. Pages 22-26 of: Proceedings of The Fourth International Workshop on Data-Mining and Statistical Science (DMSS2009).
Talagrand, M. 1996a. New Concentration Inequalities in Product Spaces. Inventiones Mathematicae, 126, 505–563.
Talagrand, M. 1996b. A New Look at Independence. The Annals of Statistics, 24, 1–34.
Tang, Y., and Zhang, H. H. 2006. Multiclass Proximal Support Vector Machines. Journal of Computational and Graphical Statistics, 15(2), 339—355.
Tao, T., and Vu, V. H. 2007. The Condition Number of a Randomly Perturbed Matrix. Pages 248–255 of: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing. New York: ACM.
Tax, D. M. J., and Duin, R. P. W. 2004. Support Vector Data Description. Machine Learning, 54(1), 45–66.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. 2000. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2319–2323.
Teo, C. H., Le, Q., Smola, A., and Vishwanathan, S. V. N. 2007. A Scalable Modular Convex Solver for Regularized Risk Minimization. Pages 727–736 of: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2007).
Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
Tipping, M. E., and Bishop, C. M. 1999. Mixtures of Probabilistic Principal Component Analyzers. Neural Computation, 11(2), 443–482.
Tresp, V. 2001. Mixtures of Gaussian Processes. Pages 654-660 of: Leen, T. K., Dietterich, T. G., and Tresp, V. (eds), Advances in Neural Information Processing Systems 13.Cambridge, MA: MIT Press.
Tsang, I., Kwok, J., and Cheung, P.-M. 2005. Core Vector Machines: Fast SVM Training on Very Large Data Sets. Journal of Machine Learning Research, 6, 363–392.
Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., and Sugiyama, M. 2009. Direct Density Ratio Estimation for Large-scale Covariate Shift Adaptation. Journal of Information Processing, 17, 138–155.
Ueki, K., Sugiyama, M., and Ihara, Y. 2011. Lighting Condition Adaptation for Perceived Age Estimation. IEICE Transactions on Information and Systems, E94-D(2), 392—395.
van de Geer, S. 2000. Empirical Processes in M-Estimation.Cambridge, UK: Cambridge University Press.
van der Vaart, A. W. 1998. Asymptotic Statistics.Cambridge, UK: Cambridge University Press.
van der Vaart, A. W., and Wellner, J. A. 1996. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer-Verlag.
Vapnik, V. N. 1998. Statistical Learning Theory.New York: Wiley.
Wahba, G. 1990. Spline Models for Observational Data.Philadelphia, PA: Society for Industrial and Applied Mathematics.
Wang, Q., Kulkarni, S. R., and Verdú, S. 2005. Divergence Estimation of Continuous Distributions Based on Data-Dependent Partitions. IEEE Transactions on Information Theory, 51(9), 3064–3074.
Watanabe, S. 2009. Algebraic Geometry and Statistical Learning Theory.Cambridge, UK: Cambridge University Press.
Weinberger, K., Blitzer, J., and Saul, L. 2006. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Pages 1473-1480 of: Weiss, Y., Schölkopf, B., and Platt, J. (eds), Advances in Neural Information Processing Systems 18.Cambridge, MA: MIT Press.
Weisberg, S. 1985. Applied Linear Regression.New York: John Wiley.
Wichern, G., Yamada, M., Thornburg, H., Sugiyama, M., and Spanias, A. 2010. Automatic Audio Tagging Using Covariate Shift Adaptation. Pages 253–256 of: Proceedings of the 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2010).
Wiens, D. P. 2000. Robust Weights and Designs for Biased Regression Models: Least Squares and Generalized M-Estimation. Journal of Statistical Planning and Inference, 83(2), 395–412.
Williams, P. M. 1995. Bayesian Regularization and Pruning Using a Laplace Prior. Neural Computation, 7(1), 117–143.
Wold, H. 1966. Estimation of Principal Components and Related Models by Iterative Least Squares. Pages 391–420 of: Krishnaiah, P. R. (ed), Multivariate Analysis. New York: Academic Press.
Wolff, R. C. L., Yao, Q., and Hall, P. 1999. Methods for Estimating a Conditional Distribution Function. Journal of the American Statistical Association, 94(445), 154—163.
Wu, T.-F., Lin, C.-J., and Weng, R. C. 2004. Probability Estimates for Multi-Class Classification by Pairwise Coupling. Journal of Machine Learning Research, 5, 975–1005.
Xu, L., Neufeld, J., Larson, B., and Schuurmans, D. 2005. Maximum Margin Clustering. Pages 1537—1544 of: Saul, L. K., Weiss, Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems 17.Cambridge, MA: MIT Press.
Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. 2007. Multi-Task Learning for Classification with Dirichlet Process Priors. Journal of Machine Learning Research, 8, 35—63.
Yamada, M., and Sugiyama, M. 2009. Direct Importance Estimation with Gaussian Mixture Models. IEICE Transactions on Information and Systems, E92-D(10), 2159–2162.
Yamada, M., and Sugiyama, M. 2010. Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise. Pages 643—648 of: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI2010).Atlanta, GA: AAAI Press.
Yamada, M., and Sugiyama, M. 2011a. Cross-Domain Object Matching with Model Selection. Pages 807–815 of: Gordon, G., Dunson, D., and Dudík, M. (eds), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS2011). JMLR Workshop and Conference Proceedings, vol. 15.
Yamada, M., and Sugiyama, M. 2011b. Direct Density-Ratio Estimation with Dimensionality Reduction via Hetero-Distributional Subspace Analysis. Pages 549–554 of: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI2011). San Francisco: AAAI Press.
Yamada, M., Sugiyama, M., Wichern, G., and Simm, J. 2010a. Direct Importance Estimation with a Mixture of Probabilistic Principal Component Analyzers. IEICE Transactions on Information and Systems, E93-D(10), 2846–2849.
Yamada, M., Sugiyama, M., and Matsui, T. 2010b. Semi-supervised Speaker Identification under Covariate Shift. Signal Processing, 90(8), 2353–2361.
Yamada, M., Sugiyama, M., Wichern, G., and Simm, J. 2011a. Improving the Accuracy of Least-Squares Probabilistic Classifiers. IEICE Transactions on Information and Systems, E94-D(6), 1337–1340.
Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. 2011b. Relative Density-Ratio Estimation for Robust Distribution Comparison. To appear in: Advances in Neural Information Processing Systems 24.
Yamada, M., Niu, G., Takagi, J., and Sugiyama, M. 2011c. Computationally Efficient Sufficient Dimension Reduction via Squared-Loss Mutual Information. Pages 247–262 of: Hsu, C.-N., and Lee, W. S. (eds), Proceedings of the Third Asian Conference on Machine Learning (ACML2011). JMLR Workshop and Conference Proceedings, vol. 20.
Yamanishi, K., and Takeuchi, J. 2002. A Unifying Framework for Detecting Outliers and Change Points from Non-Stationary Time Series Data. Pages 676-681 of: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2002).
Yamanishi, K., Takeuchi, J., Williams, G., and Milne, P. 2004. On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge Discovery, 8(3), 275–300.
Yamazaki, K., Kawanabe, M., Watanabe, S., Sugiyama, M., and Müller, K.-R. 2007. Asymptotic Bayesian Generalization Error When Training and Test Distributions Are Different. Pages 1079—1086 of: Ghahramani, Z. (ed), Proceedings of 24th International Conference on Machine Learning (ICML2007).
Yankov, D., Keogh, E., and Rebbapragada, U. 2008. Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets. Knowledge and Information Systems, 17(2), 241–262.
Yokota, T., Sugiyama, M., Ogawa, H., Kitagawa, K., and Suzuki, K. 2009. The Interpolated Local Model Fitting Method for Accurate and Fast Single-shot Surface Profiling. Applied Optics, 48(18), 3497–3508.
Yu, K., Tresp, V., and Schwaighofer, A. 2005. Learning Gaussian Processes from Multiple Tasks. Pages 1012-1019 of: Proceedings of the 22nd International Conference on Machine Learning (ICML2005).New York: ACM.
Zadrozny, B. 2004. Learning and Evaluating Classifiers under Sample Selection Bias. Pages 903—910 of: Proceedings of the Twenty-First International Conference on Machine Learning (ICML2004).New York: ACM.
Zeidler, E. 1986. Nonlinear Functional Analysis and Its Applications, I: Fixed-Point Theorems.New York: Springer-Verlag.
Zelnik-Manor, L., and Perona, P. 2005. Self-Tuning Spectral Clustering. Pages 1601—1608 of: Saul, L. K., Weiss, Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems 17.Cambridge, MA: MIT Press.
Zhu, L., Miao, B., and Peng, H. 2006. On Sliced Inverse Regression with High-Dimensional Covariates. Journal of the American Statistical Association, 101(474), 630—643.
