Alpaydın, E., 2004. Introduction to Machine Learning. Cambridge, MA: MIT Press.
Baeza-Yates, R., and Ribeiro-Neto, B., 1999. Modern Information Retrieval. New York: Addison-Wesley.
Branavan, S. R. K., Deshpande, P., and Barzilay, R., 2007. Generating a table-of-contents. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, pp. 176–83.
Brugger, R., Zramdini, A., and Ingold, R., 1997. Modeling documents for structure recognition using generalized n-grams. In Proceedings of International Conference on Document Analysis and Recognition, Ulm, Germany, pp. 56–60.
Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., 2001. Accordion summarization for end-game browsing on PDAs and cellular phones. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, pp. 213–220.
Buyukkokten, O., Kaljuvee, O., Garcia-Molina, H., Paepcke, A., and Winograd, T., 2002. Efficient web browsing on handheld devices using page and form summarization. ACM Transactions on Information Systems 20 (1): 82–115.
Chaudhuri, B. B., 2006. Digital Document Processing: Major Directions and Recent Advances. London: Springer.
Chawla, N. V., 2005. Data mining for imbalanced datasets: an overview. In Maimon, O., and Rokach, L. (eds.), Data Mining and Knowledge Discovery Handbook, pp. 853–67. New York: Springer.
Chen, Y., Ma, W., and Zhang, H., 2003. Detecting web page structure for adaptive viewing on small form factor devices. In Proceedings of the 12th International World Wide Web Conference, New York, NY, USA, pp. 225–33.
Collins, M., and Roark, B., 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 111–8.
Covington, M. A., 2001. A fundamental algorithm for dependency parsing. In Proceedings of ACM Southeast Conference, Athens, GA, USA, pp. 95–102.
Curran, J. R., and Wong, R. K., 1999. Transformation-based learning for automatic translation from HTML to XML. In Proceedings of the 4th Australasian Document Computing Symposium, Coffs Harbour, NSW, Australia, pp. 55–62.
Desmarais, F.-X., Gagnon, M., and Zouaq, A., 2012. Comparing a rule-based and a machine learning approach for semantic analysis. In Proceedings of the 6th International Conference on Advances in Semantic Processing, Barcelona, Spain, pp. 103–8.
Feng, J., Haffner, P., and Gilbert, M., 2005. A learning approach to discovering web page semantic structures. In Proceedings of the 8th International Conference on Document Analysis and Recognition, Washington, DC, pp. 1055–9.
Ganganwar, V., 2012. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering 2 (4): 42–7.
Gupta, S., Kaiser, G., Neistadt, D., and Grimm, P., 2003. DOM-based content extraction of HTML documents. In Proceedings of the 12th International Conference on World Wide Web, New York, NY, USA, pp. 207–14.
Gupta, S., Kaiser, G., and Stolfo, S., 2005. Extracting context to improve accuracy for HTML content extraction. In Proceedings of the 14th International Conference on World Wide Web, New York, NY, USA, pp. 1114–5.
Hastie, T., Tibshirani, R., and Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer.
Hobson, S. P., Dorr, B. J., Monz, C., and Schwartz, R., 2007. Task-based evaluation of text summarization using relevance prediction. Information Processing and Management 43 (6): 1482–99.
Indurkhya, N., and Damerau, F. J., 2010. Handbook of Natural Language Processing, 2nd ed. Boca Raton, FL: CRC Press.
Ingwersen, P., and Järvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Dordrecht: Springer.
Irmak, U., and Kraft, R., 2010. A scalable machine-learning approach for semi-structured named entity recognition. In Proceedings of the International World Wide Web Conference, New York, NY, USA, pp. 461–70.
Jensen, S. H., Madsen, M., and Møller, A., 2011. Modeling the HTML DOM and browser API in static analysis of JavaScript web applications. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, Szeged, Hungary, pp. 59–69.
Joachims, T., 1999. Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press.
Joachims, T., 2002. Learning to Classify Text Using Support Vector Machines. Boston: Kluwer Academic Publishers.
Kao, H.-Y., Ho, J.-M., and Chen, M.-S., 2004. DOMISA: DOM-based information space adsorption for web information hierarchy mining. In Proceedings of the 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, pp. 312–20.
Klink, S., Dengel, A., and Kieninger, T., 2000. Document structure analysis based on layout and textual features. In Proceedings of the International Workshop on Document Analysis Systems, Kaiserslautern, Germany, pp. 99–111.
Le, D. X., and Thoma, G. R., 2003. Automated document labeling for Web-based online medical journals. In Proceedings of the 7th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida, USA, pp. 411–15.
Li, Y., Wang, L., Wang, J., Yue, J., and Zhao, M., 2013. An approach of web page information extraction. In Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering, Hangzhou, China, pp. 2217–9.
Liu, Y., Wang, Q., and Wang, Q. X., 2006. A heuristic approach for topical information extraction from news pages. In Proceedings of the 7th International Conference on Web Information Systems Engineering, Wuhan, China, pp. 357–62.
Mandhani, B., and Meila, M., 2009. Tractable search for learning exponential models of rankings. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater, Florida, USA, pp. 392–9.
Mani, I., 2001. Automatic Summarisation. Amsterdam: John Benjamins.
Mani, I., Klein, G., House, D., Hirschman, L., Firmin, T., and Sundheim, B., 2002. SUMMAC: a text summarization evaluation. Natural Language Engineering 8 (1): 43–68.
Mao, S., Rosenfeld, A., and Kanungo, T., 2003. Document structure analysis algorithms: a literature survey. In Proceedings of SPIE Electronic Imaging, Santa Clara, California, USA, pp. 197–207.
Markey, K., 2007. Twenty-five years of end-user searching, Part 1: research findings. Journal of the American Society for Information Science and Technology 58 (8): 1071–81.
Mayfield, J., McNamee, P., Piatko, C., and Pearce, C., 2003. Lattice-based tagging using support vector machines. In Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, Louisiana, USA, pp. 303–8.
Mukherjee, S., Yang, G., Tan, W., and Ramakrishnan, I. V., 2003. Automatic discovery of semantic structures in HTML documents. In Proceedings of the 7th International Conference on Document Analysis and Recognition, Edinburgh, UK, pp. 245–9.
Niyogi, D., and Srihari, S. N., 1995. Knowledge-based derivation of document logical structure. In Proceedings of the International Conference on Document Analysis and Recognition, Montreal, Canada, pp. 472–5.
Pembe, F. C., and Güngör, T., 2009. Structure-preserving and query-biased document summarisation for web searching. Online Information Review 33 (4): 696–719.
Pinto, D., McCallum, A., Wei, X., and Croft, W. B., 2003. Table extraction using conditional random fields. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 235–42.
Platt, J. C., 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D. (eds.), Advances in Large Margin Classifiers, pp. 61–74. Cambridge, MA: MIT Press.
Rahman, A. F. R., Alam, H., and Hartono, R., 2001. Content extraction from HTML documents. In Proceedings of the 1st International Workshop on Web Document Analysis, Seattle, Washington, USA.
Shilman, M., Liang, P., and Viola, P., 2005. Learning non-generative grammatical models for document analysis. In Proceedings of the 10th IEEE International Conference on Computer Vision, Beijing, China, pp. 962–9.
Tombros, A., and Sanderson, M., 1998. Advantages of query biased summaries in information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 2–10.
Weiss, G. M., McCarthy, K., and Zabar, B., 2007. Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In Proceedings of the International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 35–41.
White, R. W., Jose, J. M., and Ruthven, I., 2003. A task-oriented study on the influencing effects of query-biased summarization in web searching. Information Processing and Management 39 (5): 707–33.
Xiao, X., Luo, Q., Hong, D., Fu, H., Xie, X., and Ma, W.-Y., 2009. Browsing on small displays by transforming web pages into hierarchically structured subpages. ACM Transactions on the Web 3 (1): 4:1–4:36.
Xue, Y., Hu, Y., Xin, G., Song, R., Shi, S., Cao, Y., Lin, C. Y., and Li, H., 2007. Web page title extraction and its application. Information Processing and Management 43 (5): 1332–47.
Yang, C. C., and Wang, F. L., 2008. Hierarchical summarization of large documents. Journal of the American Society for Information Science and Technology 59 (6): 887–902.
Yang, Y., and Zhang, H. J., 2001. HTML page analysis based on visual cues. In Proceedings of the 6th International Conference on Document Analysis and Recognition, Washington, USA, p. 859.