Skip to main content
×
×
Home

A tree-based learning approach for document structure analysis and its application to web search

  • F. CANAN PEMBE (a1) and TUNGA GÜNGÖR (a2)
Abstract

In this paper, we study the problem of structural analysis of Web documents aiming at extracting the sectional hierarchy of a document. In general, a document can be represented as a hierarchy of sections and subsections with corresponding headings and subheadings. We developed two machine learning models: heading extraction model and hierarchy extraction model. Heading extraction was formulated as a classification problem whereas a tree-based learning approach was employed in hierarchy extraction. For this purpose, we developed an incremental learning algorithm based on support vector machines and perceptrons. The models were evaluated in detail with respect to the performance of the heading and hierarchy extraction tasks. For comparison, a baseline rule-based approach was used that relies on heuristics and HTML document object model tree processing. The machine learning approach, which is a fully automatic approach, outperformed the rule-based approach. We also analyzed the effect of document structuring on automatic summarization in the context of Web search. The results of the task-based evaluation on TREC queries showed that structured summaries are superior to unstructured summaries both in terms of accuracy and user ratings, and enable the users to determine the relevancy of search results more accurately than search engine snippets.

Copyright
References
Hide All
Alpaydın, E., 2004. Introduction to Machine Learning. Cambridge, UK: MIT Press.
Baeza-Yates, R., and Ribeiro-Neto, B., 1999. Modern Information Retrieval. New York: Addison-Wesley.
Branavan, S. R. K., Deshpande, P., and Barzilay, R. 2007. Generating a table-of-contents. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 176183, Prague, Czech Republic.
Brugger, R., Zramdini, A., and Ingold, R., 1997. Modeling documents for structure recognition using generalized n-grams. In Proceedings of International Conference on Document Analysis and Recognition, Ulm, Germany, pp. 5660.
Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., 2001. Accordion summarization for end-game browsing on PDAs and cellular phones. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, pp. 213220.
Buyukkokten, O., Kaljuvee, O., Garcia-Molina, H., Paepcke, A., and Winograd, T., 2002. Efficient web browsing on handheld devices using page and form summarization. ACM Transactions on Information Systems 20 (1): 82115.
Chaudhuri, B. B., 2006. Digital Document Processing: Major Directions and Recent Advances. London: Springer.
Chawla, N. V. 2005. Data mining for imbalanced datasets: an overview. In Maimon, O., and Rokach, L. (eds.), Data Mining and Knowledge Discovery Handbook, pp. 853–67. New York: Springer.
Chen, Y., Ma, W., and Zhang, H., 2003. Detecting web page structure for adaptive viewing on small form factor devices. In Proceedings of the 12th International World Wide Web Conference, New York, NY, USA, pp. 225–33.
Cobra: Java HTML Renderer and Parser 2010. http://lobobrowser.org/cobra.jsp.
Collins, M., and Roark, B., 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 111–8.
Covington, M. A., 2001. A fundamental algorithm for dependency parsing. In Proceedings of ACM Southeast Conference, Athens, GA, USA, pp. 95102.
Curran, J. R., and Wong, R. K., 1999. Transformation-based learning for automatic translation from HTML to XML. In Proceedings of the 4th Australasian Document Computing Symposium, Coffs Harbour, NSW, Australia, pp. 5562.
Desmarais, F.-X., Gagnon, M., and Zouaq, A., 2012. Comparing a rule-based and a machine learning approach for semantic analysis. In Proceedings of the 6th International Conference on Advances in Semantic Processing, Barcelona, Spain, pp. 103–8.
Document Object Model 2012. http://www.w3.org/dom.
Feng, J., Haffner, P., and Gilbert, M., 2005. A learning approach to discovering web page semantic structures. In Proceedings of the 8th International Conference on Document Analysis and Recognition, Washington, DC, pp. 1055–9.
Ganganwar, V., 2012. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering 2 (4): 42–7.
Gupta, S., Kaiser, G., Neistadt, D., and Grimm, P., 2003. DOM-based content extraction of HTML documents. In Proceedings of the 12th International Conference on World Wide Web, New York, NY, USA, pp. 207–14.
Gupta, S., Kaiser, G., and Stolfo, S., 2005. Extracting context to improve accuracy for HTML content extraction. In Proceedings of the 14th International Conference on World Wide Web, New York, NY, USA, pp. 1114–5.
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 3rd ed.New York: Springer.
Hobson, S. P., Dorr, B. J., Monz, C., and Schwartz, R., 2007. Task-based evaluation of text summarization using relevance prediction. Information Processing and Management 43 (6): 1482–99.
HTML 4.01 Specification 1999. http://www.w3.org/TR/html401/.
Indurkhya, N., and Damerau, F. J. 2010. Handbook of Natural Language Processing, 2nd ed.Boca Raton, FL: CRC Press.
Ingwersen, P., and Jarvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Dordrecht: Springer.
Irmak, U., and Kraft, R., 2010. A scalable machine-learning approach for semi-structured named entity recognition. In Proceedings of the International World Wide Web Conference, New York, NY, USA, pp. 461–70.
Jensen, S. H., Madsen, M., and Moller, A., 2011. Modeling the HTML DOM and browser API in static analysis of JavaScript web applications. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, Szeged, Hungary, pp. 5969.
Joachims, T. 1999. Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning. Cambridge, UK: MIT Press.
Joachims, T., 2002. Learning to Classify Text Using Support Vector Machines. Boston: Kluwer Academic Publishers.
Kao, H.-Y., Ho, J.-M., and Chen, M.-S., 2004. DOMISA: DOM-based information space adsorption for web information hierarchy mining. In Proceedings of the 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, pp. 312–20.
Klink, S., Dengel, A., and Kieninger, T., 2000. Document structure analysis based on layout and textual features. In Proceedings of the International Workshop on Document Analysis Systems, Kaiserslautern, Germany, pp. 99111.
Le, D. X., and Thoma, G. R., 2003. Automated document labeling for Web-based online medical journals. In Proceedings of the 7th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida, USA, pp. 411–15.
Li, Y., Wang, L., Wang, J., Yue, J., and Zhao, M., 2013. An approach of web page information extraction. In Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering, Hangzhou, China, pp. 2217–9.
Liu, Y., Wang, Q., and Wang, QX., 2006. A heuristic approach for topical information extraction from news pages. In Proceedings of the 7th International Conference on Web Information Systems Engineering, Wuhan, China, pp. 357–62.
Mandhani, B., and Meila, M., 2009. Tractable search for learning exponential models of rankings. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater, Florida, USA, pp. 392–9.
Mani, I., 2001. Automatic Summarisation. Amsterdam: John Benjamins.
Mani, I., Klein, G., House, D., Hirschman, L., Firmin, T., and Sundheim, B., 2002. SUMMAC: a text summarization evaluation. Natural Language Engineering 8 (1): 4368.
Mao, S., Rosenfeld, A., and Kanungo, T., 2003. Document structure analysis algorithms: a literature survey. In Proceedings of SPIE Electronic Imaging, Santa Clara, California, USA, pp. 197207.
Markey, K., 2007. Twenty-five years of end-user searching, Part 1: research findings. Journal of the American Society for Information Science and Technology 58 (8): 1071–81.
Mayfield, J., McNamee, P., Piatko, C., and Pearce, C., 2003. Lattice-based tagging using support vector machines. In Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, Los Angeles, USA, pp. 303–8.
Mukherjee, S., Yang, G., Tan, W., and Ramakrishnan, I. V., 2003. Automatic discovery of semantic structures in HTML documents. In Proceedings of the 7th International Conference on Document Analysis and Recognition, Edinburgh, UK, pp. 245–9.
Niyogi, D., and Srihari, S. N., 1995. Knowledge-based derivation of document logical structure. In Proceedings of the International Conference on Document Analysis and Recognition, Montreal, Canada, pp. 472–5.
Pembe, F. C., and Güngör, T., 2009. Structure-preserving and query-biased document summarisation for web searching. Online Information Review 33 (4): 696719.
Pinto, D., McCallum, A., Wei, X., and Croft, W. B., 2003. Table extraction using conditional random fields. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 235–42.
Platt, J. C. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A., Bartlett, P., Scholkopf, B., and Schuurmans, D. (eds.), Advances in Large Margin Classifiers, pp. 6174. Cambridge, UK: MIT Press.
Rahman, A. F. R., Alam, H., and Hartono, R. 2001. Content extraction from HTML documents. In Proceedings of the 1st International Workshop on Web Document Analysis. Seattle, Washington, USA.
Shilman, M., Liang, P., and Viola, P., 2005. Learning non-generative grammatical models for document analysis. In Proceedings of the 10th IEEE International Conference on Computer Vision, Beijing, China, pp. 962–9.
Text Retrieval Conference 2010. http://trec.nist.gov.
Tombros, A., and Sanderson, M., 1998. Advantages of query biased summaries in information retrieval. In Proceedings of the 21th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 210.
Weiss, G. M., McCarthy, K., and Zabar, B. 2007. Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In Proceedings of the International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 3541.
White, R. W., Jose, J. M., and Ruthven, I., 2003. A task-oriented study on the influencing effects of query-biased summarization in web searching. Information Processing and Management 39 (5): 707–33.
Xiao, X., Luo, Q., Hong, D., Fu, H., Xie, X., and Ma, W-Y. 2009. Browsing on small displays by transforming web pages into hierarchically structured subpages. ACM Transactions on the Web 3 (1): 4:1–4:36.
Xue, Y., Hu, Y., Xin, G., Song, R., Shi, S., Cao, Y., Lin, C. Y., and Li, H., 2007. Web page title extraction and its application. Information Processing and Management 43 (5): 1332–47.
Yang, C. C., and Wang, F. L., 2008. Hierarchical summarization of large documents. Journal of the American Society for Information Science and Technology 59 (6): 887902.
Yang, Y., and Zhang, H. J., 2001. HTML page analysis based on visual cues. In Proceedings of the 6th International Conference on Document Analysis and Recognition, Washington, USA, p. 859.
Recommend this journal

Email your librarian or administrator to recommend adding this journal to your organisation's collection.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering
Please enter your name
Please enter a valid email address
Who would you like to send this to? *
×

Metrics

Full text views

Total number of HTML views: 4
Total number of PDF views: 43 *
Loading metrics...

Abstract views

Total abstract views: 486 *
Loading metrics...

* Views captured on Cambridge Core between September 2016 - 12th June 2018. This data will be updated every 24 hours.