Mining, analyzing, and modeling text written on mobile devices

  • K. Vertanen (a1) and P.O. Kristensson (a2)

Abstract

We present a method for mining the web for text entered on mobile devices. Using searching, crawling, and parsing techniques, we locate text that can be reliably identified as originating from 300 mobile devices. This includes 341,000 sentences written on iPhones alone. Our data enables a richer understanding of how users type “in the wild” on their mobile devices. We compare text and error characteristics of different device types, such as touchscreen phones, phones with physical keyboards, and tablet computers. Using our mined data, we train language models and evaluate these models on mobile test data. A mixture model trained on our mined data, Twitter, blog, and forum data predicts mobile text better than baseline models. Using phone and smartwatch typing data from 135 users, we demonstrate our models improve the recognition accuracy and word predictions of a state-of-the-art touchscreen virtual keyboard decoder. Finally, we make our language models and mined dataset available to other researchers.
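
The abstract does not say how the mixture model combines its component corpora. A common approach, assumed here, is linear interpolation: each corpus gets its own language model, and the mixture weights are tuned by expectation-maximization on held-out text. The sketch below illustrates that idea with toy add-one-smoothed unigram models. The corpora, function names, and the `<unk>` token are all hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical out-of-vocabulary token


def tokenize(sentence, vocab):
    """Split on whitespace and map unknown words to UNK."""
    return [w if w in vocab else UNK for w in sentence.split()]


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary."""
    counts = Counter(w for s in sentences for w in tokenize(s, vocab))
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def mixture_prob(word, models, weights):
    """Linear interpolation: sum of weight_i * P_i(word)."""
    return sum(lam * m[word] for lam, m in zip(weights, models))


def perplexity(sentences, models, weights, vocab):
    """Per-word perplexity of the interpolated model."""
    words = [w for s in sentences for w in tokenize(s, vocab)]
    log_sum = sum(math.log(mixture_prob(w, models, weights)) for w in words)
    return math.exp(-log_sum / len(words))


def estimate_weights(dev_sentences, models, vocab, iterations=20):
    """EM for the mixture weights, starting from uniform weights."""
    words = [w for s in dev_sentences for w in tokenize(s, vocab)]
    weights = [1.0 / len(models)] * len(models)
    for _ in range(iterations):
        totals = [0.0] * len(models)
        for w in words:
            denom = mixture_prob(w, models, weights)
            for i, m in enumerate(models):
                # Posterior probability that component i produced w.
                totals[i] += weights[i] * m[w] / denom
        weights = [t / len(words) for t in totals]
    return weights


# Toy, made-up corpora standing in for mined mobile text and forum text.
mobile = ["omw see u soon", "lol ok see u"]
forum = ["the keyboard driver crashed again", "see the manual for details"]
dev = ["ok see u soon lol"]

vocab = {w for s in mobile + forum for w in s.split()} | {UNK}
models = [train_unigram(mobile, vocab), train_unigram(forum, vocab)]
weights = estimate_weights(dev, models, vocab)

print("weights:", weights)
print("uniform perplexity:", perplexity(dev, models, [0.5, 0.5], vocab))
print("tuned perplexity:", perplexity(dev, models, weights, vocab))
```

Because the held-out sentence resembles the toy mobile corpus, EM shifts weight toward that component, and the tuned mixture scores no worse than the uniform one on the held-out text. Production systems would typically use higher-order n-gram models with stronger smoothing in place of these unigrams.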

*Corresponding author. Email: vertanen@mtu.edu
