Describe the house and I will tell you the price: House price prediction with textual description data

Published online by Cambridge University Press: 18 July 2023

Hanxiang Zhang
Affiliation:
School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada
Yansong Li
Affiliation:
School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada
Paula Branco*
Affiliation:
School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada
*Corresponding author: Paula Branco; Email: pbranco@uottawa.ca

Abstract

House price prediction is an important problem that could benefit home buyers and sellers. Traditional models for house price prediction use numerical attributes, such as the number of rooms, but disregard the house description text. Recent developments in text processing suggest that such descriptions can be valuable attributes, which motivated us to use them. This paper focuses on the house asking/advertising price and studies the impact of using house description texts to predict the final house price. To achieve this, we collected a large and diverse set of attributes on house postings, including the house advertising price. We then compare the performance of three scenarios: using only the house description, using only numeric attributes, or using both. We processed the description text through three word embedding techniques: TF-IDF, Word2Vec, and BERT. Four regression algorithms are trained on each of the three input types. Our results show that using exclusively the description data with Word2Vec and a deep learning model achieves good performance: an $R^2$ of 0.7904 on the testing data. This indicates that the house description text alone is a strong predictor of the house price. However, in terms of RMSE on the test data, the best model was gradient boosting using both numeric and description data. Overall, we observe that combining the textual and non-textual features improves the learned model and provides performance benefits over using only one of the feature types. We also provide a freely available application for house price prediction that is based solely on a house text description and uses our final model, built with Word2Vec and deep learning, to predict the house price.
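The two evaluation metrics named in the abstract, $R^2$ and RMSE, can be illustrated with a minimal sketch. It uses the standard textbook definitions and is not taken from the authors' code; the function names and the sample prices are hypothetical and serve only as an illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted prices."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical asking prices and model predictions (illustration only,
# not data from the paper).
actual = [500_000, 650_000, 800_000, 1_200_000]
predicted = [520_000, 600_000, 850_000, 1_100_000]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"R^2:  {r2_score(actual, predicted):.4f}")
```

RMSE is expressed in the same unit as the price, while $R^2$ is unitless and measures the share of price variance explained by the model, which is why the two are usually reported together.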

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Example of the description text of a house listing obtained from the real estate listings.

Table 1. The 10 top non-textual description features and corresponding F-values.

Figure 2. Overall house price distribution across all five cities.

Table 2. Main characteristics of the numeric features among the top 10 most important features determined in Table 1.

Table 3. Main characteristics of the non-numeric features among the top 10 most important features determined in Table 1.

Figure 3. Quantile–Quantile plot of the house price distribution.

Figure 4. House price distribution by city (top left: Toronto; top right: Ottawa; middle left: Hamilton; middle right: Mississauga; bottom: Brampton).

Figure 5. Overall distribution of feature “bathroomTotal.”

Figure 6. Overall distribution of feature “totalParkingSpaces.”

Figure 7. Overall distribution of feature “bedroomAboveGrade.”

Figure 8. Word Cloud generated using the textual description data from the 5% cheapest houses.

Figure 9. Word Cloud generated using the textual description data from the 5% most expensive houses.

Figure 10. Word Cloud generated using the textual description data from the houses with average price.

Table 4. Word2Vec and GloVe model results with the mean value of 10-fold cross-validation over four runs.

Figure 11. Nemenyi post hoc test of word embedding models with the mean of four repetitions of 10-fold cross-validation.

Table 5. Grid search parameters and values.

Table 6. Best $R^2$ scores for all algorithms.

Table 7. Best $\text{RMSE}$ scores for all algorithms.

Figure 12. Nemenyi post hoc test of $R^2$ scores for all algorithms with the mean of four repetitions of 10-fold cross-validation.

Figure 13. Nemenyi post hoc test of $\text{RMSE}$ scores for all algorithms with the mean of four repetitions of 10-fold cross-validation.

Table 8. Best algorithms and the best parameters for all three input data types.

Table 9. Result for final models.

Figure 14. Web application initial page (left) and example of a price prediction for a 1-bedroom condo (right), using the developed App.

Figure 15. Price prediction for a 1-bedroom condo in Ottawa (left), and price prediction for a 3-bedroom house in Ottawa near the Ottawa River (right), using the developed App.

Table A1. Detailed results of the algorithms and respective best parameter settings found through the grid search.

Figure A1. Nemenyi post hoc test of $R^2$ scores for non-textual description features.

Figure A2. Nemenyi post hoc test of $R^2$ scores for textual description features.

Figure A3. Nemenyi post hoc test of $R^2$ scores for all features.

Figure A4. Nemenyi post hoc test of $\text{RMSE}$ scores for non-textual description features.

Figure A5. Nemenyi post hoc test of $\text{RMSE}$ scores for textual description features.

Figure A6. Nemenyi post hoc test of $\text{RMSE}$ scores for all features.