Senior Project - March 2024
Real Estate Data Regression Model Visualization in R
Overview
For the course Applied Regression Analysis, we were tasked with building a statistical regression model in R on a dataset of our choice. We chose a real estate dataset and set out to understand which factors influence property prices within a specific market. The project ran from data collection through detailed statistical modeling, with the goal of identifying the property features that significantly affect real estate values. By combining exploratory data analysis, assessment of categorical and numerical variables, and regression diagnostics, the project gives a broad view of the market that can guide investment and pricing strategies.
Statistical Tests and Models
Building on the cleaned dataset, various statistical models were developed to quantify the relationship between property prices and their characteristics. The project employed linear regression models, starting with a full model incorporating all predictors, followed by the construction of reduced models through systematic variable selection and transformation. Techniques such as variance inflation factor (VIF) analysis, ANOVA tests, and Box-Cox transformations were applied to refine the models, addressing multicollinearity and improving model fit. This iterative modeling process enabled the identification of significant predictors and their interactions, revealing the nuanced dynamics of real estate pricing.
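The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data are synthetic (generated with set.seed), the choice of which predictors to drop in the reduced model is arbitrary, and VIF is computed from its definition in base R rather than with whatever package the project used. Only the column names (Price, Sqft, Bed, Bath, Age, AC) come from the text.

```r
# Synthetic stand-in data; the project's real dataset is not reproduced here.
set.seed(42)
n <- 200
homes <- data.frame(
  Sqft = round(runif(n, 800, 4000)),
  Age  = round(runif(n, 0, 60)),
  AC   = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Bed and Bath are made to depend on Sqft so that some collinearity exists.
homes$Bed   <- pmax(1, round(homes$Sqft / 700 + rnorm(n, sd = 0.7)))
homes$Bath  <- pmax(1, round(homes$Bed * 0.7 + rnorm(n, sd = 0.5)))
homes$Price <- exp(11 + 0.0004 * homes$Sqft - 0.004 * homes$Age +
                   0.08 * (homes$AC == "Yes") + rnorm(n, sd = 0.15))

# Full model with every predictor, then a reduced model (illustrative choice:
# Bed and Bath dropped).
full    <- lm(Price ~ Sqft + Bed + Bath + Age + AC, data = homes)
reduced <- lm(Price ~ Sqft + Age + AC, data = homes)

# Partial F-test: do the dropped terms explain significant extra variance?
f_test <- anova(reduced, full)
print(f_test)

# VIF from its definition: regress each numeric predictor on all the other
# predictors in the full model and take 1 / (1 - R^2).
preds    <- c("Sqft", "Bed", "Bath", "Age", "AC")
num_vars <- c("Sqft", "Bed", "Bath", "Age")
vif <- sapply(num_vars, function(v) {
  aux <- lm(reformulate(setdiff(preds, v), response = v), data = homes)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif, 2))

# Box-Cox profile likelihood over a grid of lambda values (MASS ships with R).
bc <- MASS::boxcox(reduced, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

A lambda estimate near 0 would point toward a log transformation of Price, and one near 1 toward leaving the response untransformed. VIF values well above roughly 5 to 10 are a common rule-of-thumb signal that a predictor is largely explained by the others.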
Categorical Vs. Numerical
Categorical variables like AC, Pool, and Quality are converted to factors in R, allowing these variables to be appropriately included in linear regression models. The factor() function is used to specify these as categorical data, with contrasts set to define reference levels or order, which is essential for interpreting the model's coefficients related to these variables.
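As a small sketch of the factor handling described above, the following uses a handful of invented rows. The level labels ("No"/"Yes", "Low"/"Medium"/"High") and the choice of reference levels are assumptions made for illustration; the text only names the columns.

```r
# Hypothetical sample rows; column names follow the text, values are invented.
homes <- data.frame(
  AC      = c("Yes", "Yes", "No", "Yes", "No", "Yes"),
  Pool    = c("No", "Yes", "No", "Yes", "No", "No"),
  Quality = c("Medium", "High", "Low", "High", "Medium", "Low")
)

# Convert to factors; the first level listed becomes the reference level.
homes$AC      <- factor(homes$AC,      levels = c("No", "Yes"))
homes$Pool    <- factor(homes$Pool,    levels = c("No", "Yes"))
homes$Quality <- factor(homes$Quality, levels = c("Low", "Medium", "High"))

# Set treatment (dummy) contrasts explicitly, with "Low" as the baseline.
contrasts(homes$Quality) <- contr.treatment(levels(homes$Quality), base = 1)
print(contrasts(homes$Quality))

# The design matrix shows the dummy columns a linear model would receive.
mm <- model.matrix(~ AC + Quality, data = homes)
print(colnames(mm))

# relevel() is an alternative way to pick a different reference level.
quality_med_ref <- relevel(homes$Quality, ref = "Medium")
print(levels(quality_med_ref))
```

With treatment contrasts, each Quality coefficient in a fitted model is read as the difference in expected Price relative to the reference level, holding the other predictors fixed.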
Numerical variables such as Sqft, Bed, Bath, and Age are treated directly as quantitative inputs in the model. These variables are used to explore linear relationships with the dependent variable (Price), and their continuous nature allows for detailed and nuanced analysis of how changes in these variables affect real estate prices.
Insights and Implications
The culmination of the project yielded actionable insights into the real estate market, with clear implications for stakeholders. Key findings included the significant impact of property size, age, and specific amenities on price, alongside the importance of location quality. The project demonstrated the power of statistical analysis in uncovering market trends and informing decision-making processes. For potential investors, real estate professionals, and policymakers, the analysis provided evidence-based guidance for assessing property values and strategizing accordingly. The project underscored the value of data-driven analysis in navigating the complexities of the real estate market, offering a replicable framework for future research and application.
Data Preprocessing and Exploration
In the initial phase, the project focused on preparing and exploring the real estate dataset, which involved reading the data into R, cleaning, and structuring it for analysis. Categorical variables such as air conditioning presence, pool availability, and property quality were meticulously converted into factors, while numerical variables like square footage, age, and number of bedrooms were analyzed for their distribution and impact potential. This stage was crucial for ensuring data quality and integrity, setting a solid foundation for the subsequent analytical process. Visualization techniques, including 3D scatter plots and correlation matrices, provided initial insights into the relationships between variables, highlighting potential predictors of real estate prices.
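The preparation steps above might look something like the sketch below. The file name and the raw coding of the categorical columns are not given in the text, so an inline CSV string stands in for the data file, and the codes (1/0 for AC and Pool, 1 to 3 for Quality with 1 as the highest) are assumptions. All values are invented.

```r
# Inline CSV standing in for the project's data file (name not given).
csv_text <- "Price,Sqft,Bed,Bath,Age,AC,Pool,Quality
210000,1400,3,2,25,1,0,2
340000,2300,4,3,10,1,1,1
185000,1200,2,1,40,0,0,3
420000,2900,5,3,5,1,1,1
265000,1800,3,2,18,0,0,2
300000,,4,2,12,1,0,2
230000,1600,3,2,30,1,0,3
375000,2600,4,3,8,1,1,1"
homes <- read.csv(text = csv_text)

# Cleaning: drop rows with missing values (one row above has no Sqft).
homes <- na.omit(homes)

# Structuring: recode the categorical columns as labeled factors.
# The numeric codes and their labels are assumed for this example.
homes$AC      <- factor(homes$AC,   levels = c(0, 1), labels = c("No", "Yes"))
homes$Pool    <- factor(homes$Pool, levels = c(0, 1), labels = c("No", "Yes"))
homes$Quality <- factor(homes$Quality, levels = c(3, 2, 1),
                        labels = c("Low", "Medium", "High"))
str(homes)

# Exploration: distribution summaries and a correlation matrix for the
# numerical variables.
num_cols <- c("Price", "Sqft", "Bed", "Bath", "Age")
print(summary(homes[, num_cols]))
cor_mat <- cor(homes[, num_cols])
print(round(cor_mat, 2))
```

Calling pairs(homes[, num_cols]) on the same columns gives a base-R scatter plot matrix as a quick visual companion to the correlation matrix; the 3D scatter plots mentioned above would need an additional plotting package.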
Final Visualization:
Real Estate Data: