
WELCOME

Dear Visitor,

Welcome to my website!

My portfolio showcases a range of projects I've been involved in, demonstrating my passion for data analysis and the insights it can uncover. Among these projects are:


1. "Harry Potter and the Deathly Hallows Part One Sentiment Analysis and Social Network Relationship" explores the sentiments and interpersonal dynamics within the famous Harry Potter franchise.

2. "Data Analytics Project with Yellow Dig": A collaborative project that applies data analysis strategies to real-world situations.

3. "Analytics Consulting Project with PWC": An in-depth consulting project with one of the most respected firms in the world, where I applied my data analytics skills to solve complex problems related to the U.S. population's Covid-19  vaccination intention.

4. "2021 League of Legends Linear Regression Analysis": A deep dive into the statistics of one of the most popular online games, using linear regression models to predict game outcomes.

5. "Analysis of housing market in Two US cities From 2020 to 2022.": My most recent project, where I analyzed the housing market trends in two major US cities over a two-year period

I firmly believe that we live in an era saturated with data, yet our ability to transform this data into actionable information lags behind. This is where my work as a data analytics professional comes in. By employing sophisticated analysis methods, I aim to bridge this gap, thereby enhancing decision-making processes, improving project outcomes, and reducing project costs.

Feel free to explore, and I welcome any feedback that could help me improve.

Best Regards,

Luke

Harry Potter and the Deathly Hallows Part One Sentiment Analysis and Social Network Relationship

Harry Potter holds a special place in my heart, as it was the first Western novel I ever read. I can still vividly recall those long summer days lying next to my grandfather, engrossed in the pages of this enchanting tale.

The magical world crafted by J.K. Rowling captivated my young imagination, transporting me to a place far beyond the confines of my everyday life.

I still recall the mixed feelings I experienced when I first watched the inaugural movie in the series. I was entranced, yet also deeply disappointed. I remember asking my mom, with a heart heavy with longing, "Why didn't they pick me to go to Hogwarts?" This poignant memory of my childhood continues to resonate with me.

The inspiration for this project came from Professor Murugan Anandarajan, affectionately known as Doctor A. He tasked us with undertaking a Natural Language Processing (NLP) project related to the Harry Potter movies. Our group was assigned the penultimate film in the series, 'Harry Potter and the Deathly Hallows Part One', which we all thoroughly enjoyed.

Capitalizing on our NLP skills and enthusiasm for the film, we decided to use technology to streamline the role-playing game (RPG) development process. Our ultimate goal was to create a data-driven approach to game design.

Our first objective was to identify the key characters using social network analysis. Next, we sought to comprehend the overall mood and tone of the game using sentiment analysis. With this information, we aimed to create dialogues for these characters. The choice of words, tone of the dialogues, and character interactions were all designed based on word frequency and sentiment analysis.

By integrating these elements, we strived to create a more immersive and engaging gaming experience, one that would truly capture the essence of 'Harry Potter and the Deathly Hallows Part One'.

Social Network Analysis

Social Network Analysis (SNA) is a methodological approach that focuses on understanding the relationships among social entities and the patterns and implications these relationships create. It serves as a tool to map and measure the interactions and flows between various entities, such as people, groups, organizations, computers, URLs, and other connected information or knowledge entities.

Three crucial components come into play when visualizing networks: nodes, lines, and layout. In the context of lines, networks can be either directed or undirected. In undirected networks, the lines are termed 'edges.' For instance, an edge would be appropriate for a study analyzing the relationship between Harry and Hermione from the Harry Potter series. This is because their interaction in the narrative is mutual, and no directed line is needed to represent a one-way interaction.

In contrast, directed networks feature 'arcs,' which are essentially lines with arrows indicating the directionality of the tie. The concept of reciprocity of the tie becomes an important feature in these networks. Consider a hypothetical study on children's friendships: understanding who nominates whom as a friend and whether this nomination is reciprocated could provide significant insight into the nature of these relationships. A friendship is more likely to be mutually recognized when the nomination is reciprocated. By paying close attention to these details, SNA can provide a deeper understanding of the dynamics within social networks.
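
To make the edge/arc distinction concrete, here is a minimal Python sketch using networkx (my library choice for illustration; the names and ties are made up):

```python
# Undirected vs. directed ties in networkx; toy data for illustration only.
import networkx as nx

# Undirected network: the mutual tie between Harry and Hermione is one edge.
g = nx.Graph()
g.add_edge("Harry", "Hermione")

# Directed network: arcs record who nominates whom, so reciprocity is visible.
d = nx.DiGraph()
d.add_edge("Alice", "Bob")   # Alice nominates Bob as a friend
d.add_edge("Bob", "Alice")   # Bob reciprocates the nomination

print(nx.reciprocity(d))     # 1.0 -> every arc in this tiny network is reciprocated
```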

In this graph, I've used the level of reciprocation to determine the size of the 'Evil' network. As we can see, Lord Voldemort stands as the largest node, reflecting his pivotal role in this network.

Lord Voldemort's request for the Elder Wand, which leads him to interact with Gregorovitch, Ollivander, and Grindelwald, significantly highlights his dominance and central role in the story. His connections with these key characters expand his network and emphasize his pivotal role and significant influence within the unfolding narrative. These interactions illustrate how his actions and decisions propel the storyline, further reinforcing his centrality and importance to the plot.

A cluster in a social network refers to a group of nodes that are more interconnected among themselves than with other nodes in the network. These clusters can provide valuable insights about collections of individuals and the dynamics of their interactions.

One important technique used in the study of clusters is community detection, which aims to identify such tightly-knit groups within a larger network. This can help us understand how different sub-networks or communities interact with each other and whether there are distinct communities within the larger network. In this graph, the green circle in the left corner represents the Weasley family.

While clustering provides significant insights, it's important to recognize its limitations. For instance, consider the character Scabior from the Harry Potter series. In the first part of 'Deathly Hallows', Scabior captures Harry, Hermione, and Ron and takes them to Bellatrix's house. Though he plays an antagonistic role, the clustering algorithm could potentially misidentify him as part of the 'good' group due to his frequent interactions with the main characters. This highlights the importance of considering the context and nature of interactions when interpreting cluster analysis results.

Another critical application of clustering lies in community detection. Let's consider the study of communicable diseases. Here, it's essential to identify individuals connected to many others, say 20 people. These individuals, often called 'super spreaders', can significantly impact the disease's spread due to their extensive network of contacts. Identifying and examining these clusters can help us better understand disease transmission patterns and devise more effective containment strategies. In our graph, we need to pay the most attention to Harry, Hermione, Ron, and Lord Voldemort. This is a prime example of how the principles of Social Network Analysis can be applied to the healthcare industry.


Continuing our exploration of communicable diseases using social network analysis, we can use clusters to identify distinct groups within a community and pinpoint the most influential individuals within each cluster. However, to identify individuals who connect different subgroups, we must consider a concept called "betweenness centrality."

Betweenness centrality is a measure of a node's role as a bridge between other nodes. In simple terms, it quantifies how often a node serves as a link, or 'bridge,' connecting various parts of the network. If we look at the 'good characters' network from Harry Potter, Harry has the highest betweenness centrality. This means he often serves as a bridge, connecting different groups of characters. People with high betweenness centrality tend to unify the network.
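
The measure itself is easy to compute once a graph is built. The toy graph below is hypothetical, not our actual character network, but it shows why a bridging node scores highest:

```python
# Betweenness centrality with networkx; illustrative mini-network.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Ron", "Harry"), ("Hermione", "Harry"),    # one friend group
    ("Luna", "Neville"), ("Neville", "Harry"),  # a second group reached via Harry
])

bc = nx.betweenness_centrality(g)
print(max(bc, key=bc.get))  # 'Harry' -- he sits on the most shortest paths
```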

Relating back to our disease control example, we should pay particular attention to individuals who connect numerous groups. These individuals can serve as significant transmission points for diseases, given their extensive connections. Remember, different people hold importance for various reasons within a network, so our understanding is often a composite picture, a blend of different measures and perspectives.

In the context of a network:

1. A "hub" is a node that has many outbound links; it points to many other nodes. In a social network, a person who follows many others would be considered a hub. In the context of web pages, a page that contains many links to other pages is a hub.

2. An " authority" is a node with many inbound links; many other nodes point to it. In a social network, a person who is followed by many others is an authority. In terms of web pages, a page that is linked to by many others is an authority.

Eigenvector centrality is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.

In simpler terms, eigenvector centrality doesn't just look at the number of connections a node has (as in the case of degree centrality), but it also considers the quality of these connections. If a node is connected to others that have a high degree of connections themselves, then the node in question is considered to be more central.

A well-known application of this concept is Google's PageRank algorithm. The web pages you see in your search results aren't ranked solely based on the number of other pages linked to them but also on the quality (importance) of those linked pages.
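
Both measures are one-liners in networkx; the hub-and-spoke graph below is purely illustrative:

```python
# Eigenvector centrality and PageRank with networkx; toy graph.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("Hub", n) for n in ("A", "B", "C", "D")])
g.add_edge("A", "Sidekick")  # one tie only, but it reaches a well-connected node

ec = nx.eigenvector_centrality(g)   # rewards connections to high-scoring nodes
pr = nx.pagerank(g)                 # the same idea, as used for ranking web pages
print(sorted(ec, key=ec.get, reverse=True)[:3])
```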

If we come back to our story: after calculating eigenvector centrality, I found some interesting results. Beyond the three main characters, three others play a notably important role by this measure: Mundungus, Lovegood, and Scrimgeour. With these findings in hand, I went back and rewatched the movie.

Mundungus Fletcher

Mundungus enters the scene early in the film, joining five other brave members of the Order of the Phoenix. Their mission is paramount: to ensure Harry Potter's safe transportation to a secure location. Further into the movie, Mundungus proves instrumental once again, providing a crucial hint that reveals the location of the first Horcrux. Thus, despite his sometimes dubious character, Mundungus plays a vital role in the unfolding events.

Xenophilius Lovegood

When Harry, Ron, and Hermione visited him, Xenophilius alerted the Death Eaters, hoping they would return his daughter Luna in exchange. While waiting for the Death Eaters' arrival, Xenophilius told the trio about the Deathly Hallows.

Scrimgeour

The Minister of Magic, Scrimgeour, bombarded the trio with questions, attempting to discern Dumbledore's intentions, and was highly suspicious of the gifts he had left them: the Deluminator for Ron, a Golden Snitch for Harry, and an original copy of The Tales of Beedle the Bard for Hermione.

The R tools we used include the wordcloud package along with the NRC, Bing, Loughran, and AFINN lexicons, which are renowned resources commonly employed in Natural Language Processing (NLP).

The NRC, Bing, Loughran, and AFINN lexicons serve a similar purpose: they map English words onto fundamental emotions and sentiments. The NRC lexicon, for instance, categorizes words according to eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as two overarching sentiments: negative and positive.

For instance, the word "happy" is linked to positive emotions such as joy, while "murder" is connected with negative emotions like anger, fear, and sadness.

These lexicons are particularly useful in emotion and sentiment analysis. They facilitate the detection of sentiments and emotions in text, going beyond a simple dichotomy of positive and negative sentiment. By interpreting the text a person writes or the dialogue they engage in, these tools can help discern the intensity and type of emotions evoked. It's important to note, however, that while these lexicons offer substantial insight, they may not fully capture nuances, irony, or context-dependent sentiment in text.

I conducted a sentiment analysis on each line of the script, averaging the scores and presenting them in a bar chart. This gives us a general sense of the emotional trajectory of the entire film.
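
Our actual pipeline used the R lexicons described above; purely as an illustration of the per-line workflow, here is a sketch with NLTK's VADER analyzer standing in for those lexicons (the script lines shown are placeholders):

```python
# Per-line sentiment scoring, then averaging -- a Python analog of our R workflow.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

script_lines = [
    "We have to trust him.",       # placeholder lines; the real input is the script
    "He's gone. Dobby is dead.",
]
scores = [sia.polarity_scores(line)["compound"] for line in script_lines]
print(sum(scores) / len(scores))   # the average plotted in the bar chart
```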

Notably, there's a significant dip in sentiment scores around the midpoint of the movie, coinciding with Ron's departure from the team. The sentiment score then markedly increases when they successfully destroy the first Horcrux. However, there is a downward trend towards the end of the film, particularly with Dobby's death, reflecting the emotional weight of that scene. This analysis visually illustrates the highs and lows of the film's narrative as perceived through sentiment scores.


The 'wordcloud' package in R is a popular tool for creating word clouds. A word cloud (or tag cloud) is a visual representation of text data where the size of each word indicates its frequency or importance in the source data.

1. Frequency Representation: The primary use of the 'word cloud' package is to represent the frequency of words in a text dataset visually. 

2. Customization: You can customize various aspects of the word cloud, such as the maximum number of words to be displayed, the color of the words, the shape and layout of the cloud, and more.

3. Comparison Clouds: The package also provides a "comparison.cloud" function, which can be used to create a word cloud comparing word frequencies across different groups in the data.
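
We built our clouds in R, but the idea ports directly; here is a hedged Python sketch using the third-party 'wordcloud' library (the input text is a stand-in for the script):

```python
# Word frequency -> word size; minimal word cloud sketch.
from wordcloud import WordCloud

text = "magic wand horcrux magic wand magic"  # stand-in for the script text
wc = WordCloud(max_words=50, background_color="white").generate(text)
wc.to_file("script_cloud.png")                # writes the rendered cloud to disk
```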

Next, we performed an analysis of key character interactions using the Bing lexicon and sentiment analysis packages.

Our analysis revealed that approximately 86% of the words were neutral, with 7% each being classified as positive and negative.

When we examined individual characters, we found Harry's dialogue leaned towards positivity, while Hermione and Ron's dialogues veered towards negativity.

Interestingly, our character interaction analysis showed a positive sentiment for the Harry-Hermione and Hermione-Ron dynamics. However, the sentiment was negative for the Harry-Ron interactions.

Overall, the script maintains a balanced distribution of negative and positive sentiments, indicating a well-rounded emotional trajectory throughout the narrative.

By integrating social network analysis, sentiment scoring, and word frequency analysis, we can drive game development in a more data-oriented manner. Taking Harry Potter as an instance, the concept of data stitching enables us to determine the most fitting words and tone Harry should use in specific scenarios within the game.

Let's consider an early-game scenario where all characters are gathering to escape to a safe house, and Harry is conversing with Hermione.

Given that the overall sentiment at the beginning of the game leans slightly positive, and considering his interactions with Hermione are positive, we can inform his dialogue accordingly.

In this context, words such as "Good", "Brilliant", "Great", "Thank", and "Promise" could be suggested for Harry's dialogue with Hermione, delivered in a tone of positivity or hope. This data-driven approach ensures character interactions are consistent with the sentiment and tone established in the original narrative.


As previously discussed, social network analysis can be instrumental in controlling community-level infections. Furthermore, we can also harness Natural Language Processing (NLP) technologies in the healthcare sector.

1. Sentiment Analysis: We could perform sentiment analysis on Electronic Health Records (EHRs). For example, patient correspondence or notes might reflect distress, satisfaction, or other emotions relevant to care.

2. Named Entity Recognition (NER): NER can be extremely useful for EHRs. It can help identify and categorize specific entities in the text, such as disease names, medication names, and symptoms. This can help convert unstructured data into a structured format that can be more easily analyzed (see the sketch after this list).

3. Holistic Patient View: By combining Social Network Analysis, Sentiment Analysis, and Time Series Analysis, we can acquire a holistic understanding of a patient's treatment journey within the hospital. This comprehensive view can lead to more personalized care and improved patient outcomes.
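
As a sketch of point 2, here is what generic NER looks like with spaCy's small English model; note that clinical entities (diseases, drugs) would realistically need a domain model such as scispaCy, and the example sentence below is invented:

```python
# Generic named entity recognition with spaCy; illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Patient reports chest pain after starting 20 mg of Lipitor in Boston.")

for ent in doc.ents:
    print(ent.text, ent.label_)     # e.g. 'Boston' -> GPE; drug names need a clinical model
```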

 

Project 1

Data Analytics Consultant - Yellowdig

Yellowdig is a modern online learning platform where students can interact with peers by posting questions and discussions. My role was to provide data analytics consultation for this project.

The goal of the project was to identify key metrics that measure the health of the platform. We wanted to evaluate student engagement in the community and provide recommendations to improve engagement.

Additionally, we wanted to segment partners based on important features of client relationships and project management.


Multiple Linear Regression Model

I used a two-step process to identify the most critical features that impact our target variable. The first step was developing a multiple linear regression model to narrow down the feature selection.

In this step, I faced two problems. First, some of our variables were highly correlated, so I used the Variance Inflation Factor (VIF) to identify and remove them. Second, I found that the data were skewed and heavy-tailed, so I applied a log transformation to the highly skewed variables.

My model produced MSE, MAE, and RMSE values of 0.074, 0.211, and 0.273, and captured a high R-squared value of 0.922.
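
The two fixes above, VIF screening and log transformation, can be sketched as follows in Python (we worked in R; the statsmodels-based helper and the column name below are illustrative):

```python
# Iterative VIF-based feature pruning plus a log transform for skewed features.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Repeatedly drop the feature with the largest VIF until all fall below threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=[vifs.idxmax()])
    return X

# Log-transform a highly skewed feature (log1p is safe at zero):
# df["word_count"] = np.log1p(df["word_count"])   # hypothetical column name
```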

Decision Tree Model

Afterwards, I fed the subset of features into a Decision Tree model to understand what contributes to the Total Health Score variable.

I binarised the Total Health Score variable by assigning 0 to any observation with a Total Health Score less than 100 and 1 to any observation with a Total Health Score greater than 100.

I created the training and testing datasets based on an 80/20 split, then used cross-validation to find the best complexity parameter: a grid search over the complexity parameter (cp) from 0 to 0.05 in steps of 0.005, evaluated with 10-fold cross-validation repeated three times.

As shown in the figure, a cp of 0.005 produced the optimal model. The essential variables in the predictive model are Comment Count, followed by Word Count, Post View Count, and Reaction Count. This model achieves 87% accuracy on the training dataset and 86% on the testing dataset.
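
For readers who work in Python rather than R, the same tuning loop can be approximated with scikit-learn, where the cost-complexity parameter ccp_alpha plays a role analogous to rpart's cp (this is a sketch, not our original code):

```python
# Grid search over tree complexity with repeated 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

grid = {"ccp_alpha": np.linspace(0.0, 0.05, 11)}  # 0 to 0.05 in steps of 0.005
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=42), grid, cv=cv)
# search.fit(X_train, y_train)   # X_train/y_train come from the 80/20 split
# print(search.best_params_)     # the winning complexity parameter
```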

Cluster Analysis 

To better understand how industries and education types should inform marketing decisions, and how Yellowdig employees can best be assigned to partners to capture and maintain business, I employed Cluster Analysis on both the Closed Won Clients and Clients Pipeline datasets to gather insights from both results.

Due to the nature of both datasets, I chose to use K-medoids for clustering, which can handle categorical and numerical variables simultaneously. Additionally, I cleaned the dataset using mean imputation to replace NA values and the Yeo-Johnson method to standardise the data.

I used the silhouette method to find the k-value. In K-medoids cluster analysis, the silhouette method visualises the average cluster quality in a graph; I chose the optimal number of clusters k by selecting the value that maximises the average silhouette over a range of candidate values for k.
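
The k-selection step looks roughly like this in Python, with the third-party 'gower' package handling the mixed-type distance and scikit-learn-extra providing K-medoids (both stand in for the R tooling we actually used; the data frame is a toy):

```python
# K-medoids over a Gower distance matrix, scored by average silhouette.
import pandas as pd
import gower
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

df = pd.DataFrame({                    # toy stand-in for the client data
    "industry":  ["edu", "edu", "k12", "k12", "corp", "corp"],
    "deal_size": [10.0, 12.0, 3.0, 2.5, 30.0, 28.0],
})
dist = gower.gower_matrix(df)          # handles categorical + numeric columns together

for k in range(2, 5):
    km = KMedoids(n_clusters=k, metric="precomputed", random_state=42).fit(dist)
    print(k, silhouette_score(dist, km.labels_, metric="precomputed"))
```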

Interesting Findings and Impact

Our analysis generated insights that we were able to turn into business recommendations. Previously, Yellowdig only considered college students as its core customer group. However, the analysis showed that Yellowdig also receives decent traffic from K-12 institutions in the Northeast. We advised Yellowdig to pay attention to this newly emerged user group and design a new service offering for its needs.


Project 2

Analytics Consulting - PricewaterhouseCoopers

With data provided by the United States Census Bureau, I examined the impact that COVID-19 had on individuals, households, industries, and the USA in general. The stakeholders for this research include PricewaterhouseCoopers and its employees, PwC's clients, and all parties involved in its activities.

Goal of Project

The project's first goal was to conduct a descriptive analysis depicting the impact of COVID-19 on childcare, education, employment, food security, health, housing, and related spending.

The second goal was to build a predictive model to classify the vaccination intent of the US population. A sample from January 2021 to July 2021 was taken for this project to ensure an accurate representation of the researched data.

Descriptive analysis

For the descriptive analysis, I used 136 variables (114 of them categorical) and 884,695 observations.

- 59.5% female and 84% Caucasian

- Education: 13.6% high school or lower, 60.8% at least some college, 18.4% graduate education.

- Marital Status: 18.4% never married, 58.3% married, 22.5% widowed, separated, or divorced

- The median birth year was 1966

- Households contain an average of 2.72 people

I found that those not vaccinated tended to be younger, lower-income, and less educated.

I also found that people's vaccination intent changes over time: from week 22 to week 33, more people began choosing to refuse vaccination.

I also applied clustering in the descriptive analysis, using Ward's method. For cluster validation, I computed a cophenetic correlation of 0.39.

The results show that people in cluster 2 experienced the most severe unemployment and mental disorders. 
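
For reference, the Ward-plus-cophenetic workflow can be sketched with scipy (random placeholder data; our real input was the preprocessed survey features):

```python
# Hierarchical clustering with Ward's method, validated via cophenetic correlation.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 4))  # placeholder feature matrix
Z = linkage(X, method="ward")

coph_corr, _ = cophenet(Z, pdist(X))  # how faithfully the tree preserves distances
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(round(coph_corr, 2), np.bincount(labels))
```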


Predictive analysis

Data Transformation 

Target Variable: Vaccine Intent

- Filtered to exclude vaccinated participants

- Filtered to exclude missing values

- Follow-up questions to the target variable removed to avoid potential bias

Dimensionality Reduction:

- Removed survey questions unique to Phase 3 or 3.1

- Removed conditional survey questions

- Removed irrelevant variables related to vaccination

- Removed rows with more than 20% missing values

- Applied downsampling to correct for class imbalance

Dimensions: 119 Variables and 374,908 Respondents

- Numerical: 8

- Nominal: 96

- Ordinal: 16

A decision tree is one of the most popular machine learning algorithms. I split my dataset 70/30 and found that it had a class imbalance problem. I used five-fold cross-validation repeated three times, and to address the imbalance I downsampled the training dataset by random sampling so that all classes had the same frequency.
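
The downsampling step can be sketched in a few lines of pandas (the frame and column names below are hypothetical):

```python
# Randomly downsample every class to the size of the rarest class.
import pandas as pd

def downsample(train: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    n_min = train[target].value_counts().min()
    return train.groupby(target, group_keys=False).sample(n=n_min, random_state=seed)

# balanced = downsample(train_df, target="vaccine_intent")  # hypothetical names
```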

Decision Trees are notorious for overfitting the training dataset. I applied hyperparameter tuning to simplify the process of finding a smaller tree that improves accuracy on the testing dataset.

- Cross-validation using ten folds, repeated three times

- Grid search from 0 to 0.05 for the optimal complexity parameter (cp)

     - A cp of 0.0005 produced the highest accuracy.

Overall accuracy: 68%

Random Forest is a supervised machine learning algorithm based on an ensemble of decision trees, usually trained with the "bagging" method. I split the dataset 70/30 and used RandomizedSearchCV from sklearn to optimize my hyperparameters, focusing on three: n_estimators, max_features, and max_depth.

The best parameters were: n_estimators = 600, max_features = 'sqrt', max_depth = 300.

My overall accuracy score improved from 0.793 to 0.829.
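
The search itself looks roughly like this (a sketch built around the parameters named above; the candidate values are assumptions that merely bracket the tuned result):

```python
# Randomized hyperparameter search for the random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [200, 400, 600, 800],
    "max_features": ["sqrt", "log2"],
    "max_depth": [100, 200, 300, None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    random_state=42,
)
# search.fit(X_train, y_train)   # 70/30 split as described above
# print(search.best_params_)     # n_estimators=600, max_features='sqrt', max_depth=300
```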

The model also highlighted several important factors that influence vaccination intent:

- Education level

- Age

- Income

Model Selection and Engineering

Decision Tree

Random Forest 


Recommendations

- A vaccination education project

- Employee retention and recruitment strategies

- Mental health public service campaigns

- Government contractor projects that assist with public health initiatives, such as food security

Project 3

Esports & 2021 League of Legends Regression Analysis

The Esports industry has seen tremendous growth in both viewership and revenue. The global esports market was estimated at USD 1.48 billion in 2020 and is expected to reach USD 6.81 billion in 2027. Total esports viewership is expected to grow at a 9% compound annual growth rate between 2019 and 2023, up from 454 million in 2019 to 646 million in 2023.

I still remember going to Madison Square Garden in New York with friends in 2016 to watch the League of Legends semi-finals, SKT vs ROX Tigers. I felt the superb level of the players and the boundless enthusiasm of the fans. At the time, I thought it marked the start of a new era.

I first analysed the entire market and found that Multiplayer Online Battle Arena (MOBA) games are far ahead in terms of both popularity and prize money for professional players.


2021 League of Legends Regression Analysis

League of Legends (LoL) has been a staple of the esports community since it was first released back in 2009. One of the original multiplayer online battle arena games, it has generated an astonishing $10.18 billion in revenue from 2015 to 2020.

Given the massive revenue generation and the huge audiences the game sees across digital and traditional entertainment platforms, it is safe to say that League of Legends is likely the fastest-growing sport globally, with the potential to overtake more traditional sports in the coming years. A recent study from Nielsen showed that the League of Legends European Championship had a higher average minute audience than basketball, tennis, and rugby among people aged between 16 and 29 (Church, 2021).

This study uses 2021 League of Legends match data to judge which factors are most significant in determining match results. I mainly focus on the effects of gameplay and decisions on the Kill/Death/Assist ratio (KDA), which is believed to be the most significant predictor of results. Overall, I am trying to develop a model in which KDA provides the essential ability to predict a match winner at the team level.

As stated above, the data source for this study is the 2021 League of Legends match data, sourced from Oracle's Elixir.

The total dataset contains 123 variables with 147,924 observations. The dataset includes Kills, Deaths, and Assists, but not KDA as a metric, so the KDA metric was created using the formula (Kills + Assists) / Deaths. Following the creation of the Y-variable, a logistic regression of match result on KDA was run; the p-value was significant, supporting KDA as a credible basis for the analysis. We dropped variables we didn't need, leaving around 52 variables to build the model.
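
A compressed sketch of that construction and screening step (the rows below are made up; the real input is the Oracle's Elixir export, and scikit-learn stands in for whichever regression routine was used):

```python
# Build the KDA feature, then screen it with a logistic regression on result.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({                 # toy rows in place of the match export
    "kills":   [20, 5, 15, 3, 4, 12],
    "deaths":  [4, 16, 6, 18, 10, 5],
    "assists": [40, 10, 30, 6, 8, 20],
    "result":  [1, 0, 1, 0, 1, 0],
})
df["kda"] = (df["kills"] + df["assists"]) / df["deaths"].clip(lower=1)  # avoid /0

clf = LogisticRegression().fit(df[["kda"]], df["result"])
print(clf.coef_)                    # direction and size of the KDA effect on winning
```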


VARIABLE SELECTION

Based on my gaming experience, I reduced the dimensionality of the data by selecting certain related variables. Instead of focusing on individual players, I chose to work on the team result. As discussed above, kills, deaths, and assists play the most critical roles in deciding the winner of a game.

To further reduce the dimensionality of the dataset, we chose to drop some variables: game length, total gold, killsat15, deathsat15, earnedgold, infernals, goldat15, xpat15, golddiffat15, xpdiffat15, opp_deathsat15, visionscore, goldspend, heralds, mountains, result, dragons, towers, opp_towers, wardsplaced, minionkills, opp_goldat15. This decision was based on an initial VIF run on the entire dataset: variables with a VIF over 5 were removed, dropping the total number of variables from the initial 51 down to a more manageable 27.


MODEL SELECTION 

The first model was run with the 18 remaining independent variables, yielding an R-squared of 0.6093 and an adjusted R-squared of 0.6087 for the full model.

Next, I wanted to verify the impact of making the playoffs. I ran the analysis on two additional datasets using the pre-processed variables, giving me three total models for out-of-sample testing, in an attempt to drill down into the indicators of KDA and, in turn, the effect of KDA on match results.

Model 2 and Model 3 were created using the preprocessed variables from before, but with the data split into two groups: the first (Model 2) only included teams that made the playoffs, and the second (Model 3) only included teams that did not. This analysis aimed to see whether separating teams by a measure of ability could yield a model with more robust predictive power based on KDA.


RESULTS

After selecting the three models, out-of-sample testing was performed on each, with the data split into training and testing sets. Once split, the models went through 10-fold cross-validation to drill deeper into the results.

Model 1 showed the best results, with an R-squared of 0.63 and an MAE of 2.00. This was significantly better than Models 2 and 3, which showed R-squared values of 0.40 and 0.42, RMSE values of 5.28 and 5.15, and MAE values of 2.93 and 2.87.

Once cross-validation was complete, it was time to choose a threshold for turning the predicted KDA values into predicted match results. After analyzing the training datasets, the chosen threshold was 6: if a team had a predicted KDA of 6 or more, they were predicted to win the match.
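
The thresholding rule reduces to a one-liner once predicted KDA values exist (the arrays here are illustrative):

```python
# Predicted KDA >= 6 becomes a predicted win; then score against actual results.
import numpy as np
from sklearn.metrics import accuracy_score

y_pred_kda = np.array([7.2, 3.1, 6.5, 2.0])  # illustrative regression outputs
y_true     = np.array([1, 0, 1, 0])          # actual match results

pred_win = (y_pred_kda >= 6).astype(int)
print(accuracy_score(y_true, pred_win))
```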

Unsurprisingly, Model 1 showed the best prediction accuracy at 0.89, while Model 3 was second best at 0.88 and Model 2 was the worst at 0.83.

Model 1 Accuracy Results


Project 4

Analysis of the Housing Market in Two US Cities from 2020-2022

Since the start of the COVID-19 pandemic in 2020, real estate pricing has been quite volatile. The initial lockdowns led to a significant drop in property values across the country, but shortly afterwards a considerable spike began that has continued up until now. This project implements different machine learning methods to explore the U.S. housing market.

The motivation behind this project comes from the team members: we are all graduate students interested in purchasing real estate in the future, and we would like to know what to expect from this market. It could also be helpful for banks that give out home loans, mortgage companies, and escrow companies, as well as anyone interested in purchasing a home.

- Several datasets were chosen for this project. The first dataset originally came from www.realtor.com, a real estate listing website operated by the News Corp subsidiary Move, Inc. and based in Santa Clara, California. This dataset contains properties from across the United States, but for the context of this project we filtered it down to properties in the two states of interest.

- Loan-level Public-Use Databases (PUDBs) are datasets that contain information on individual loans that are insured or guaranteed by the Federal Housing Administration (FHA), the Department of Veterans Affairs (VA), or the US Department of Agriculture's Rural Housing Service (RHS).

PUDBs are commonly used for research and analysis in the mortgage industry and by policymakers. They can be used to examine mortgage market trends, evaluate government housing programs' effectiveness, and assess the impact of various economic factors on mortgage performance.

- The Rolling Sales Data is a dataset containing information on real estate sales transactions in New York City (NYC). The data is sourced from the NYC Department of Finance's Automated City Register Information System (ACRIS). The dataset includes records of all real estate sales transactions conducted in NYC from January 1, 2010, through December 31, 2020.

The Rolling Sales data is commonly used by real estate professionals, analysts, and researchers to analyze trends in the NYC real estate market, such as changes in property values and the number of sales transactions over time. The dataset can also be used to study the impact of various factors on real estate prices, such as changes in interest rates, the local economy, and demographic trends.

Data We Use
