## A simple journey through analytics, from problem statement to solution

“Let us say that you want to purchase a house in a locality as an investment and rent it out. You would like to understand how much rent you can expect from this investment.”

You walk around the locality, randomly knock on various doors and talk to the residents. You are doing **simple random sampling**. You realize that within the same locality, house rents vary depending on the direction from the center or the distance from the main roads. There are 30 roads/streets in this locality and on average 300 houses per street. You then decide to go to every 3rd street and talk to every 10th house. You end up with a sample size of 300 houses (10 streets * 30 houses/street). This is called **systematic random sampling**. You then realize that there are independent houses, large apartment complexes and small apartment complexes. You make sure that each of these residence types is covered in your sample, for example by ensuring that in each street at least 10 houses come from large apartment complexes and at least 10 from small ones. You have now done **stratified sampling**.
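
As a rough sketch, the three sampling schemes can be tried out on a made-up locality of 30 streets with 300 houses each. The residence types are assigned at random and the field names (`street`, `number`, `type`) are assumptions for illustration only. The stratified step is simplified to a fixed quota per residence type across the whole locality rather than per street.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

STREETS = 30
HOUSES_PER_STREET = 300
TYPES = ["independent", "small_complex", "large_complex"]  # hypothetical labels

# Hypothetical locality: one record per house, with a randomly assigned type.
locality = [
    {"street": s, "number": h, "type": random.choice(TYPES)}
    for s in range(1, STREETS + 1)
    for h in range(1, HOUSES_PER_STREET + 1)
]

# Simple random sampling: every house has the same chance of being picked.
simple_sample = random.sample(locality, 300)

# Systematic sampling: every 3rd street, and every 10th house on that street.
systematic_sample = [
    house for house in locality
    if house["street"] % 3 == 0 and house["number"] % 10 == 0
]


def stratified(population, key, per_stratum):
    """Draw the same number of items at random from each group (stratum)."""
    strata = {}
    for item in population:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, per_stratum))
    return sample


# Stratified sampling: a quota of 100 houses from each residence type.
stratified_sample = stratified(locality, "type", 100)

print(len(simple_sample), len(systematic_sample), len(stratified_sample))
```

All three samples contain 300 houses, but only the stratified one guarantees how many houses of each residence type are included.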

With this sample of 300, you calculate that the average (mean) rent is 15K. Some houses rent for less and some for more. You describe how far the rents are spread out from the mean using the variance and standard deviation. When you summarize the sample using these metrics (mean, variance and SD), you are performing **descriptive statistics**.
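
A minimal sketch of these metrics, using Python's standard `statistics` module on a handful of made-up rents (in thousands) rather than the full sample of 300:

```python
import statistics

# Hypothetical rents in thousands; a real sample would have 300 values.
rents = [12, 13.5, 14, 15, 15, 16, 16.5, 18]

mean_rent = statistics.mean(rents)     # central value of the sample
var_rent = statistics.variance(rents)  # sample variance (divides by n - 1)
sd_rent = statistics.stdev(rents)      # sample standard deviation

print(f"mean={mean_rent}, variance={var_rent:.2f}, sd={sd_rent:.2f}")
```

The standard deviation is the square root of the variance, so it is expressed in the same unit as the rent itself, which makes it easier to interpret.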

Now, if you use this information to draw conclusions about house rents in the entire locality, it is called **inferential statistics**. You state the conclusion with some level of confidence, called the confidence level (usually 95%). You would say that, with 95% confidence, the average rent in this locality is 15K +/- some margin. That range is the confidence interval.
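
One common way to get the "+/- something" is the normal approximation: the margin is a z-value times the standard error of the mean. The sketch below assumes a sample standard deviation of 3K, which is a made-up figure, and the helper name `mean_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    # z-value that leaves (1 - level) / 2 in each tail, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sample_sd / math.sqrt(n)  # z times the standard error
    return sample_mean - margin, sample_mean + margin


# Mean of 15K from the sample of 300; the SD of 3K is an assumed value.
low, high = mean_confidence_interval(15.0, 3.0, 300)
print(f"95% CI: {low:.2f}K to {high:.2f}K")
```

With a sample as large as 300 the normal approximation is usually reasonable. For small samples a t-distribution would be the more careful choice.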

After seeing the variability in the data, you realize that the rent may vary for many reasons. You **hypothesize** that it may depend on the age of the apartment, carpet area, water availability, features like a gym, availability of alternate water sources, maintenance cost, proximity to transportation, street lighting in the area, crime rate, nearby hotels, medical shops and groceries, playground, builder type, parking area, garden area, and so on.

For each house in the selected sample, you collect the data through some mechanism. You realize that some data points are missing for some houses, and that some residents have provided the built-up area instead of the carpet area. You want to represent the age of the house as < 1 year, 1 to 5 years, 5 to 10 years, 10+ years, etc. instead of a numeric age. You are now applying **data imputation, correction and transformation** steps.
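
The three steps could look roughly like this on a few made-up records. The field names, the median as the fill value, the bucket edges and above all `ASSUMED_CARPET_RATIO` are assumptions for illustration. A real conversion from built-up area to carpet area would need a locally valid rule.

```python
import statistics

# Hypothetical records; None marks a missing value.
houses = [
    {"id": 1, "carpet_area": 900, "built_up_area": None, "age_years": 0.5},
    {"id": 2, "carpet_area": None, "built_up_area": 1250, "age_years": 3},
    {"id": 3, "carpet_area": 1100, "built_up_area": None, "age_years": None},
    {"id": 4, "carpet_area": None, "built_up_area": None, "age_years": 12},
]

ASSUMED_CARPET_RATIO = 0.8  # assumption only, not a real-world constant


def age_bucket(age):
    """Map a numeric age in years to a category label."""
    if age < 1:
        return "< 1 year"
    if age < 5:
        return "1 to 5 years"
    if age < 10:
        return "5 to 10 years"
    return "10+ years"


# Correction: derive carpet area where only built-up area was reported.
for h in houses:
    if h["carpet_area"] is None and h["built_up_area"] is not None:
        h["carpet_area"] = h["built_up_area"] * ASSUMED_CARPET_RATIO

# Imputation: fill values that are still missing with the median of the rest.
for field in ("carpet_area", "age_years"):
    known = [h[field] for h in houses if h[field] is not None]
    fill = statistics.median(known)
    for h in houses:
        if h[field] is None:
            h[field] = fill

# Transformation: replace the numeric age with a category.
for h in houses:
    h["age_group"] = age_bucket(h["age_years"])

for h in houses:
    print(h["id"], h["carpet_area"], h["age_group"])
```

The order matters here: correcting first means the median used for imputation is computed from as many real observations as possible.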

On the cleaned-up data, you perform some **data exploration** steps. You check whether there is any correlation between rent and each of the hypothesized variables, and shortlist fewer variables than you initially set out to check.
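
A small sketch of this shortlisting step, using the Pearson correlation coefficient written out by hand. The data, the two candidate variables and the cutoff of 0.5 are all made up for illustration, and the cutoff in particular is a judgment call rather than a standard.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data for five houses (rent in thousands).
rent = [12, 14, 15, 17, 20]
candidates = {
    "carpet_area": [800, 950, 1000, 1150, 1400],
    "street_lights": [4, 2, 5, 3, 4],
}

THRESHOLD = 0.5  # assumed cutoff on the absolute correlation

correlations = {name: pearson(values, rent) for name, values in candidates.items()}
shortlisted = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
print("shortlisted:", shortlisted)
```

Correlation only captures linear association, so a variable dropped here could still matter in a non-linear way.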

You move the selected variables, the ones that seem to be correlated with rent, into a modeling stage. In this example you have a target variable (rent), so you apply a **supervised technique**, and since the target variable is a numeric value, you apply a technique called **regression**.

When you run this regression model, the machine looks into the data and tries to form some kind of relation between the rent and the selected fields. It comes out with an approximate logic (also called an **analytical model**) that explains the relation between the target and the input variables. This is called **machine learning**, as the machine determined the logic based on the data and we didn’t force the logic. If you enrich the data with more relevant input variables or a larger sample, the machine would generally learn better and create an equation or model that is closer to reality.
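
To make "the machine determines the logic" concrete, here is a minimal sketch of the simplest case: ordinary least squares with a single input variable. The data is made up, and a real model would use several inputs and normally a library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: carpet area vs. rent in thousands.
carpet_area = [800, 950, 1000, 1150, 1400]
rent = [12, 14, 15, 17, 20]

slope, intercept = fit_line(carpet_area, rent)
print(f"rent = {intercept:.2f} + {slope:.4f} * carpet_area")
```

Nobody typed in the slope or the intercept: both come out of the data, which is the sense in which the logic is learned rather than forced.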

With this model, if you are able to explain how rent is influenced by a set of variables on the existing data, you have performed **diagnostic analytics**. You test this model with a few more samples and check how close it is to reality.
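
Checking "how close it is to reality" can be as simple as comparing predictions with actual rents on houses the model has not seen. The sketch below uses the mean absolute error. The model coefficients and the test houses are made-up values, not results from real data.

```python
# Hypothetical fitted model: rent (in thousands) as a function of carpet area.
INTERCEPT = 1.4
SLOPE = 0.0134


def predict_rent(carpet_area):
    return INTERCEPT + SLOPE * carpet_area


# Hypothetical held-out houses: (carpet area, actual rent in thousands).
test_houses = [(900, 13.0), (1200, 18.0), (1300, 18.5)]

# Absolute gap between prediction and reality for each house.
errors = [abs(predict_rent(area) - actual) for area, actual in test_houses]
mae = sum(errors) / len(errors)  # mean absolute error

print(f"mean absolute error: {mae:.2f}K")
```

Whether an average miss of this size is acceptable depends on the decision being made. For a rent of around 15K, being off by a few hundred may be good enough.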

When you are happy with the results, you can look at a particular house to purchase and, based on input variables like parking area, carpet area, etc., use the same model to “predict” what the rent would be. This is called **predictive analytics**, where for known inputs, you are trying to predict the output.

Suppose you are fixed on a minimum rent of 27K (the target is fixed). You then want to understand what combination of input variables you need: for example, buy a house near the main road, which can be an independent house, to be let out on a corporate lease, with a minimum of 3 ACs and servant quarters. You use the same model and run it through an **optimization technique** with different inputs to find the combinations that reach or maximize the rent. This is called **prescriptive analytics**.
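
As a rough illustration, the simplest stand-in for an optimization technique is to enumerate every combination of inputs and keep the ones whose predicted rent reaches the target. The model coefficients and the input options below are made up, and brute-force enumeration only works when the number of combinations is small.

```python
from itertools import product

# Hypothetical linear model (rent in thousands); not fitted to real data.
BASE = 8.0
COEF = {
    "near_main_road": 4.0,
    "corporate_lease": 5.0,
    "servant_quarters": 3.0,
    "ac_count": 2.5,
}


def predict_rent(option):
    return BASE + sum(COEF[name] * value for name, value in option.items())


# Values each input is allowed to take (0/1 flags and a count of ACs).
choices = {
    "near_main_road": [0, 1],
    "corporate_lease": [0, 1],
    "servant_quarters": [0, 1],
    "ac_count": [0, 1, 2, 3],
}

TARGET = 27.0  # minimum rent in thousands

# Brute force: build every combination of input values.
options = [dict(zip(choices, combo)) for combo in product(*choices.values())]

feasible = [o for o in options if predict_rent(o) >= TARGET]
best = max(options, key=predict_rent)

print("combinations that reach the target:", feasible)
print("highest predicted rent:", predict_rent(best))
```

With these assumed coefficients only one combination clears 27K. With more inputs or continuous values, a proper solver would replace the exhaustive loop.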

Where does AI fit in here?
