Predictive analysis in R is a branch of analysis that applies statistical operations to historical data in order to predict future events. It is a common technique in data mining and machine learning. Methods such as time series analysis and non-linear least squares are used in predictive analysis. Predictive analytics can help many businesses: it finds relationships in the collected data and, based on those relationships, predicts patterns, allowing businesses to build predictive intelligence.

We'll discuss the process, need, and applications of predictive analysis with example code.

#### Process of Predictive Analysis

Predictive analysis consists of 7 processes as follows:

- **Define project:** Define the project scope, objectives, and expected results.
- **Data collection:** Data is collected through data mining, providing a complete view of customer interactions.
- **Data analysis:** The process of cleaning, inspecting, transforming, and modelling the data.
- **Statistics:** Validating the assumptions and testing the statistical models.
- **Modelling:** Predictive models are generated using statistics, and the most optimized model is chosen for deployment.
- **Deployment:** The predictive model is deployed to automate everyday decision-making.
- **Model monitoring:** The model is monitored continuously to review performance and ensure the expected results.

#### Need of Predictive Analysis

- **Understanding customer behavior:** Predictive analysis uses data mining to extract attributes and behaviors of customers. It also identifies customers' interests so that a business can promote the products they are most likely to buy.
- **Gaining a competitive edge in the market:** With predictive analysis, businesses can grow fast and stand out from competitors by identifying their weaknesses and strengths.
- **Learning new opportunities to increase revenue:** Companies can create new offers or discounts based on customer patterns, increasing revenue.
- **Finding areas of weakness:** Using these methods, companies can win back lost customers by identifying past actions taken by the company that customers didn't like.

#### Applications of Predictive Analysis

- **Health care:** Predictive analysis can be used to examine a patient's history and thereby determine risks.
- **Financial modelling:** Predictive analysis plays a major role in identifying trending stocks, helping businesses in the decision-making process.
- **Customer relationship management:** Predictive analysis helps firms create marketing campaigns and customer services based on the analysis produced by predictive algorithms.
- **Risk analysis:** While forecasting campaigns, predictive analysis can estimate profit and also helps in evaluating risk.

**Example:**

Let us take an example of time series analysis, which is a method of predictive analysis, in R programming:

```r
x <- c(580, 7813, 28266, 59287, 75700,
       87820, 95314, 126214, 218843, 471497,
       936851, 1508725, 2072113)

# library required for decimal_date() function
library(lubridate)

# output to be created as png file
png(file = "predictiveAnalysis.png")

# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
          frequency = 365.25 / 7)

# plotting the graph
plot(mts, xlab = "Weekly Data of sales",
     ylab = "Total Revenue",
     main = "Sales vs Revenue",
     col.main = "darkgreen")

# saving the file
dev.off()
```

**Output:**

**Forecasting Data:**

Now, forecasting sales and revenue based on historical data.

```r
x <- c(580, 7813, 28266, 59287, 75700,
       87820, 95314, 126214, 218843,
       471497, 936851, 1508725, 2072113)

# library required for decimal_date() function
library(lubridate)

# library required for forecasting
library(forecast)

# output to be created as png file
png(file = "forecastSalesRevenue.png")

# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
          frequency = 365.25 / 7)

# forecasting model using arima model
fit <- auto.arima(mts)

# Next 5 forecasted values
forecast(fit, 5)

# plotting the graph with next
# 5 weekly forecasted values
plot(forecast(fit, 5), xlab = "Weekly Data of Sales",
     ylab = "Total Revenue",
     main = "Sales vs Revenue", col.main = "darkgreen")

# saving the file
dev.off()
```

**Output:**

### Performing Hierarchical Cluster Analysis using R

**Cluster analysis** or clustering is a technique to find subgroups of data points within a data set. The data points belonging to the same subgroup have similar features or properties. Clustering is an unsupervised machine learning approach and has a wide variety of applications such as market research, pattern recognition, recommendation systems, and so on. The most common algorithms used for clustering are K-means clustering and Hierarchical cluster analysis. In this article, we will learn about hierarchical cluster analysis and its implementation in R programming.

**Hierarchical cluster analysis** (also known as hierarchical clustering) is a clustering technique where clusters have a hierarchy or a predetermined order. Hierarchical clustering can be represented by a tree-like structure called a **Dendrogram**. There are two types of hierarchical clustering:

- **Agglomerative hierarchical clustering**: This is a bottom-up approach where each data point starts in its own cluster, and similar pairs of clusters are merged as one moves up the hierarchy.
- **Divisive hierarchical clustering**: This is a top-down approach where all data points start in one cluster, and clusters are split recursively as one moves down the hierarchy.
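To make the bottom-up idea concrete, here is a minimal sketch using base R's *hclust* on four arbitrary 1-D points (the values are invented for illustration). The `merge` component of the result records the merge order, and `height` records the distance at which each merge happens.

```r
# Four 1-D points chosen for illustration
pts <- c(1, 2, 6, 10)

# Agglomerative clustering with complete linkage
hc <- hclust(dist(pts), method = "complete")

# hc$merge records the bottom-up merge order:
# step 1 fuses points 1 and 2 (distance 1),
# step 2 fuses points 3 and 4 (distance 4),
# step 3 fuses the two resulting clusters (distance 9).
hc$merge
hc$height  # 1, 4, 9
```

Negative entries in `merge` refer to original observations; positive entries refer to clusters formed in earlier steps.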

To measure the similarity or dissimilarity between a pair of data points, we use distance measures (Euclidean distance, Manhattan distance, etc.). However, to find the dissimilarity between two clusters of observations, we use agglomeration methods. The most common agglomeration methods are:

- **Complete linkage clustering**: It computes all pairwise dissimilarities between the observations in two clusters, and takes the longest (maximum) distance as the distance between the two clusters.
- **Single linkage clustering**: It computes all pairwise dissimilarities between the observations in two clusters, and takes the shortest (minimum) distance as the distance between the two clusters.
- **Average linkage clustering**: It computes all pairwise dissimilarities between the observations in two clusters, and takes the average distance as the distance between the two clusters.
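The three linkage rules above can be computed by hand on a toy example (a sketch with two tiny 1-D clusters whose values are invented for illustration):

```r
# Two tiny 1-D clusters (arbitrary values)
c1 <- c(1, 2)
c2 <- c(6, 9)

# All pairwise dissimilarities between the two clusters
pairwise <- abs(outer(c1, c2, "-"))

complete_link <- max(pairwise)   # longest distance: |1 - 9| = 8
single_link   <- min(pairwise)   # shortest distance: |2 - 6| = 4
average_link  <- mean(pairwise)  # mean of 5, 4, 8, 7 = 6
```

The choice of linkage changes which clusters get merged first and therefore the shape of the resulting dendrogram.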

### Performing Hierarchical Cluster Analysis using R

For computing hierarchical clustering in R, the commonly used functions are as follows:

- **hclust** in the stats package and **agnes** in the cluster package for agglomerative hierarchical clustering.
- **diana** in the cluster package for divisive hierarchical clustering.

We will use the Iris flower data set from the datasets package in our implementation, with the sepal width, sepal length, petal width, and petal length columns as our data points. First, we load and normalize the data. Then the dissimilarity values are computed with the *dist* function and fed to the clustering functions for performing hierarchical clustering.

```r
# Load required packages
library(datasets)   # contains iris dataset
library(cluster)    # clustering algorithms
library(factoextra) # visualization
library(purrr)      # to use map_dbl() function

# Load and preprocess the dataset
df <- iris[, 1:4]
df <- na.omit(df)
df <- scale(df)

# Dissimilarity matrix
d <- dist(df, method = "euclidean")
```

**Agglomerative hierarchical clustering implementation**

The dissimilarity matrix obtained is fed to *hclust*. The *method* parameter of *hclust* specifies the agglomeration method to be used (i.e. complete, average, single). We can then plot the dendrogram.

```r
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete")

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
```

**Output:**

Observe that in the above dendrogram, each leaf corresponds to one observation, and as we move up the tree, similar observations are fused at greater heights. The height at which two observations are fused indicates how dissimilar they are. To identify clusters, we can cut the dendrogram with *cutree* and then visualize the result in a scatter plot using the *fviz_cluster* function from the *factoextra* package.

```r
# Cut tree into 3 groups
sub_grps <- cutree(hc1, k = 3)

# Visualize the result in a scatter plot
fviz_cluster(list(data = df, cluster = sub_grps))
```

**Output:**

We can also provide a border to the dendrogram around the 3 clusters as shown below.

```r
# Plot the obtained dendrogram with
# rectangle borders for k clusters
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k = 3, border = 2:4)
```

**Output:**

Alternatively, we can use the *agnes* function to perform the hierarchical clustering. Unlike *hclust*, the *agnes* function gives the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure).

```r
# agglomeration methods to assess
m <- c("average", "single", "complete")
names(m) <- c("average", "single", "complete")

# function to compute the agglomerative
# coefficient for a given method
ac <- function(x) {
  agnes(df, method = x)$ac
}

map_dbl(m, ac)
```

**Output:**

```
  average    single  complete 
0.9035705 0.8023794 0.9438858 
```

Complete linkage gives the strongest clustering structure, so we use this agglomeration method to perform hierarchical clustering with the *agnes* function as shown below.

```r
# Hierarchical clustering
hc2 <- agnes(df, method = "complete")

# Plot the obtained dendrogram
pltree(hc2, cex = 0.6, hang = -1,
       main = "Dendrogram of agnes")
```

**Output:**

**Divisive clustering implementation**

The *diana* function, which works similarly to *agnes*, allows us to perform divisive hierarchical clustering. However, there is no *method* argument to provide.

```r
# Compute divisive hierarchical clustering
hc3 <- diana(df)

# Divisive coefficient
hc3$dc

# Plot obtained dendrogram
pltree(hc3, cex = 0.6, hang = -1,
       main = "Dendrogram of diana")
```

**Output:**

```
[1] 0.9397208
```