Compute multiple types of correlation analysis (Pearson correlation, R^2 coefficient of linear regression, Cramér's V measure of association, Distance Correlation, the Maximal Information Coefficient, Uncertainty coefficient, and Predictive Power Score) on large data frames with mixed column classes (integer, numeric, factor, and character) using a parallel backend.
Correlation-like analysis provides an important statistical measure that describes the size and direction of an association between variables. However, there are few R packages that can efficiently perform this type of analysis on large datasets with mixed data types. The corrp package provides a full suite of solutions for computing various correlation-like measures, such as Pearson correlation, Distance Correlation, Maximal Information Coefficient (MIC), Predictive Power Score (PPS), Cramér’s V, and the Uncertainty Coefficient. These methods support the analysis of data frames with mixed classes (integer, numeric, factor, and character).
Additionally, it offers a C++ implementation of the Average Correlation Clustering Algorithm (ACCA), which was originally developed for genetic studies using Pearson correlation as a similarity measure. ACCA is an unsupervised clustering method: it identifies patterns in the data without requiring predefined labels. Like k-means, it requires the number of clusters K to be defined in advance. One of its main differences from other clustering methods is that it operates on correlations rather than on traditional distance metrics such as Euclidean or Mahalanobis distance.
In this package, the ACCA algorithm has been extended to work directly with correlation matrices derived from different association methods, depending on the data types and user preferences. Furthermore, the package is designed for parallel processing in R, making it highly efficient for large datasets.
The corrp package, under development by the Meantrix team and originally based on the cor2 function by Srikanth KS (talegari), gives R users a way to run correlation analysis on large data.frames, tibbles, or data.tables through an R parallel backend and C++ functions.
The data.frame may have columns of these four classes: integer, numeric, factor, and character. Character columns are treated as categorical variables.
In this new package, the correlation is automatically computed according to the following options:
All statistical tests are controlled by the significance level given in the p.value parameter. If a test does not reach significance at that level, the value of isig will be FALSE.
If any error occurs during the computation, the association measure (infer.value) will be NA.
The result data and index will have N^2 rows, where N is the number of variables of the input data.
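As a quick check of that row count, the sketch below enumerates every ordered pair of columns of iris using base R only. It does not call corrp, and the name pairs_grid is purely illustrative.

```r
# Enumerate all ordered (varx, vary) pairs of the columns of iris,
# including each variable paired with itself.
vars <- names(iris)
pairs_grid <- expand.grid(vary = vars, varx = vars, stringsAsFactors = FALSE)

n_vars <- length(vars)        # N = 5 for iris
n_pairs <- nrow(pairs_grid)   # N^2 = 25 ordered pairs
```

This matches the examples further down, where each pair appears in both orders (for instance Sepal.Length/Sepal.Width and Sepal.Width/Sepal.Length).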
By default, the statistical significance test for the PPS algorithm is not calculated, as it is prohibitively expensive for medium to large datasets. In this case isig is NA; you can enable the test by setting ptest = TRUE in pps.args.
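Because isig can be TRUE, FALSE, or NA, and infer.value can be NA after an error, filtering the result needs to handle missing values explicitly. The following is a minimal base R sketch on a toy stand-in for the result data: the first two rows reuse values from the iris output shown later in this document, while the third row (x, y) is hypothetical and only there to illustrate a failed pair.

```r
# Toy stand-in for results$data; only the columns used here are included.
res <- data.frame(
  infer.value = c(0.2770503, 0.5591864, NA),
  isig = c(TRUE, NA, FALSE),
  varx = c("Sepal.Length", "Sepal.Length", "x"),   # "x" is hypothetical
  vary = c("Sepal.Width", "Species", "y"),         # "y" is hypothetical
  stringsAsFactors = FALSE
)

# Keep only pairs whose test was run and was significant.
# The is.na() guard drops rows where isig is NA (test not computed).
significant <- res[!is.na(res$isig) & res$isig, ]

# Pairs where the association measure could not be computed.
failed <- res[is.na(res$infer.value), ]
```

Indexing with res$isig alone would return an all-NA row for every NA in isig, which is why the guard is included.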
All the *.args arguments can modify the parameters (p.value, comp, alternative, num.s, rk, ptest) of the method named in their prefix.
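As a sketch of how a *.args argument might be passed, the call below repeats the iris example from this document and adds pps.args to turn on the PPS significance test. Passing pps.args as a named list is an assumption about the interface rather than something stated here, and the call only runs if corrp is installed.

```r
# Assumption: *.args arguments are supplied as named lists.
pps_args <- list(ptest = TRUE)

if (requireNamespace("corrp", quietly = TRUE)) {
  # Same measures as the iris example, with the PPS test enabled.
  # This can be slow on larger data sets.
  results_ptest <- corrp::corrp(
    iris,
    cor.nn = "mic", cor.nc = "pps", cor.cc = "uncoef",
    n.cores = 2, verbose = FALSE,
    pps.args = pps_args
  )
}
```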
Before you begin, ensure you have met the following requirement(s):
R >= 3.6.2 installed.
Install the development version from GitHub:
library('remotes')
remotes::install_github("meantrix/corrp@main")
The corrp package provides seven main functions for correlation calculations, clustering, and basic data manipulation: corrp, corr_fun, corr_matrix, corr_rm, acca, sil_acca, and best_acca.
corrp
Next, we calculate the correlations for the iris data set using the Maximal Information Coefficient for numeric pairs, the Predictive Power Score algorithm for numeric/categorical pairs, and the Uncertainty coefficient for categorical pairs.
# corrp on iris using parallel processing
results = corrp::corrp(iris, cor.nn = 'mic', cor.nc = 'pps', cor.cc = 'uncoef', n.cores = 2, verbose = FALSE)
# a sequential example with different correlation pair types
results_2 = corrp::corrp(mtcars, cor.nn = 'pps', cor.nc = 'lm', cor.cc = 'cramersV', parallel = FALSE, verbose = FALSE)
head(results$data)
# infer infer.value stat stat.value isig msg varx vary
# Maximal Information Coefficient 0.9994870 P-value 0.0000000 TRUE Sepal.Length Sepal.Length
# Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Length Sepal.Width
# Maximal Information Coefficient 0.7682996 P-value 0.0000000 TRUE Sepal.Length Petal.Length
# Maximal Information Coefficient 0.6683281 P-value 0.0000000 TRUE Sepal.Length Petal.Width
# Predictive Power Score 0.5591864 F1_weighted 0.7028029 NA Sepal.Length Species
# Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Width Sepal.Length
head(results_2$data)
# infer infer.value stat stat.value isig msg varx vary
# Predictive Power Score 1.0000000 <NA> NA NA mpg mpg
# Predictive Power Score 0.3861810 MAE 0.8899206 NA mpg cyl
# Predictive Power Score 0.3141056 MAE 74.7816795 NA mpg disp
# Predictive Power Score 0.2311418 MAE 42.3961506 NA mpg hp
# Predictive Power Score 0.1646116 MAE 0.3992651 NA mpg drat
# Predictive Power Score 0.2075760 MAE 0.5768637 NA mpg wt
corr_matrix
Using the previous result we can create a correlation matrix as follows:
m = corr_matrix(results,col = 'infer.value',isig = TRUE)
m
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# Sepal.Length 0.9994870 0.2770503 0.7682996 0.6683281 0.4075487
# Sepal.Width 0.2770503 0.9967831 0.4391362 0.4354146 0.2012876
# Petal.Length 0.7682996 0.4391362 1.0000000 0.9182958 0.7904907
# Petal.Width 0.6683281 0.4354146 0.9182958 0.9995144 0.7561113
# Species 0.5591864 0.3134401 0.9167580 0.9398532 0.9999758
# attr(,"class")
# [1] "cmatrix" "matrix"
Now we can cluster the data set variables with ACCA, using the correlation matrix.
As an example, consider 2 clusters (k = 2):
acca.res = acca(m,2)
acca.res
# $cluster1
# [1] "Species" "Sepal.Length" "Petal.Width"
#
# $cluster2
# [1] "Petal.Length" "Sepal.Width"
#
# attr(,"class")
# [1] "acca_list" "list"
We can also calculate the average silhouette width for the clustering acca.res:
sil_acca(acca.res,m)
# [1] -0.02831006
# attr(,"class")
# [1] "corrpstat"
# attr(,"statistic")
# [1] "Silhouette"
Observations with a large average silhouette width (almost 1) are very well clustered.
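For reference, the textbook silhouette width of an observation i is usually written as below, where a(i) is the mean dissimilarity of i to the other members of its own cluster and b(i) is the smallest mean dissimilarity of i to the members of any other cluster. How sil_acca derives dissimilarities from the correlation matrix is not stated here, so this is the standard definition and not necessarily the package's exact computation.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

The average silhouette width is the mean of s(i) over all observations. Under this definition, values near 0 or below, such as the one in the output above, indicate weakly separated clusters.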
To contribute to corrp, follow these steps:
git checkout -b <branch_name>
git commit -m '<commit_message>'
git push origin corrp/<location>
Alternatively see the GitHub documentation on creating a pull request.
If you have detected a bug (or want to ask for a new feature), please file an issue with a minimal reproducible example on GitHub.
This project uses the following license: GPL-3 License.