Compute multiple types of correlation analysis (Pearson correlation, R^2 coefficient of linear regression, Cramér's V measure of association, distance correlation, the maximal information coefficient, uncertainty coefficient and Predictive Power Score) on large data frames with mixed column classes (integer, numeric, factor and character) using a parallel R backend.
This package also includes a C++ implementation of the Average Correlation Clustering Algorithm (ACCA)
that works directly on the correlation matrix. In this respect the implementation differs from the original:
it works with mixed data and several correlation methods.
The corrp package, under development by the Meantrix team and originally based on the cor2 function by Srikanth KS (talegari), gives R users a way to run correlation analysis on large data.frames, tibbles or data.tables through a parallel R backend and C++ functions.
The data.frame may contain columns of four classes: integer, numeric, factor and character. Character columns are treated as categorical variables.
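As an illustration of how mixed columns can be sorted into the two groups described above, the following base R sketch labels each column as numeric or categorical. The helper name classify_columns is hypothetical and not part of corrp; it only mirrors the stated rule that character columns count as categorical.

```r
# Hypothetical helper (not part of corrp): label each column of a data.frame
# as "numeric" (integer or numeric) or "categorical" (factor or character).
classify_columns <- function(df) {
  vapply(df, function(col) {
    if (is.numeric(col)) {
      "numeric"          # is.numeric() is TRUE for integer and double columns
    } else if (is.factor(col) || is.character(col)) {
      "categorical"      # character columns are treated as categorical
    } else {
      NA_character_      # any other class is outside the four supported ones
    }
  }, character(1))
}

col_types <- classify_columns(iris)
print(col_types)
```

For iris this labels the four measurement columns as numeric and Species as categorical.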
In this package the correlation method is selected automatically according to the type of each column pair: numeric/numeric, numeric/categorical or categorical/categorical.
All statistical tests are controlled by the significance level given in the p.value parameter. If a test does not reach the significance set by p.value, the output variable isig is FALSE by default.
There is no statistical significance test for the pps algorithm; isig = TRUE in this case.
If any error occurs during the operations, the correlation is NA by default.
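The rules above can be sketched for a single numeric pair in base R. This is only an illustration under assumptions: the function safe_pearson and its return layout are hypothetical, stats::cor.test stands in for whichever test corrp actually runs, and "significant" is taken to mean a test p-value below the threshold.

```r
# Hypothetical sketch (not corrp's code): Pearson test for one numeric pair,
# returning NA when the computation fails and flagging significance.
safe_pearson <- function(x, y, p.value = 0.05) {
  tryCatch({
    test <- stats::cor.test(x, y, method = "pearson")
    list(
      infer.value = unname(test$estimate),
      stat.value  = test$p.value,
      # Assumption: significant when the test p-value is below the threshold
      isig        = test$p.value < p.value
    )
  }, error = function(e) {
    # On any error the correlation is reported as NA.
    # Assumption: isig is set to FALSE in that case.
    list(infer.value = NA_real_, stat.value = NA_real_, isig = FALSE)
  })
}

ok  <- safe_pearson(iris$Sepal.Length, iris$Petal.Length)
bad <- safe_pearson(1:2, 1:2)  # too few observations, so cor.test() errors
```

Wrapping each pair in tryCatch means one failing pair does not stop the whole run, which matters when many pairs are evaluated in parallel.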
Before you begin, ensure you have met the following requirement:
R >= 3.6.2 installed.
Install the development version from GitHub:
library('remotes')
remotes::install_github("meantrix/corrp@main")
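To confirm the R version requirement stated above, a quick check with base R could look like this. The variable names are illustrative only.

```r
# Compare the running R version with the stated minimum (3.6.2).
min_version <- "3.6.2"
meets_requirement <- getRversion() >= min_version
if (!meets_requirement) {
  message("R ", as.character(getRversion()),
          " is older than the required ", min_version)
}
```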
The corrp package provides seven main functions for correlation calculations, clustering and basic data manipulation: corrp, corr_fun, corr_matrix, corr_rm, acca, sil_acca and best_acca.
corrp
Next, we calculate the correlations for the iris data set using the Maximal Information Coefficient for numeric pairs, the Predictive Power Score algorithm for numeric/categorical pairs and the Uncertainty coefficient for categorical pairs.
results = corrp::corrp(iris, cor.nn = 'mic', cor.nc = 'pps', cor.cc = 'uncoef', n.cores = 2, verbose = FALSE)
head(results$data)
#                           infer infer.value        stat stat.value isig msg         varx         vary
# Maximal Information Coefficient   0.9994870     P-value  0.0000000 TRUE     Sepal.Length Sepal.Length
# Maximal Information Coefficient   0.2770503     P-value  0.0000000 TRUE     Sepal.Length  Sepal.Width
# Maximal Information Coefficient   0.7682996     P-value  0.0000000 TRUE     Sepal.Length Petal.Length
# Maximal Information Coefficient   0.6683281     P-value  0.0000000 TRUE     Sepal.Length  Petal.Width
#          Predictive Power Score   0.5591864 F1_weighted  0.7028029 TRUE     Sepal.Length      Species
# Maximal Information Coefficient   0.2770503     P-value  0.0000000 TRUE      Sepal.Width Sepal.Length
corr_matrix
Using the previous result we can create a correlation matrix as follows:
m = corr_matrix(results, col = 'infer.value', isig = TRUE)
m
#              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# Sepal.Length    0.9994870   0.2770503    0.7682996   0.6683281 0.4075487
# Sepal.Width     0.2770503   0.9967831    0.4391362   0.4354146 0.2012876
# Petal.Length    0.7682996   0.4391362    1.0000000   0.9182958 0.7904907
# Petal.Width     0.6683281   0.4354146    0.9182958   0.9995144 0.7561113
# Species         0.5591864   0.3134401    0.9167580   0.9398532 0.9999758
# attr(,"class")
# [1] "cmatrix" "matrix"
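To make the step from the long-format results to a square matrix concrete, here is a minimal base R sketch of such a pivot. It is not the code of corr_matrix: the small data.frame is built by hand from values printed above, rows are indexed by vary and columns by varx because that orientation appears to match the printed matrix, and leaving non-significant pairs as NA is an assumption made only for this sketch.

```r
# Hypothetical sketch (not corr_matrix's code): pivot long-format pair
# results into a square matrix. Values are copied from the output above.
long <- data.frame(
  varx = c("Sepal.Length", "Sepal.Length", "Sepal.Width", "Sepal.Width"),
  vary = c("Sepal.Length", "Sepal.Width", "Sepal.Length", "Sepal.Width"),
  infer.value = c(0.9994870, 0.2770503, 0.2770503, 0.9967831),
  isig = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

vars <- unique(long$varx)
mat <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
              dimnames = list(vars, vars))

# Assumption: only significant pairs are filled in; others stay NA here
keep <- long[long$isig, ]
# A two-column character matrix indexes the matrix by its dimnames
mat[cbind(keep$vary, keep$varx)] <- keep$infer.value
print(mat)
```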
Now we can cluster the data set variables with ACCA, using the correlation matrix.
As an example, consider two clusters (k = 2):
acca.res = acca(m, 2)
acca.res
# $cluster1
# [1] "Species" "Sepal.Length" "Petal.Width"
#
# $cluster2
# [1] "Petal.Length" "Sepal.Width"
#
# attr(,"class")
# [1] "acca_list" "list"
We can also calculate the average silhouette width for the clustering acca.res:
sil_acca(acca.res, m)
# [1] -0.02831006
# attr(,"class")
# [1] "corrpstat"
# attr(,"statistic")
# [1] "Silhouette"
Observations with a large average silhouette width (close to 1) are very well clustered; values near 0 or below, as in this example, indicate overlapping or poorly separated clusters.
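For readers unfamiliar with the measure, the following base R sketch computes a generic average silhouette width for clusters of variables. It is not the code of sil_acca and may not reproduce its numbers: the function avg_silhouette is hypothetical, the dissimilarity between two variables is assumed to be 1 minus their correlation, and the toy matrix is invented for illustration. The sketch expects at least two clusters.

```r
# Generic silhouette sketch (not sil_acca's code). Assumption: dissimilarity
# between variables is taken as 1 - correlation.
avg_silhouette <- function(d, clusters) {
  n <- nrow(d)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- setdiff(which(clusters == clusters[i]), i)
    if (length(own) == 0) {  # singleton cluster: silhouette defined as 0
      s[i] <- 0
      next
    }
    a <- mean(d[i, own])     # mean distance to the rest of its own cluster
    others <- setdiff(unique(clusters), clusters[i])
    # smallest mean distance to the members of any other cluster
    b <- min(vapply(others, function(k) mean(d[i, clusters == k]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Toy correlation matrix: variables 1-2 and 3-4 form two tight groups
r <- matrix(0.1, 4, 4)
r[1, 2] <- r[2, 1] <- 0.9
r[3, 4] <- r[4, 3] <- 0.9
diag(r) <- 1

good <- avg_silhouette(1 - r, clusters = c(1, 1, 2, 2))
poor <- avg_silhouette(1 - r, clusters = c(1, 2, 1, 2))
```

With the grouping that matches the toy structure the average width is about 0.89, while the mismatched grouping gives a negative value.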
To contribute to corrp, follow these steps:
git checkout -b <branch_name>
git commit -m '<commit_message>'
git push origin corrp/<location>
Alternatively, see the GitHub documentation on creating a pull request.
If you have detected a bug (or want to ask for a new feature), please file an issue with a minimal reproducible example on GitHub.
This project uses the following license: GPL-3 License.