Data Transformation (Log, Square Root, Cube Root, Tukey Ladder, and Box-Cox Methods) in RStudio

Transforming Data
Most parametric tests require that residuals be normally distributed and homoscedastic. One approach when residuals fail to meet these conditions is to transform one or more variables so that they better follow a normal distribution. Often, just the dependent variable in a model will need to be transformed. However, in complex models and multiple regression, it is sometimes helpful to transform both dependent and independent variables that deviate greatly from a normal distribution.
There is nothing illicit in transforming variables, but you must be careful about how the results from analyses with transformed variables are reported. For example, looking at the turbidity of water across three locations, you might report, “Locations showed a significant difference in log-transformed turbidity.” To present means or other summary statistics, you might present the mean of the transformed values, or back-transform the means to their original units.
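As a minimal sketch of back-transformation, assuming made-up turbidity readings (the values and variable names are illustrative and not from the tutorial): the mean of the log values, back-transformed with exp(), is the geometric mean in the original units, which differs from the arithmetic mean.

```r
# Hypothetical turbidity readings (illustrative values only)
turbidity <- c(1.0, 1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.5, 20.0)

log_mean   <- mean(log(turbidity))  # mean on the log scale
back_mean  <- exp(log_mean)         # back-transformed: the geometric mean
arith_mean <- mean(turbidity)       # ordinary mean on the original scale

# The back-transformed mean is in original units, but for skewed positive
# data it is smaller than the arithmetic mean
print(c(log_scale = log_mean, back_transformed = back_mean, arithmetic = arith_mean))
```

When reporting, it helps to say which of these summaries is being presented.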
Some measurements in nature are naturally normally distributed. Other measurements are naturally log-normally distributed. These include some natural pollutants in water: There may be many low values with fewer high values and even fewer very high values.
For right-skewed data (tail on the right, positive skew), common transformations include square root, cube root, and log.
For left-skewed data (tail on the left, negative skew), common transformations include square root (constant - x), cube root (constant - x), and log (constant - x).
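The reflection step for left-skewed data can be sketched as follows. The scores are made up, and choosing max(x) + 1 as the constant is one common convention assumed here, not something the tutorial prescribes.

```r
# Hypothetical left-skewed scores (illustrative values only)
x <- c(2, 55, 70, 78, 84, 88, 91, 93, 95, 96, 97, 98, 99, 99, 100)

# Reflect the data so the long tail moves to the right; with
# constant = max(x) + 1 every reflected value is at least 1
k <- max(x) + 1
reflected <- k - x

x_sqrt <- sqrt(reflected)     # square root (constant - x)
x_cube <- reflected^(1/3)     # cube root (constant - x)
x_log  <- log(reflected)      # log (constant - x)
```

Note that reflection reverses the ordering: the largest original values become the smallest transformed values, which matters when interpreting the direction of an effect.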
Because log (0) is undefined, as is the log of any negative number, a constant should be added to all values to make them all positive before applying a log transformation. It is also sometimes helpful to add a constant when using other transformations.
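A small sketch of adding a constant before the log transformation, using made-up values. Shifting by 1, or shifting so that the minimum becomes 1, is an assumed choice for illustration; the tutorial does not fix a particular constant.

```r
# Hypothetical counts that include zeros (illustrative values only)
counts <- c(0, 0, 1, 3, 7, 12, 40, 150)

log(counts)                     # R returns -Inf for the two zeros

counts_log <- log(counts + 1)   # shift by 1 so every value is positive
all.equal(counts_log, log1p(counts))  # base R's log1p(x) computes log(1 + x)

# With negative values, shift so that the minimum becomes 1
vals <- c(-5, -2, 0, 4, 30)
vals_log <- log(vals - min(vals) + 1)
```

The size of the constant affects the shape of the transformed data, so it is worth stating it when reporting results.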
Another approach is to use a general power transformation, such as Tukey’s Ladder of Powers or a Box-Cox transformation. These determine a lambda value, which is used as the exponent to transform values: X.new = X ^ lambda for Tukey, and X.new = (X ^ lambda - 1) / lambda for Box-Cox. When lambda is 0, both use the log transformation instead, X.new = log(X).
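The two formulas can be written as small helper functions. This is a sketch for a fixed lambda only: the function names tukey_power and box_cox are hypothetical, and the sign flip for negative lambda follows the convention that, as far as I know, transformTukey uses to keep the ordering of values intact.

```r
# Tukey ladder transformation for a given lambda (hypothetical helper)
tukey_power <- function(x, lambda) {
  if (lambda > 0) {
    x^lambda
  } else if (lambda == 0) {
    log(x)
  } else {
    -1 * x^lambda   # sign flip keeps larger x mapped to larger values
  }
}

# Box-Cox transformation for a given lambda (hypothetical helper)
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x <- c(1, 4, 9, 16)
tukey_power(x, 0.5)   # 1 2 3 4
box_cox(x, 0.5)       # 0 2 4 6
box_cox(x, 0)         # same as log(x)
```

These helpers only apply a transformation; choosing lambda is what transformTukey and boxcox do.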
The function transformTukey in the rcompanion package finds the lambda which makes a single vector of values (that is, one variable) as normally distributed as possible with a simple power transformation.
The Box-Cox procedure is included in the MASS package with the function boxcox. It uses a log-likelihood procedure to find the lambda to use to transform the dependent variable for a linear model (such as an ANOVA or linear regression). It can also be used on a single vector.
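As a rough sketch of using boxcox with a linear model rather than a single vector, assuming the MASS package is installed and using simulated data (the variables x and y and the model are invented for illustration):

```r
library(MASS)

# Simulated data (illustrative only): y is positive and right-skewed
set.seed(1)
x <- runif(50, 1, 10)
y <- exp(0.3 * x + rnorm(50, sd = 0.4))

model <- lm(y ~ x)

# With plotit = FALSE, boxcox returns the lambda grid (x) and the
# profile log-likelihood (y) without drawing the plot
bc <- boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Transform the response with the chosen lambda and refit;
# a lambda of (near) zero corresponds to the log transformation
y_bc <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
model_bc <- lm(y_bc ~ x)
```

After refitting, the residuals of model_bc can be checked with the same histogram and Q-Q plots used elsewhere in this tutorial.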
Packages used in this tutorial
The packages used in this tutorial include:
• MASS
• rcompanion
• psych
The following commands will install these packages if they are not already installed:
if(!require(MASS)){install.packages("MASS")}
if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(psych)){install.packages("psych")}
The script for this tutorial:
# Data transformation
data=c(1,3,4,5,6,100,233,1000,1500,2000,10000,45000,9000,12000,20000)
library(rcompanion)
plotNormalHistogram(data)
qqnorm(data)
qqline(data,col="blue")
#Square root transformation
data_sqrt= sqrt(data)
library(psych)
skew(data)       # skewness of the original data
skew(data_sqrt)  # skewness after the square root transformation
plotNormalHistogram(data_sqrt)
#Cube root transformation
data_cub=sign(data) * abs(data)^(1/3)
plotNormalHistogram(data_cub)
#Log transformation
data_log =log(data)
plotNormalHistogram(data_log)
#Tukey's Ladder of Powers transformation
data_tuk =transformTukey(data,plotit=TRUE)
plotNormalHistogram(data_tuk)
#Box-Cox transformation
library(MASS)
Box = boxcox(data~ 1,lambda= seq(-2,2,0.1))
# Create a data frame with the results
Cox = data.frame(Box$x, Box$y)
# Order the new data frame by decreasing y
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),]
# Display the lambda with the greatest log likelihood
Cox2[1,]
# Extract that lambda
lambda = Cox2[1,"Box.x"]
# Transform the original data (Box-Cox uses log when lambda is 0)
data_box = if (abs(lambda) < 1e-8) log(data) else (data^lambda - 1)/lambda
plotNormalHistogram(data_box)

Comments: 16

  • @Gius3pp3K, 2 years ago

    Thank you for this brilliant video that has helped me fix a skewed target variable.

  • @zeruyimer3764, 2 years ago

    I am still following you day by day; keep it up!

  • @asmegebre1979, 1 year ago

    Thank you very much !!!

  • @asfawadugna2338, 2 years ago

    Thank you, it is an important lesson!

  • @mcdonalds1499, 2 years ago

    so helpful. thank you sir.

  • @rozymunene8561, 2 years ago

    Thank you very much. Very clearly explained and detailed, and really very helpful. However, I tried to transform my data using all of these methods and it is still not normally distributed. Which other method can I use? It is multifactorial (3 independent variables) and not very big: 48 samples.

  • @wakjiratesfahun3682, 2 years ago

    Go ahead. Don't worry about normality, because your sample size is > 30, which is fine. But I strongly recommend that you check homogeneity of variance. Regards

  • @yaneilys1832, 1 year ago

    BOX -COX 14:32

  • @yolk21, 2 years ago

    Nice video. In addition, how can we make graphs?

  • @wakjiratesfahun3682, 2 years ago

    Graphs of what?

  • @yolk21, 2 years ago

    @wakjiratesfahun3682 Bar graphs of treatments and variables. I appreciate your prompt reply.

  • @wakjiratesfahun3682, 2 years ago

    Okay I will come up soon.

  • @samihahzura4735, 2 years ago

    What is the best non-parametric test for two-way ANOVA data?

  • @wakjiratesfahun3682, 2 years ago

    Non-parametric tests are used when your data aren't normal. Therefore, the key is to figure out whether you have normally distributed data. For example, you could look at the distribution of your data. If your data are approximately normal, then you can use a parametric test. As for your request, I suggest you use the Friedman test.

  • @samihahzura4735, 2 years ago

    @wakjiratesfahun3682 Most of my data isn't normal. I tried to transform the data, but it doesn't give a good fit. I tried the ART test too. Thanks for your suggestion. It would be great if you could make a video on the Friedman test!

  • @wakjiratesfahun3682, 2 years ago

    @samihahzura4735 By the way, if you have a large sample size, you need not worry about normality. Moreover, I have already made a video about the Friedman test on my channel; please check my published videos. Have you tried to transform your data with the Box-Cox and Tukey ladder methods? Additionally, the test you mentioned, ART, is a non-parametric approach to factorial ANOVA that lets you analyze the interaction as well as the main effects. As usual, ranked data are used, but first the data for each effect (main or interaction) must be aligned before ranks are calculated. This approach is useful when the data are not normally distributed. It can be used when the homogeneity of variances assumption is violated, although there is a risk of an inflated alpha value (alpha is up to about .07 when set to .05 for the interaction effect and up to .09 for the main effects).