1. Random variables and vectors

 

Examples of Randomness



The outcomes in the examples above are not deterministic: repeating the same experiment can give a different result each time.

 

Random variables and random vectors

\(X=\) 1, 2, 3, 4, 5, or 6

\(X=\) 70.2, 69.5, 70.1, 70.0, 69.8, …

\(X=(x_1,x_2,\ldots,x_p)\)
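
The three cases above can be simulated in R. The sketch below is only illustrative and rests on assumptions that are not in the notes: the first case is treated as a fair six-sided die, the second as a measurement that is normally distributed around 70 with standard deviation 0.3, and the third as a vector of p = 3 independent standard normal components. The seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the run can be repeated

# Discrete random variable: one roll of a die (assumed fair)
die <- sample(1:6, size = 1)

# Continuous random variable: five repeated measurements near 70
# (normal with sd 0.3 is an assumption made for illustration)
measurements <- rnorm(5, mean = 70, sd = 0.3)

# Random vector X = (x_1, ..., x_p), here with p = 3 components
p <- 3
X <- rnorm(p)

print(die)
print(round(measurements, 1))
print(X)
```

Running the block again with a different seed gives different values, which is exactly what makes these quantities random.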

 

Deterministic information contained in the random events

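    # Draw 1000 values from a normal distribution with mean 70 and sd 1
    x <- rnorm(1000, mean = 70, sd = 1)

    # Histogram on the density scale
    hist(x, freq = FALSE, nclass = 20)

    # Kernel density estimate from the sample (blue, solid)
    f_x <- density(x)
    lines(f_x$x, f_x$y, col = 4)

    # True N(70, 1) density (red, dashed)
    f_r <- dnorm(sort(x), mean = 70, sd = 1)
    lines(sort(x), f_r, col = 2, lty = 2)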
    
    library(mvtnorm)
    library(ggplot2)

    # Bivariate normal with uncorrelated components (identity covariance)
    sigma <- matrix(c(1, 0, 0, 1), ncol = 2)
    x <- rmvnorm(n = 2000, mean = c(1, 2), sigma = sigma)

    # Sample means and sample covariance matrix
    colMeans(x)
## [1] 1.013762 2.014620
    var(x)
##              [,1]         [,2]
## [1,]  0.964431207 -0.004244137
## [2,] -0.004244137  1.017713166
    x <- as.data.frame(x)
    str(x)
## 'data.frame':    2000 obs. of  2 variables:
##  $ V1: num  1.759 0.109 0.842 1.76 1.436 ...
##  $ V2: num  2.53 1.04 3.64 2.06 1.96 ...
    # Scatterplot with 2D density contours
    ggplot(x, aes(x = V1, y = V2)) +
      geom_point(alpha = .2) +
      geom_density_2d() +
      theme_bw()

    # Same means, but now the two components have covariance 0.75
    sigma <- matrix(c(1, 0.75, 0.75, 1), ncol = 2)
    x <- rmvnorm(n = 2000, mean = c(1, 2), sigma = sigma)
    colMeans(x)
## [1] 0.9796303 2.0085004
    var(x)
##           [,1]      [,2]
## [1,] 1.0322964 0.7684658
## [2,] 0.7684658 1.0261408
    x <- as.data.frame(x)
    str(x)
## 'data.frame':    2000 obs. of  2 variables:
##  $ V1: num  2.336 0.125 0.442 1.423 0.133 ...
##  $ V2: num  4.493 0.948 1.258 2.625 1.513 ...
    # The contours are now tilted ellipses because V1 and V2 move together
    ggplot(x, aes(x = V1, y = V2)) +
      geom_point(alpha = .2) +
      geom_density_2d() +
      theme_bw()
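
For intuition about where the correlation comes from, correlated normal draws can also be built in base R from independent ones. The sketch below uses one standard construction, the Cholesky factor of the covariance matrix. It is not necessarily what `rmvnorm` does internally, and the seed is an arbitrary choice.

```r
set.seed(42)  # arbitrary seed
n <- 2000
sigma <- matrix(c(1, 0.75, 0.75, 1), ncol = 2)

# Independent standard normal draws, one column per component
z <- matrix(rnorm(2 * n), ncol = 2)

# chol() returns an upper triangular R with t(R) %*% R equal to sigma,
# so the rows of z %*% R have covariance matrix sigma
R <- chol(sigma)

# Shift the columns to the means 1 and 2
x <- sweep(z %*% R, 2, c(1, 2), "+")

colMeans(x)       # close to (1, 2)
cov2cor(var(x))   # off-diagonal entries close to 0.75
```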

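    # Draw 1000 values from a t distribution with 10 degrees of freedom
    x <- rt(1000, df = 10)

    hist(x, freq = FALSE, nclass = 40)

    # Kernel density estimate from the sample (blue, solid)
    f_x <- density(x)
    lines(f_x$x, f_x$y, col = 4)

    # True t(10) density (blue, dashed)
    f_r <- dt(sort(x), df = 10)
    lines(sort(x), f_r, col = 4, lty = 2)

    # Standard normal density for comparison (red, dotted, thick)
    f_r <- dnorm(sort(x), mean = 0, sd = 1)
    lines(sort(x), f_r, col = 2, lty = 3, lwd = 3)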

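    # Draw 1000 values from a chi-squared distribution with 3 degrees of freedom
    x <- rchisq(1000, df = 3)

    hist(x, freq = FALSE, nclass = 40)

    # Kernel density estimate from the sample (blue, solid)
    f_x <- density(x)
    lines(f_x$x, f_x$y, col = 4)

    # True chi-squared(3) density (blue, dashed)
    f_r <- dchisq(sort(x), df = 3)
    lines(sort(x), f_r, col = 4, lty = 2)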
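
One way to see the deterministic information contained in random draws is to track the running mean of a sample as it grows. This is a minimal sketch, assuming the same normal model (mean 70, sd 1) as the first simulation in this section. The seed and the sample size are arbitrary choices.

```r
set.seed(123)  # arbitrary seed
n <- 10000
x <- rnorm(n, mean = 70, sd = 1)

# Mean of the first k draws, for k = 1, ..., n
running_mean <- cumsum(x) / seq_len(n)

plot(running_mean, type = "l",
     xlab = "Number of draws", ylab = "Running mean")
abline(h = 70, col = 2, lty = 2)  # the true mean

# Running mean after 10, 100, 1000 and 10000 draws
running_mean[c(10, 100, 1000, n)]
```

Individual draws stay unpredictable, but the running mean settles near 70 as the number of draws increases.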

 

2. Outliers

You measure 100-meter running times for a representative sample of 560 college students. Your data are normally distributed with a couple of outliers on either end. Most values are centered around the middle, as expected. But these extreme values also represent natural variations because a variable like running time is influenced by many other factors.

 

True outliers are also present in variables with skewed distributions where many data points are spread far from the mean in one direction. It’s important to select appropriate statistical tests or measures when you have a skewed distribution or many outliers.

 

Four ways to find outliers

You can choose from several methods to detect outliers depending on your time and resources.

You can sort quantitative variables from low to high and scan for extremely low or extremely high values. Flag any extreme values that you find.

This is a simple way to check whether you need to investigate certain data points before using more sophisticated methods.

 

Example: Sorting method


Your dataset for a pilot experiment consists of 8 values.

180 156 9 176 163 1827 166 171

You sort the values from low to high and scan for extreme values.

9 156 163 166 171 176 180 1827
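
The sorting step can be reproduced in R with `sort()`. The sketch below uses the eight raw values of the pilot experiment as listed above. Looking at the two smallest and two largest values is an arbitrary choice for the scan.

```r
# Raw values from the pilot experiment
x <- c(180, 156, 9, 176, 163, 1827, 166, 171)

# Sort from low to high
sorted_x <- sort(x)
sorted_x

# Scan both ends for extreme values
head(sorted_x, 2)
tail(sorted_x, 2)
```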


 

You can use software to visualize your data with a box plot, or a box-and-whisker plot, so you can see the data distribution at a glance. This type of chart highlights minimum and maximum values (the range), the median, and the interquartile range for your data.

Many computer programs mark each outlier on the chart with an asterisk or a point, and these markers lie beyond the ends of the whiskers.
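
A box plot of this kind can be sketched in base R. The example reuses the eight pilot values from the sorting example, and the plot title is a made-up label. Base R draws outliers as individual points, and `boxplot.stats()` returns the same points as numbers.

```r
# Pilot values from the sorting example
x <- c(180, 156, 9, 176, 163, 1827, 166, 171)

# Horizontal box plot; points beyond the whiskers are drawn individually
boxplot(x, horizontal = TRUE, main = "Pilot data")

# The values that the box plot draws as separate points
flagged <- boxplot.stats(x)$out
flagged
```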

 

Statistical outlier detection involves applying statistical tests or procedures to identify extreme values.

By standardizing the data, you can convert extreme data points into z scores that tell you how many standard deviations away they are from the mean.

If a value has a high enough or low enough z score, it can be considered an outlier. As a rule of thumb, values with a z score greater than 3 or less than –3 are often treated as outliers.
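
The rule of thumb can be sketched in R as follows. The data are simulated for illustration only: 200 values from a normal distribution with mean 70 and sd 1, plus one planted extreme value of 80. None of these numbers come from the notes, and the seed is arbitrary.

```r
set.seed(7)  # arbitrary seed

# Simulated sample with one planted extreme value (80)
x <- c(rnorm(200, mean = 70, sd = 1), 80)

# Standardize: distance from the mean in units of standard deviations
z <- (x - mean(x)) / sd(x)

# Flag values whose z score is above 3 or below -3
flagged <- x[abs(z) > 3]
flagged
```

This rule needs a reasonably large sample. With n values, no z score computed from the sample mean and sample standard deviation can exceed (n - 1) / sqrt(n) in absolute value, which is about 2.47 for n = 8, so a cutoff of 3 can never flag anything in a dataset that small.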

 

The interquartile range (IQR) tells you the range of the middle half of your dataset. You can use the IQR to create “fences” around your data and then define outliers as any values that fall outside those fences.

This method is helpful if you have a few values on the extreme ends of your dataset, but you aren’t sure whether any of them might count as outliers.

 

Interquartile range method

 

Your outliers are any values greater than your upper fence or less than your lower fence.
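
The fence rule can be written as a small R function. This is a sketch, not a standard library routine: `iqr_outliers` is a hypothetical helper name, and it is applied to the eight pilot values from the sorting example. It relies on R's default `quantile()` definition, which can give slightly different quartiles than methods used for hand calculation.

```r
# Hypothetical helper: flag values outside the IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr   # lower fence
  upper <- q[2] + k * iqr   # upper fence
  list(lower = lower, upper = upper,
       outliers = x[x < lower | x > upper])
}

# Pilot values from the sorting example
x <- c(180, 156, 9, 176, 163, 1827, 166, 171)
res <- iqr_outliers(x)
res
```

For these values the fences are 137.625 and 200.625, so 9 and 1827 are flagged.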

 

Example: Using the interquartile range to find outliers

We’ll walk you through the popular IQR method for identifying outliers using a step-by-step example.

Your dataset has 11 values. You have a couple of extreme values in your dataset, so you’ll use the IQR method to check whether they are outliers.

25 37 24 28 35 22 31 53 41 64 29

Step 1: Sort your data from low to high

First, you’ll simply sort your data in ascending order.

22 24 25 28 29 31 35 37 41 53 64

 

Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)

The median is the value exactly in the middle of your dataset when all values are ordered from low to high.

Since you have 11 values, the median is the 6th value. The median value is 31.

22 24 25 28 29 31 35 37 41 53 64

 

Next, we’ll use the exclusive method for identifying Q1 and Q3. This means we remove the median from our calculations.

The Q1 is the value in the middle of the first half of your dataset, excluding the median. The first quartile value is 25.

22 24 25 28 29

 

Your Q3 value is in the middle of the second half of your dataset, excluding the median. The third quartile value is 41.

35 37 41 53 64

 

Step 3: Calculate your IQR

The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the IQR.

\[\mathrm{IQR} = Q3 - Q1 = 41 - 25 = 16\]

Step 4: Calculate your upper fence

The upper fence is the boundary around the third quartile. It tells you that any values exceeding the upper fence are outliers.

\[\text{Upper fence} = Q3 + 1.5 \times \mathrm{IQR} = 41 + 1.5 \times 16 = 41 + 24 = 65\]

Step 5: Calculate your lower fence

The lower fence is the boundary around the first quartile. Any values less than the lower fence are outliers.

\[\text{Lower fence} = Q1 - 1.5 \times \mathrm{IQR} = 25 - 1.5 \times 16 = 25 - 24 = 1\]

Step 6: Use your fences to highlight any outliers

Go back to your sorted dataset from Step 1 and highlight any values that are greater than the upper fence or less than your lower fence. These are your outliers.

22 24 25 28 29 31 35 37 41 53 64

No value falls outside the fences. The largest value, 64, lies just below the upper fence of 65, so this dataset contains no outliers under the IQR method.

 

Dealing with outliers

 

Once you’ve identified outliers, you’ll decide what to do with them. Your main options are retaining or removing them from your dataset.

For each outlier, think about whether it’s a true value or an error before deciding.

Does the outlier line up with other measurements taken from the same participant? Is this data point completely impossible or can it reasonably come from your population? What’s the most likely source of the outlier? Is it a natural variation or an error? In general, you should try to accept outliers as much as possible unless it’s clear that they represent errors or bad data.

 

Just like with missing values, the most conservative option is to keep outliers in your dataset. Keeping outliers is usually the better option when you’re not sure if they are errors.

With a large sample, outliers are expected and more likely to occur. But each outlier has less of an impact on your results when your sample is large enough. The central tendency and variability of your data won’t be as affected by a couple of extreme values when you have a large number of values.

If you have a small dataset, you may also want to retain as much data as possible to make sure you have enough statistical power. If your dataset ends up containing many outliers, you may need to use a statistical test that’s more robust to them. Non-parametric statistical tests perform better for these data.
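
The idea of robustness can be illustrated by comparing the mean and the median, reusing the eight pilot values from the sorting example. This is only a sketch of how two summary statistics react to extreme values, not a full non-parametric test. Dropping 9 and 1827 here is done purely for comparison.

```r
# Pilot values from the sorting example
x <- c(180, 156, 9, 176, 163, 1827, 166, 171)

# Same data without the two extreme values, for comparison only
x_trim <- x[x != 9 & x != 1827]

# With the extreme values: mean 356, median 168.5
c(mean = mean(x), median = median(x))

# Without them: the mean drops to about 168.7, the median stays at 168.5
c(mean = mean(x_trim), median = median(x_trim))
```

The median barely reacts to the two extreme values, which is the property that rank-based methods rely on.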

 

Outlier removal means deleting extreme values from your dataset before you perform analyses. You aim to delete any dirty data while retaining true extreme values.

It’s a tricky procedure because it’s often impossible to tell the two types apart for sure. Deleting true outliers may lead to a biased dataset and an inaccurate conclusion.

For this reason, you should only remove outliers if you have legitimate reasons for doing so. It’s important to document each outlier you remove and your reasons so that other researchers can follow your procedures.
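
One way to keep removals documented is to flag rows instead of deleting them. The sketch below is an assumption about workflow, not a prescribed procedure: the column names `excluded` and `reason`, the participant ids, and the stated reason are all hypothetical.

```r
# Pilot values from the sorting example, with made-up participant ids
dat <- data.frame(id = 1:8,
                  value = c(180, 156, 9, 176, 163, 1827, 166, 171))

# Start with nothing excluded and no reasons recorded
dat$excluded <- FALSE
dat$reason <- NA_character_

# Flag one value and record why (the reason is a hypothetical example)
is_error <- dat$value == 1827
dat$excluded[is_error] <- TRUE
dat$reason[is_error] <- "hypothetical: recording error confirmed in lab notes"

# Data used for analysis, and a log of what was left out
analysis_set <- subset(dat, !excluded)
removal_log <- subset(dat, excluded)
removal_log
```

The original data frame stays intact, so other researchers can see which values were excluded and rerun the analysis with them included.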

 

3. Bias in Data

Types of data bias: Though not exhaustive, this list contains common examples of data bias in the field, along with examples of where it occurs.

 

Cognitive Biases

 

How do I avoid data bias in analysis?

Preventing data bias is an ongoing process. Although it is sometimes difficult to know when your data, model, or machine learning algorithm is biased, there are a number of steps you can take to prevent bias or catch it early. Though far from a comprehensive list, the bullet points below provide an entry-level guide to thinking about data bias in data analysis.

 

4. Signal to Noise Ratio

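    # Signal: 2000 draws from a normal distribution with sd sigma
    sigma <- 1
    x <- rnorm(n = 2000, mean = 0, sd = sigma)

    # Add independent noise with decreasing sd: the smaller the noise,
    # the higher the signal-to-noise ratio
    error <- rnorm(2000, mean = 0, sd = 1)
    y1 <- x + error
    error <- rnorm(2000, mean = 0, sd = 0.75)
    y2 <- x + error
    error <- rnorm(2000, mean = 0, sd = 0.5)
    y3 <- x + error
    error <- rnorm(2000, mean = 0, sd = 0.25)
    y4 <- x + error

    # 2 x 2 panel of scatterplots, from noisiest (y1) to cleanest (y4)
    par(mfrow = c(2, 2), oma = c(0, 0, 0, 0))
    plot(x, y1)
    plot(x, y2)
    plot(x, y3)
    plot(x, y4)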
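
The four panels can also be summarized numerically. The sketch below assumes one common definition of the signal-to-noise ratio, the variance of the signal divided by the variance of the noise, which the notes do not state explicitly. Under that definition and the model y = x + error, the squared correlation between x and y is close to SNR / (1 + SNR). The seed is arbitrary.

```r
set.seed(2024)  # arbitrary seed
n <- 2000
x <- rnorm(n, mean = 0, sd = 1)        # signal with variance 1
noise_sd <- c(1, 0.75, 0.5, 0.25)      # same noise levels as above

# One row per noise level
results <- t(sapply(noise_sd, function(s) {
  error <- rnorm(n, mean = 0, sd = s)
  y <- x + error
  c(noise_sd = s,
    snr_theory = 1 / s^2,               # var(signal) / var(noise)
    snr_sample = var(x) / var(error),
    r_squared = cor(x, y)^2)
}))

round(results, 2)
```

As the noise sd shrinks from 1 to 0.25, the theoretical ratio rises from 1 to 16 and the scatter around the line y = x tightens accordingly.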