Randomness, Outliers, Bias of Data

1. Random variables and vectors

Examples of Randomness

Rolling dice,
Measurements of height and weight,
Stock prices,
Number of students on campus each day,
The publication of the book “Harry Potter”,
Number of offers received from graduate school applications,
Quantum mechanics: by Heisenberg’s uncertainty principle in physics, you cannot know a particle’s exact position and its exact velocity at the same time.

The results in the examples above are not deterministic.

Some uncontrolled effects influence the final results.
Randomness really exists: God likes to roll dice.

Random variables and random vectors

Discrete random variables, e.g. the results of rolling dice.

\(X \in \{1, 2, 3, 4, 5, 6\}\)

Continuous random variables, e.g. the results of weight measurements, stock prices, etc.

\(X=\) 70.2, 69.5, 70.1, 70.0, 69.8, …

Random vectors

\(X=(x_1,x_2,\ldots,x_p)\)

Deterministic information contained in the random events

Probability of the results of the random events.
Average result of the random events.
Variation of the results of the random events.
Dependent relationship among the random events.
Dynamic law behind the random event generative process.
Distribution and density function of the results of random events.
- Probability function for discrete random events, \[p(k)=P(X=k), \quad k=1,2,\ldots,6,\] with distribution function \(F(k)=P(X\le k)\).
- Distribution function and density function for continuous random events, e.g. \[ F(70.2)=P(X\le 70.2), \qquad f(70)=F'(70). \]
- Normal distribution and multivariate normal distribution, determined by the mean and the variance (or covariance matrix) of the results of the random events.
- Heavy-tailed distributions: results with large absolute values are more likely to occur than under a normal distribution.
- Skewed distributions: the density function is not symmetric about its center or mean.
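For the discrete case, the probabilities and the distribution function of a fair die can be tabulated directly. The following is a minimal sketch, assuming a fair six-sided die; the seed and the number of simulated rolls are arbitrary choices made only for illustration.

```r
# Fair six-sided die (assumption: every face has probability 1/6)
faces <- 1:6
pmf <- rep(1 / 6, 6)      # P(X = k)
cdf <- cumsum(pmf)        # F(k) = P(X <= k)

# Compare with relative frequencies from simulated rolls
set.seed(1)               # arbitrary seed, only for reproducibility
rolls <- sample(faces, size = 6000, replace = TRUE)
emp <- as.numeric(table(factor(rolls, levels = faces))) / length(rolls)

print(data.frame(face = faces, pmf = pmf, cdf = cdf, empirical = emp))
```

The empirical column should be close to 1/6 for each face, and it gets closer as the number of rolls grows.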

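    # Draw 1000 values from N(70, 1) and plot them on the density scale
    x <- rnorm(1000, mean=70, sd=1)
    hist(x, freq=FALSE, nclass=20)

    # Kernel density estimate of the sample (blue solid line)
    f_x <- density(x)
    lines(f_x$x, f_x$y, col=4)

    # True normal density for comparison (red dashed line)
    f_r <- dnorm(sort(x), mean=70, sd=1)
    lines(sort(x), f_r, col=2, lty=2)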
    
     library(mvtnorm)
     
     library(ggplot2)

     sigma <- matrix(c(1,0,0,1), ncol=2)
     
     x <- rmvnorm(n=2000, mean=c(1,2), sigma=sigma)
     
     colMeans(x)

## [1] 1.013762 2.014620

     var(x)

##              [,1]         [,2]
## [1,]  0.964431207 -0.004244137
## [2,] -0.004244137  1.017713166

     x<-as.data.frame(x)

     str(x)

## 'data.frame':    2000 obs. of  2 variables:
##  $ V1: num  1.759 0.109 0.842 1.76 1.436 ...
##  $ V2: num  2.53 1.04 3.64 2.06 1.96 ...

     ggplot(x, aes(x=V1, y=V2))+
     geom_point(alpha = .2) +
     geom_density_2d()+
     theme_bw()

     sigma <- matrix(c(1,0.75,0.75,1), ncol=2)
     x <- rmvnorm(n=2000, mean=c(1,2), sigma=sigma)
     colMeans(x)

## [1] 0.9796303 2.0085004

     var(x)

##           [,1]      [,2]
## [1,] 1.0322964 0.7684658
## [2,] 0.7684658 1.0261408

     x<-as.data.frame(x)
     str(x)

## 'data.frame':    2000 obs. of  2 variables:
##  $ V1: num  2.336 0.125 0.442 1.423 0.133 ...
##  $ V2: num  4.493 0.948 1.258 2.625 1.513 ...

     ggplot(x, aes(x=V1, y=V2))+
     geom_point(alpha = .2) +
     geom_density_2d()+
     theme_bw()

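    # Heavy tails: draw 1000 values from a t distribution with 10 degrees of freedom
    x <- rt(1000, df=10)
    hist(x, freq=FALSE, nclass=40)

    # Kernel density estimate of the sample (blue solid line)
    f_x <- density(x)
    lines(f_x$x, f_x$y, col=4)

    # True t density (blue dashed line)
    f_r <- dt(sort(x), df=10)
    lines(sort(x), f_r, col=4, lty=2)

    # Standard normal density for comparison (thick red dotted line)
    f_r <- dnorm(sort(x), mean=0, sd=1)
    lines(sort(x), f_r, col=2, lty=3, lwd=3)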

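    # Skewness: draw 1000 values from a chi-square distribution with 3 degrees of freedom
    x <- rchisq(1000, df=3)
    hist(x, freq=FALSE, nclass=40)

    # Kernel density estimate of the sample (blue solid line)
    f_x <- density(x)
    lines(f_x$x, f_x$y, col=4)

    # True chi-square density (blue dashed line)
    f_r <- dchisq(sort(x), df=3)
    lines(sort(x), f_r, col=4, lty=2)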
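The heavy-tail and skewness properties can also be checked numerically rather than by eye. The sketch below compares exact tail probabilities of the t distribution with 10 degrees of freedom against the standard normal, and compares the mean and median of the chi-square distribution with 3 degrees of freedom. The cutoff of 3 is an arbitrary choice for illustration.

```r
# Heavy tail: probability of a value more than 3 units away from zero
p_normal <- 2 * pnorm(-3)          # standard normal
p_t10    <- 2 * pt(-3, df = 10)    # t distribution with 10 degrees of freedom
tail_ratio <- p_t10 / p_normal     # how much more likely under the t distribution

# Skewness: for a right-skewed distribution the mean exceeds the median
mean_chisq   <- 3                      # the mean of chi-square(df) equals df
median_chisq <- qchisq(0.5, df = 3)    # exact median from the quantile function

print(c(p_normal = p_normal, p_t10 = p_t10, tail_ratio = tail_ratio))
print(c(mean = mean_chisq, median = median_chisq))
```

A tail ratio above 1 says that extreme values are more likely under the t distribution, and a median below the mean reflects the long right tail of the chi-square distribution.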

Summary statistics of random observations \(x_1, x_2,\ldots,x_n \sim F(x)\).
- Sample mean \(\bar{x}=\frac{1}{n} \sum\limits_{i=1}^n x_i\)
- Median \(x_{(n/2)}\), the middle value of the sorted sample, with half of the sample larger than it and half of the sample smaller than it.
- Sample variance \(\hat{\sigma}^2_x=\frac{1}{n} \sum\limits_{i=1}^n (x_i-\bar{x})^2\), standard deviation \(\hat{\sigma}_x\). Note that R's var() and sd() functions divide by \(n-1\) instead of \(n\); the difference is negligible when \(n\) is large.
- For approximately Gaussian data, when the sample size \(n\) is large enough, a new observation \(x_{new}\) satisfies \[ P(\bar{x}-2\hat{\sigma}_x \le x_{new} \le \bar{x}+2 \hat{\sigma}_x )=P\left(-2\le\frac{x_{new}-\bar{x}}{\hat{\sigma}_x}\le 2\right) \approx 0.95\]
- IQR (interquartile range, the difference between the 75th percentile and the 25th percentile of a sample). For a Gaussian distribution, \[\hat{\sigma}_x \approx 0.7413\, \mathrm{IQR}=\mathrm{IQR}/1.349 \]
- MAD (median absolute deviation) \[ \mathrm{MAD} = \mathrm{median}(|x_i-\mathrm{median}(x_1,x_2,\ldots,x_n)|, i=1,\ldots,n)\] For a Gaussian distribution, \[\hat{\sigma}_x \approx 1.4826\, \mathrm{MAD}\]
- Sample covariance of \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) \[ \widehat{\mathrm{Cov}}(x,y)=\frac{1}{n}\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})\]
- Sample correlation \(\hat{\rho}_{xy}\) \[\hat{\rho}_{xy}=\frac{\widehat{\mathrm{Cov}}(x,y)}{\hat{\sigma}_x\hat{\sigma}_y}\]
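The summary statistics in this list can be computed on simulated data to see that the three estimates of the standard deviation agree for Gaussian samples. This is a minimal sketch: the second variable y, the seed and the sample size are assumptions made only for illustration.

```r
set.seed(1)                                 # arbitrary seed for reproducibility
x <- rnorm(2000, mean = 70, sd = 1)         # Gaussian sample
y <- 0.75 * x + rnorm(2000, sd = 0.5)       # hypothetical variable correlated with x

x_bar <- mean(x)
x_med <- median(x)

# Three estimates of the standard deviation of x
sd_n   <- sqrt(mean((x - x_bar)^2))         # 1/n version used in the formulas
sd_iqr <- IQR(x) / 1.349                    # from the interquartile range
sd_mad <- mad(x)                            # mad() already multiplies by 1.4826

# Share of observations within two standard deviations of the mean
coverage <- mean(abs(x - x_bar) <= 2 * sd_n)

# Sample covariance and correlation with the 1/n convention
cov_n <- mean((x - x_bar) * (y - mean(y)))
rho   <- cov_n / (sd_n * sqrt(mean((y - mean(y))^2)))

print(c(mean = x_bar, median = x_med, sd_n = sd_n, sd_iqr = sd_iqr, sd_mad = sd_mad))
print(c(coverage = coverage, cov = cov_n, rho = rho))
```

The correlation does not depend on whether the 1/n or the 1/(n-1) convention is used, because the factor cancels, so rho matches the value returned by cor(x, y).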
```
   sigma <- matrix(c(1,0,0,1), ncol=2)
   x <- rmvnorm(n=2000, mean=c(1,2), sigma=sigma)
   x<-as.data.frame(x)
   sigma <- matrix(c(1,0.5,0.5,1), ncol=2)
   z <- rmvnorm(n=2000, mean=c(1,2), sigma=sigma)
   z<-as.data.frame(z)
   sigma <- matrix(c(1,0.75,0.75,1), ncol=2)
   y <- rmvnorm(n=2000, mean=c(1,2), sigma=sigma)
   y<-as.data.frame(y)
   sigma <- matrix(c(1,-0.75,-0.75,1), ncol=2)
   w <- rmvnorm(n=2000, mean=c(1,2), sigma=sigma)
   w<-as.data.frame(w)
```
```
   par(mfrow=c(2,2), oma=c(0,0,0,0)) 
   plot(x)
   plot(z)
   plot(y)
   plot(w)
```

2. Outliers

In statistics or data science, an outlier is a data point that differs significantly from other observations. Outliers are values at the extreme ends of a dataset.
An outlier may be due to variability in the measurement, or it may indicate experimental error. Some outliers represent true values from natural variation in the population. Other outliers may result from incorrect data entry, equipment malfunctions, or other measurement errors. The latter are sometimes excluded from the data set.
An outlier can cause serious problems in data analyses.
True outliers. Example:

You measure 100-meter running times for a representative sample of 560 college students. Your data are normally distributed with a couple of outliers on either end. Most values are centered around the middle, as expected. But these extreme values also represent natural variations because a variable like running time is influenced by many other factors.

True outliers are also present in variables with skewed distributions where many data points are spread far from the mean in one direction. It’s important to select appropriate statistical tests or measures when you have a skewed distribution or many outliers.

Other outliers, which do not represent true values, can come from many possible sources:
- Measurement errors
- Data entry or processing errors
- Unrepresentative sampling

Example: Measurement error

You repeat your running time measurements for a new sample. For one of the participants, you accidentally start the timer midway through their sprint. You record this timing as their running time. This data point is a big outlier in your dataset because it’s much lower than all of the other times.

Example: Distortion of results due to outliers

You calculate the average running time for all participants using your data. The average is much lower when you include the outlier compared to when you exclude it. Your standard deviation also increases when you include the outlier, so your statistical power is lower as well.
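The distortion described here is easy to reproduce. The sketch below uses hypothetical running times in seconds (not data from the text), with one mistimed value, and compares the mean and standard deviation with the median and MAD.

```r
# Hypothetical 100-meter times in seconds; the last value is a mistimed run
times <- c(13.2, 14.1, 12.8, 13.9, 14.5, 13.4, 12.9, 14.0, 13.6, 6.1)
clean <- times[times > 10]    # the same data without the mistimed value

summary_table <- data.frame(
  data   = c("with outlier", "without outlier"),
  mean   = c(mean(times), mean(clean)),
  sd     = c(sd(times), sd(clean)),
  median = c(median(times), median(clean)),
  mad    = c(mad(times), mad(clean))
)
print(summary_table)
```

The mean and standard deviation change noticeably when the mistimed run is included, while the median and MAD move much less. This is why robust summaries are often preferred when outliers are suspected.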

Four ways to find Outliers

You can choose from several methods to detect outliers depending on your time and resources.

Sorting method

You can sort quantitative variables from low to high and scan for extremely low or extremely high values. Flag any extreme values that you find.

This is a simple way to check whether you need to investigate certain data points before using more sophisticated methods.

Example: Sorting method

Your dataset for a pilot experiment consists of 8 values.

180 156 9 176 163 1827 166 171

You sort the values from low to high and scan for extreme values.

9 156 163 166 171 176 180 1827

The values 9 and 1827 are far from the rest and should be flagged for investigation.
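In R, the sorting method takes one or two lines. This is a minimal sketch using the eight pilot values as they are first listed in this example.

```r
# Pilot data from the sorting example
x <- c(180, 156, 9, 176, 163, 1827, 166, 171)

sorted <- sort(x)       # ascending order
extremes <- range(x)    # smallest and largest value, the first candidates to check

print(sorted)
print(extremes)
```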

Using visualizations

You can use software to visualize your data with a box plot, or a box-and-whisker plot, so you can see the data distribution at a glance. This type of chart highlights minimum and maximum values (the range), the median, and the interquartile range for your data.

Many computer programs mark an outlier on a box plot with an asterisk or a separate point, and these marks lie beyond the whiskers of the plot.
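A box plot of this kind can be drawn with base R. The sketch below reuses the eight pilot values from the sorting example. boxplot.stats() reports the points that boxplot() would draw beyond the whiskers, using its default rule of 1.5 times the box length.

```r
# Pilot data from the sorting example
x <- c(180, 156, 9, 176, 163, 1827, 166, 171)

# Draw the box plot; points beyond the whiskers are plotted individually
boxplot(x, horizontal = TRUE)

# The same points, returned as numbers
flagged <- boxplot.stats(x)$out
print(flagged)
```

Because one value is so large, the box itself is squeezed near the left edge of the plot. This squeezing is itself a visual hint that the extreme value needs checking.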

Statistical outlier detection

Statistical outlier detection involves applying statistical tests or procedures to identify extreme values.

By standardizing the data, you can convert extreme data points into z scores that tell you how many standard deviations away they are from the mean.

If a value has a high enough or low enough z score, it can be considered an outlier. As a rule of thumb, values with a z score greater than 3 or less than –3 are often determined to be outliers.
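The z score rule can be sketched as follows. The data are hypothetical: thirty ordinary values and one extreme value, chosen only for illustration.

```r
# Hypothetical measurements: thirty ordinary values and one extreme value
x <- c(rep(c(9, 10, 11), times = 10), 30)

# Standardize: how many standard deviations each value is from the mean
z <- (x - mean(x)) / sd(x)

# Rule of thumb: flag values with |z| > 3
outliers <- x[abs(z) > 3]
print(outliers)
```

The rule works best with reasonably large samples. In a very small sample a single extreme value inflates the standard deviation so much that its own z score can stay below 3.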

Using the interquartile range

The interquartile range (IQR) tells you the range of the middle half of your dataset. You can use the IQR to create “fences” around your data and then define outliers as any values that fall outside those fences.

This method is helpful if you have a few values on the extreme ends of your dataset, but you aren’t sure whether any of them might count as outliers.

Interquartile range method

Sort your data from low to high
Identify the first quartile (Q1), the median, and the third quartile (Q3).
Calculate your IQR = Q3 – Q1
Calculate your upper fence = Q3 + (1.5 * IQR)
Calculate your lower fence = Q1 – (1.5 * IQR)
Use your fences to highlight any outliers, all values that fall outside your fences.

Your outliers are any values greater than your upper fence or less than your lower fence.

Example: Using the interquartile range to find outliers

We’ll walk you through the popular IQR method for identifying outliers using a step-by-step example.

Your dataset has 11 values. You have a couple of extreme values in your dataset, so you’ll use the IQR method to check whether they are outliers.

25 37 24 28 35 22 31 53 41 64 29

Step 1: Sort your data from low to high

First, you’ll simply sort your data in ascending order.

22 24 25 28 29 31 35 37 41 53 64

Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)

The median is the value exactly in the middle of your dataset when all values are ordered from low to high.

Since you have 11 values, the median is the 6th value. The median value is 31.

22 24 25 28 29 31 35 37 41 53 64

Next, we’ll use the exclusive method for identifying Q1 and Q3. This means we remove the median from our calculations.

The Q1 is the value in the middle of the first half of your dataset, excluding the median. The first quartile value is 25.

22 24 25 28 29

Your Q3 value is in the middle of the second half of your dataset, excluding the median. The third quartile value is 41.

35 37 41 53 64

Step 3: Calculate your IQR

The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the IQR.

\[\mathrm{IQR} = Q3 - Q1=41-25=16\]

Step 4: Calculate your upper fence

The upper fence is the boundary around the third quartile. It tells you that any values exceeding the upper fence are outliers.

\[\mathrm{Upper\ fence}=Q3+1.5 \times \mathrm{IQR}=41+1.5 \times 16=41+24=65\]

Step 5: Calculate your lower fence

The lower fence is the boundary around the first quartile. Any values less than the lower fence are outliers.

\[\mathrm{Lower\ fence}=Q1-1.5 \times \mathrm{IQR}=25-1.5 \times 16=25-24=1\]

Step 6: Use your fences to highlight any outliers

Go back to your sorted dataset from Step 1 and highlight any values that are greater than the upper fence or less than your lower fence. These are your outliers.

Upper fence = 65
Lower fence = 1

22 24 25 28 29 31 35 37 41 53 64

No value falls outside the fences. The largest value, 64, lies just below the upper fence of 65, so the IQR method does not flag it as an outlier.
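The six steps can be sketched in R as follows. The helper names exclusive_quartiles and iqr_outliers are hypothetical, introduced only for this illustration. They implement the exclusive method from Step 2, where the median is dropped before the two halves are formed.

```r
# Quartiles by the exclusive method: split the sorted data into a lower and an
# upper half (dropping the median when n is odd) and take the median of each half
exclusive_quartiles <- function(x) {
  s <- sort(x)
  n <- length(s)
  half <- floor(n / 2)
  lower <- s[seq_len(half)]
  upper <- s[(n - half + 1):n]
  c(Q1 = median(lower), Q3 = median(upper))
}

# Fences at k * IQR beyond the quartiles; values outside them are flagged
iqr_outliers <- function(x, k = 1.5) {
  q <- exclusive_quartiles(x)
  iqr <- unname(q["Q3"] - q["Q1"])
  fences <- c(lower = unname(q["Q1"]) - k * iqr,
              upper = unname(q["Q3"]) + k * iqr)
  list(quartiles = q, iqr = iqr, fences = fences,
       outliers = x[x < fences["lower"] | x > fences["upper"]])
}

# The 11 values from the example
x <- c(25, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29)
res <- iqr_outliers(x)
print(res)
```

R's built-in quantile() and IQR() functions interpolate between order statistics by default, so they may return slightly different quartiles and fences than this hand calculation.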

Dealing with outliers

Once you’ve identified outliers, you’ll decide what to do with them. Your main options are retaining or removing them from your dataset.

For each outlier, think about whether it’s a true value or an error before deciding.

Does the outlier line up with other measurements taken from the same participant? Is this data point completely impossible or can it reasonably come from your population? What’s the most likely source of the outlier? Is it a natural variation or an error? In general, you should try to accept outliers as much as possible unless it’s clear that they represent errors or bad data.

Retain outliers

Just like with missing values, the most conservative option is to keep outliers in your dataset. Keeping outliers is usually the better option when you’re not sure if they are errors.

With a large sample, outliers are expected and more likely to occur. But each outlier has less of an impact on your results when your sample is large enough. The central tendency and variability of your data won’t be as affected by a couple of extreme values when you have a large number of values.

If you have a small dataset, you may also want to retain as much data as possible to make sure you have enough statistical power. If your dataset ends up containing many outliers, you may need to use a statistical test that’s more robust to them. Non-parametric statistical tests perform better for these data.

Remove outliers

Outlier removal means deleting extreme values from your dataset before you perform analyses. You aim to delete any dirty data while retaining true extreme values.

It’s a tricky procedure because it’s often impossible to tell the two types apart for sure. Deleting true outliers may lead to a biased dataset and an inaccurate conclusion.

For this reason, you should only remove outliers if you have legitimate reasons for doing so. It’s important to document each outlier you remove and your reasons so that other researchers can follow your procedures.

3. Bias in Data

Types of data bias: Though not exhaustive, this list contains common examples of data bias in the field, along with examples of where it occurs.

Sample bias: Sample bias occurs when a dataset does not reflect the realities of the environment in which a model will run. An example of this is certain facial recognition systems trained primarily on images of white men. These models have considerably lower levels of accuracy with women and people of different ethnicities. Another name for this bias is selection bias.
Exclusion bias: Exclusion bias is most common at the data preprocessing stage. Most often it’s a case of deleting valuable data thought to be unimportant. However, it can also occur due to the systematic exclusion of certain information. For example, imagine you have a dataset of customer sales in America and Canada. 98% of the customers are from America, so you choose to delete the location data thinking it is irrelevant. However, this means your model will not pick up on the fact that your Canadian customers spend two times more.
Measurement bias: This type of bias occurs when the data collected for training differs from that collected in the real world, or when faulty measurements result in data distortion. A good example of this bias occurs in image recognition datasets, where the training data is collected with one type of camera, but the production data is collected with a different camera. Measurement bias can also occur due to inconsistent annotation during the data labeling stage of a project.
Recall bias: This is a kind of measurement bias, and is common at the data labeling stage of a project. Recall bias arises when you label similar types of data inconsistently. This results in lower accuracy. For example, let’s say you have a team labeling images of phones as damaged, partially-damaged, or undamaged. If someone labels one image as damaged, but a similar image as partially damaged, your data will be inconsistent.
Observer bias: Also known as confirmation bias, observer bias is the effect of seeing what you expect to see or want to see in data. This can happen when researchers go into a project with subjective thoughts about their study, either conscious or unconscious. We can also see this when labelers let their subjective thoughts control their labeling habits, resulting in inaccurate data.
Racial bias: Though not data bias in the traditional sense, this still warrants mentioning due to its prevalence in AI technology of late. Racial bias occurs when data skews in favor of particular demographics. This can be seen in facial recognition and automatic speech recognition technology which fails to recognize people of color as accurately as it does caucasians.
Association bias: This bias occurs when the data for a machine learning model reinforces and/or multiplies a cultural bias. Your dataset may have a collection of jobs in which all men are doctors and all women are nurses. This does not mean that women cannot be doctors, and men cannot be nurses. However, as far as your machine learning model is concerned, female doctors and male nurses do not exist. Association bias is best known for creating gender bias.
Survivorship bias: It’s easier to focus on the winners rather than the runners-up. If you think back to your favorite competition from the 2016 Olympics, it’s probably pretty tough to recall who got the silver and bronze. Survivorship bias leads us to focus on the characteristics of winners because the other samples are less visible, which confuses our ability to tell correlation from causation.
Availability Bias: Availability of data has a big influence on how we view the world—but not all data is investigated and weighed equally. Have you ever found yourself wondering if crime has increased in your neighborhood because you’ve seen a broken car window? You’ve seen a vivid clue that something might be going on, but since you probably didn’t go on to investigate crime statistics, it’s likely that your perception shifted based on the immediately available information.
Historical data bias: Historical data bias occurs when socio-cultural prejudices and beliefs are mirrored into systematic processes. This becomes particularly challenging when data from historically biased sources are used to train machine learning models. For example, if manual systems give certain groups of people poor credit ratings, and you’re using that data to train the automatic system, the automatic system will replicate and may amplify the original system’s biases.
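The exclusion bias example from the list above can be made concrete with a small table. The numbers are hypothetical: 98 American customers who each spend 100 and 2 Canadian customers who each spend 200, chosen only to match the "98%" and "two times more" in the description.

```r
# Hypothetical sales data: 98 American and 2 Canadian customers
sales <- data.frame(
  country = c(rep("US", 98), rep("Canada", 2)),
  spend   = c(rep(100, 98), rep(200, 2))
)

# Without the location column, only the overall average is visible
overall_mean <- mean(sales$spend)

# With the location column, the difference between the groups appears
by_country <- tapply(sales$spend, sales$country, mean)

print(overall_mean)
print(by_country)
```

The overall average is almost identical to the American average, so once the location column is deleted nothing in the data reveals that Canadian customers spend twice as much.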

Cognitive Biases

How do I avoid data bias in Analysis?

The prevention of data bias in learning is an ongoing process. Though it is sometimes difficult to know when your machine learning algorithm, data or model is biased, there are a number of steps you can take to help prevent bias or catch it early. Though far from a comprehensive list, the bullet points below provide an entry-level guide for thinking about data bias for data analysis.

To the best of your ability, research your users in advance. Be aware of your general use-cases and potential outliers.
Ensure your team of data scientists and data labelers is diverse.
Where possible, combine inputs from multiple sources to ensure data diversity.
Create a gold standard for your data labeling. A gold standard is a set of data that reflects the ideal labeled data for your task. It enables you to measure your team’s annotations for accuracy.
Make clear guidelines for data labeling expectations so data labelers are consistent.
Use multi-pass annotation for any project where data accuracy may be prone to bias. Examples of this include sentiment analysis, content moderation, and intent recognition.
Enlist the help of someone with domain expertise to review your collected and/or annotated data. Someone from outside of your team may see biases that your team has overlooked.
Analyze your data regularly. Keep track of errors and problem areas so you can respond to and resolve them quickly. Carefully analyze data points before making the decision to delete or keep them.
Make bias testing a part of your development cycle.
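One of the steps above, measuring annotations against a gold standard, can be sketched in a few lines. The labels, the two annotators and the helper name accuracy are hypothetical, made up only for illustration.

```r
# Hypothetical gold-standard labels for 8 images of phones
gold <- c("damaged", "undamaged", "partially-damaged", "damaged",
          "undamaged", "undamaged", "partially-damaged", "damaged")

# Hypothetical labels from two annotators for the same images
annotator_a <- c("damaged", "undamaged", "partially-damaged", "damaged",
                 "undamaged", "undamaged", "damaged", "damaged")
annotator_b <- c("partially-damaged", "undamaged", "partially-damaged", "damaged",
                 "undamaged", "damaged", "damaged", "damaged")

# Share of labels that agree with the gold standard
accuracy <- function(labels, reference) mean(labels == reference)
acc_a <- accuracy(annotator_a, gold)
acc_b <- accuracy(annotator_b, gold)
print(c(annotator_a = acc_a, annotator_b = acc_b))

# Cross-tabulation shows which categories an annotator confuses
print(table(gold = gold, annotator_b = annotator_b))
```

A low accuracy, or a table with many off-diagonal counts in one cell, suggests that the labeling guidelines for those categories need to be clarified.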

4. Signal to Noise Ratio

The signal-to-noise ratio (SNR) is the ratio between the power of the desired signal, which carries the information of interest, and the power of the background noise, which is the undesired signal.
SNR is used as a measurement parameter throughout science and engineering to compare the level of the desired signal to the level of background noise.

     sigma <- 1
     x <- rnorm(n=2000, mean=0, sd=1)
     error <- rnorm(2000, mean=0, sd=1)
     y1 <-x+error
     error <- rnorm(2000, mean=0, sd=0.75)
     y2 <-x+error
     error <- rnorm(2000, mean=0, sd=0.5)
     y3 <-x+error
     error <- rnorm(2000, mean=0, sd=0.25)
     y4 <-x+error
     
     par(mfrow=c(2,2), oma=c(0,0,0,0)) 
     plot(x,y1)
     plot(x,y2)
     plot(x,y3)
     plot(x,y4)
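The SNR of each of the four simulated settings can be estimated as the ratio of the signal variance to the noise variance, and it is often reported in decibels as 10 times the base-10 logarithm of that ratio. The sketch below repeats the simulation with the same noise levels. The seed is an arbitrary choice, and using the variance as the measure of power is an assumption that suits zero-mean signals like these.

```r
set.seed(2022)                        # arbitrary seed, only for reproducibility
n <- 2000
x <- rnorm(n, mean = 0, sd = 1)       # signal with variance 1
noise_sd <- c(1, 0.75, 0.5, 0.25)     # the four noise levels used in the plots

# Estimated SNR: signal variance divided by noise variance
snr <- sapply(noise_sd, function(s) {
  error <- rnorm(n, mean = 0, sd = s)
  var(x) / var(error)
})

snr_table <- data.frame(
  noise_sd   = noise_sd,
  snr_theory = 1 / noise_sd^2,        # true ratio, since the signal variance is 1
  snr_sample = snr,
  snr_db     = 10 * log10(snr)        # the same ratio in decibels
)
print(snr_table)
```

As the noise standard deviation shrinks, the SNR grows and the scatter of y around x tightens, which is what the four panels show.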

Aside from the technical definition, SNR can also be explained through a comparison. For example, say that you and one other person are inside a large room having a conversation. However, the room is full of other people who are also having conversations. Furthermore, a few of the other individuals have voice patterns similar to yours and to those of the person you are talking to. As you can imagine, it would be difficult to decipher which person is saying what.
Noise = random noise + redundant information

PENG Heng

2022-09-15
