Hypothesis testing

Wheel of Destiny Question

“Back before COVID, Stella McStat had been running a small-time gambling operation on campus for several months. For a dollar each, Stella sells one red and one black ticket for each spin of a wheel (i.e., total $2), then she spins the Wheel of Destiny. The person who holds the colour where the spinner stops gets $1.75 (Stella keeps $0.25 per spin for running the game and providing snacks).

Stella is setting up a new online gambling operation (streaming on Twitch) and just bought a new spinner, the critical piece of equipment for this game. Before she begins using this spinner, she wants to make sure that it is, in fact, fair; she wants both colours to come up equally often. Because of the set-up of the game, Stella has no incentive to cheat and wants the game to be as fair as possible.”

Test the spinner

Let’s say you spun the spinner 50 times and got red 32 times.

Based on our observed data, can you answer Stella’s question? Is the Wheel of Destiny fair?

We need to compare our observed value to something: the distribution of what we would expect to see if the spinner really was fair…

Core Question: If the spinner was fair, out of 50 spins, what number of reds would be usual/unusual?

Histogram: Suppose you spin the spinner 50 times and record the number of reds, then repeat this 100 times (100 sets of 50 spins, with the number of reds recorded for each set). What would the histogram of those 100 counts look like?

Generalize the problem

If the spinner is fair, then the proportion of red in 50 spins behaves like the proportion of heads in 50 flips of a fair coin (both are 50/50 processes).
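A minimal sketch of this thought experiment in base R, assuming the fair spinner behaves like a fair coin. `rbinom()` draws the number of reds in each set of 50 spins directly; the seed and the names `n_sets` and `reds_per_set` are illustrative choices, not part of the lecture.

```r
# Assumption: a fair spinner lands on red with probability 0.5 on every spin.
set.seed(130)          # arbitrary seed, only so the sketch is reproducible
n_spins <- 50          # spins per set
n_sets  <- 100         # number of repeated sets

# Number of reds in each of the 100 sets of 50 spins
reds_per_set <- rbinom(n_sets, size = n_spins, prob = 0.5)

# Histogram of the counts, one bin per possible count;
# the dashed line marks the observed 32 reds
hist(reds_per_set,
     breaks = seq(-0.5, n_spins + 0.5, by = 1),
     main = "Reds in 50 spins of a fair spinner (100 sets)",
     xlab = "Number of reds")
abline(v = 32, lty = 2)
```

Where the dashed line falls relative to the bulk of the histogram gives an informal sense of how unusual 32 reds would be for a fair spinner.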

Possible explanations for what is observed

The new Wheel of Destiny is fair, and the observed 32/50 reds occurred by chance.

The new Wheel of Destiny is not fair.

Where do things stand now?

Can we conclude for certain whether the new spinner is fair or unfair?

Can we calculate the probability that the new spinner is fair?

Can we calculate the probability that the new spinner is unfair?

Answer Stella’s question

Conclusion A: Our sample results are consistent with the results we would observe if the spinner is fair.

Conclusion B: Our sample results are not consistent with the results we would observe if the spinner is fair. In other words, we have evidence against the hypothesis that the spinner is fair.

Statistical Inference

Statistical inference - the process of coming to conclusions or decisions based on statistical information (information subject to randomness and uncertainty)
- The conclusions are uncertain (we can’t be 100% sure that they represent the truth because the given information is incomplete)
Ex. Making a conclusion about a population using data from a random sample
Parameter ( $p$ or $\pi$ ) - a number that describes the population (for the population of interest)
Statistic - a number that describes the sample (changes from sample to sample)

Ex. Sample mean, median, variance, etc.
Test statistic ( $\hat{p}$ or $\hat{\pi}$ ) - a special statistic used to assess whether the data are compatible with $H_{0}$

Wheel of Destiny:

| Term | Wheel of Destiny example |
| :-------------------- | ---------------------------------------------------- |
| Population | All spins of the Wheel of Destiny |
| Parameter | The true (long-run) probability that the spinner lands on red |
| Sample | The 50 observed spins |
| Test statistic | The sample proportion of red in the 50 spins |

Sampling distribution of a statistic (here, a proportion) - the distribution of the values the statistic takes over all possible samples of the same size ( $n$ ) from the same population

Hypothesis Testing

Hypothesis Testing is one type of statistical inference
There is always an element of chance in any sampling. What we are curious about is if chance is acting alone or if what we see is due to chance AND something else.
Sometimes statistical inference is not appropriate

Ex: if we have data for ALL individuals in the population, there is nothing to infer.
Null hypothesis - use the notation $H_{0}$ (said as “aitch naught”), the default setting
- There is not enough evidence against the claim that the defendant is innocent (think of the null hypothesis as the ‘boring’, ‘status quo’, or ‘nothing-going-on’ option)
- There is no difference in…
- $H_{0}: \text{parameter} = \text{value}$ ( $H_{0}$ is a label; don’t put an equals sign after it)
- p-value - the probability of observing data that are at least as unusual (or at least as extreme) as the sample data, under the assumption that $H_{0}$ is true
  - Estimate it as the proportion of values in the estimated sampling distribution that are as extreme as or more extreme than the test statistic calculated from the observed sample data
Alternative hypothesis - use the notation $H_{1}$ , covers everything that is not the value in the null
- There is enough evidence against the claim that the defendant is innocent
- There is a difference in…
- $H_{1}: \text{parameter} \neq \text{value}$

Steps for Hypothesis Testing (one proportion)

State hypotheses $H_{0}$ and $H_{1}$
Calculate a test statistic based on the observed sample data

Simulate samples under the null hypothesis, and calculate the statistic for each one (estimate sampling distribution)

Goal: Explore the distribution of values of the statistic if $H_{0}$ is true. What kinds of results are common under the null? What is unusual?

Simulation: a way to explore random events (use R) (assume $H_{0}$ is true)

Use the sample() function to create a sample for the simulation

simdata <- <data> %>%
	mutate(<variable> = sample(<variable>))

Set values for simulation (sample size, number of repetitions, seed)

sample(c("p1", "p2"),  # vector with all possible outcomes
       size = <int>,  # number of values in the sample
       prob = c(<dbl>, <dbl>),  # probability of each outcome (adding to 1)
                                # (default is equal probabilities)
       replace = TRUE)  # can outcomes be repeated? (default = FALSE)
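As an illustration of the template above with the placeholders filled in, here is a small base R sketch for one simulated set of 50 spins. It assumes a fair spinner; the seed and the names `spins` and `p_hat_sim` are made up for this example.

```r
# Assumption: a fair spinner, so both colours have probability 0.5.
set.seed(1)  # arbitrary seed for reproducibility
spins <- sample(c("red", "black"),
                size = 50,
                prob = c(0.5, 0.5),
                replace = TRUE)   # must be TRUE: 50 draws from only 2 outcomes

# Proportion of red in this one simulated sample
p_hat_sim <- mean(spins == "red")
p_hat_sim
```

With `replace = FALSE`, `sample()` would stop with an error here, because it cannot draw 50 values from a vector of two outcomes without repeating them.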

Use a for loop to simulate many random samples and calculate the statistic of interest for each one

set.seed(<int>)  # Set a starting point (use the same seed to reproduce the same simulation)

for (i in 1:<int>)
{
    SOME CODE (loop through sample)
}

Turn results into a data frame (tibble()) so ggplot can be used

<str> <- tibble(<variable> = mean(<sample> == "<interested_condition>"))

Plot the results

# Dotplot
<data> %>% ggplot(aes(x=<variable>)) +
	geom_dotplot() + xlim(0,<int>) + ylim(0,<int>) # <int> is upper limit

Evaluate the evidence against $H_{0}$ by calculating the p-value (the probability of a result at least as extreme as the observed sample, assuming $H_{0}$ is true)
- 2 possible reasons for a small p-value
  1. $H_{0}$ is true and the observed case is an unlikely extreme value of the statistic
  2. $H_{0}$ is not true
- The smaller the p-value, the more “evidence” there is against $H_{0}$
- P-value: count all values greater than or equal to $p + | \hat{p} - p |$ and all values less than or equal to $p - | \hat{p} - p |$ , where $p$ is the value stated in $H_{0}$ . This is a two-sided test (it considers results that are unusually large and unusually small)
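The two-sided rule above can be sketched as a small base R function. `two_sided_p_value()` is a hypothetical helper name, and the toy vector of simulated proportions is made up for illustration only; the small tolerance is an added safeguard against floating-point rounding, not something from the lecture.

```r
# Hypothetical helper: proportion of simulated statistics that are at least
# as far from the null value as the observed statistic (two-sided).
two_sided_p_value <- function(sim_stats, observed, null_value) {
  distance <- abs(observed - null_value)
  tol <- 1e-9  # tolerance so exact ties are not lost to floating-point rounding
  mean(abs(sim_stats - null_value) >= distance - tol)
}

# Toy input (made up): five simulated proportions under H0: p = 0.5
toy_stats <- c(0.36, 0.44, 0.50, 0.58, 0.64)
toy_p <- two_sided_p_value(toy_stats, observed = 0.64, null_value = 0.5)
toy_p
```

In the toy input only 0.36 and 0.64 are at least 0.14 away from 0.5, so two of the five values count as extreme.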

Make a conclusion

| p-value | Evidence |
| :-------------------- | ---------------------------------------------------- |
| 0.10 < p-value | No evidence against $H_{0}$ |
| 0.05 < p-value < 0.10 | Weak evidence against $H_{0}$ |
| 0.01 < p-value < 0.05 | Moderate evidence against $H_{0}$ (default $α$ ) |
| 0.001 < p-value < 0.01 | Strong evidence against $H_{0}$ |
| p-value < 0.001 | Very strong evidence against $H_{0}$ |
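The table can be turned into a lookup with base R's `cut()`. This is only a sketch: `evidence_strength()` is a hypothetical helper, and because the table does not say where boundary values such as exactly 0.05 belong, the sketch assigns them to the stronger category as an assumption.

```r
# Hypothetical helper that mirrors the evidence table.
# Assumption: boundary values (e.g. exactly 0.05) go to the stronger category.
evidence_strength <- function(p_value) {
  cut(p_value,
      breaks = c(0, 0.001, 0.01, 0.05, 0.10, 1),
      labels = c("Very strong", "Strong", "Moderate", "Weak", "No"),
      include.lowest = TRUE)  # so that p = 0 falls in the first interval
}

# Example p-values, including the 0.065 from the Wheel of Destiny simulation
labels <- as.character(evidence_strength(c(0.21, 0.065, 0.037, 0.007, 0.0004)))
labels
```

The returned labels read as "No", "Weak", "Moderate", "Strong" and "Very strong" evidence against $H_{0}$ , in that order.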

Statistical significance (or a statistically significant difference)
- A significance level ( $α$ ) set in advance determines the cut-off for how unusual/extreme the test statistic has to be (assuming $H_{0}$ is true) in order to reject the assumption that $H_{0}$ is true
- Reject $H_{0}$ if p-value $\leq α$
- $α$ can be chosen to be any number, but typically $α = 0.05$

It is better to report the p-value and comment on the strength of evidence against $H_{0}$ instead of only reporting whether the result is/isn’t statistically significant

| $α$ | Ex. 1 (Fail to reject) | Ex. 2 (Reject) |
| :-------------------- | ---------------------------------------------------- | ---------------------------------------------------- |
| 0.1 / 10% | P-value = 0.21; “At the 10% significance level we fail to reject $H_{0}$ ” | P-value = 0.093; “At the 10% significance level we can reject $H_{0}$ ” |
| 0.05 / 5% | P-value = 0.093; “At the 5% significance level we fail to reject $H_{0}$ ” | P-value = 0.037; “At the 5% significance level we can reject $H_{0}$ ” |
| 0.01 / 1% | P-value = 0.037; “At the 1% significance level we fail to reject $H_{0}$ ” | P-value = 0.007; “At the 1% significance level we can reject $H_{0}$ ” |
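The decision rule "reject $H_{0}$ if p-value $\leq α$ " can be sketched in a few lines of base R. `decide()` is a hypothetical helper name; the p-values are the ones used in the examples above.

```r
# Hypothetical helper: apply the rule "reject H0 if p-value <= alpha".
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value <= alpha, "reject H0", "fail to reject H0")
}

# The same p-value can lead to different decisions at different alpha levels
decide(0.093, alpha = 0.10)  # 0.093 <= 0.10
decide(0.093, alpha = 0.05)  # 0.093 >  0.05
decide(0.037, alpha = 0.05)  # 0.037 <= 0.05
decide(0.037, alpha = 0.01)  # 0.037 >  0.01
```

This is why the significance level has to be fixed in advance: picking $α$ after seeing the p-value would let the analyst choose the decision.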

We had a question about Stella’s new Wheel of Destiny: Is the new Wheel of Destiny fair?

State hypotheses

The Wheel of Destiny spinner is fair: $H_{0}: p_{\text{red}} = 0.5$

The Wheel of Destiny spinner is not fair: $H_{1}: p_{\text{red}} \neq 0.5$

Where $p_{\text{red}}$ is the proportion of spins of the new Wheel of Destiny that land on red

We observed the results of 50 spins and calculated the proportion of red outcomes
Calculate a test statistic: $\hat{p}_{\text{red}} = \frac{32}{50} = 0.64$
test_stat <- 32/50
We looked at the distribution of the proportion of red for 50 spins of a fair spinner (repeated many times); instead of a spinner, lots of students flipped a fair coin 50 times and recorded the proportion of heads (equivalent because both are 50/50 processes)
Estimated sampling distribution: assumes a 50/50 process ( $H_{0}$ )
# (0) Load packages (provides tibble(), ggplot() and %>%)
library(tidyverse)

# (1) Set values for simulation
n_observations <- 50  # number of observations (spins) per simulation
repetitions <- 1000  # number of simulations
simulated_stats <- rep(NA, repetitions)  # 1000 missing values to start

# (2) Automate the simulation with a for loop
for (i in 1:repetitions)
{
    new_sim <- sample(c("red", "black"),
                      size = n_observations,
                      prob = c(0.5, 0.5),
                      replace = TRUE)
    sim_p <- sum(new_sim == "red") / n_observations

    simulated_stats[i] <- sim_p  # add new value to vector of results
}

# (3) Turn results into a data frame
sim <- tibble(p_red = simulated_stats)

# (4) Plot results
sim %>% ggplot(aes(x = p_red)) +
	geom_histogram() +
	xlab("Proportion of red outcomes in 50 spins of a fair Wheel of Destiny \n (p_red = 0.5)")
We compared our observed value to the distribution of proportions observed from the fair spinner (or fair coin) to assess if the new spinner’s behaviour is consistent with the behaviour of a fair spinner
Calculating the p-value (we have only done this informally so far)

$H_{0}: p_{\text{red}} = 0.5$

Test statistic $\hat{p}_{\text{red}} = \frac{32}{50} = 0.64$

Count all values greater than or equal to $\hat{p}_{\text{red}} = 0.64$ and all values less than or equal to $0.5 - | \hat{p}_{\text{red}} - 0.5 | = 0.36$ .
test_stat <- 32/50
sim %>% ggplot(aes(x = p_red)) +
	geom_histogram() +
	geom_vline(xintercept = 0.5 - abs(0.5 - test_stat), color = "red") +  # lower cut-off (0.36)
	geom_vline(xintercept = 0.5 + abs(0.5 - test_stat), color = "blue") +  # upper cut-off (0.64)
	labs(x = "Simulated proportions of red outcomes from a fair spinner
         (proportions based on samples of size n=50)")
# Proportion of simulated values at least as extreme as the observed 0.64
pvalue <- sim %>%
	filter(p_red >= 0.64 | p_red <= 0.36) %>%
	summarize(p_value = n() / repetitions)
as.numeric(pvalue)

## [1] 0.065
The smaller the p-value, the more “evidence” there is against $H_{0}$
We answered Stella’s question based on the data we collected.

Make a conclusion:

Since the p-value is 0.065, we could conclude that we have weak evidence against the null hypothesis that the Wheel of Destiny spinner is fair. (Not enough evidence to reject $H_{0}$ )

The result of our testing protocol wasn’t so unusual that we would claim the spinner isn’t fair…but it is a little borderline

Errors

|  | Fail to reject $H_{0}$ | Reject $H_{0}$ |
| :-------------------- | -------------------- | -------------------- |
| $H_{0}$ is true | 👍🏼 | Type 1 error |
| $H_{1}$ is true | Type 2 error | 👍🏼 |

Type 1 error - reject $H_{0}$ when $H_{0}$ is true
- Even when we set the chance of this error to be very small (i.e. we need a very extreme/unusual observed test statistic to reject $H_{0}$ ), we could still observe a very unusual outcome and end up rejecting $H_{0}$ when we should not.
Type 2 error - fail to reject $H_{0}$ when $H_{0}$ is false (and should be rejected)
- When we don’t reject a null hypothesis (i.e. the results don’t look unusual compared to the sampling distribution assuming $H_{0}$ is true), it is still possible that $H_{0}$ is not true.
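Both kinds of error can be made concrete with a short base R simulation. This is a sketch under stated assumptions, not part of the lecture: the decision rule "reject $H_{0}$ if the number of reds in 50 spins is at least 32 or at most 18" and the unfair-spinner probability of 0.6 are illustrative choices, as are the seed and all variable names.

```r
# Assumptions (illustrative only): n = 50 spins per experiment, and the rule
# "reject H0 if the number of reds is >= 32 or <= 18".
set.seed(2024)          # arbitrary seed for reproducibility
n_spins <- 50
n_experiments <- 10000

rejects <- function(reds) reds >= 32 | reds <= 18

# Type 1 error: H0 is true (p = 0.5) but the rule rejects it anyway
reds_fair <- rbinom(n_experiments, size = n_spins, prob = 0.5)
type1_rate <- mean(rejects(reds_fair))

# Type 2 error: H0 is false (here p = 0.6, a made-up alternative)
# but the rule fails to reject it
reds_unfair <- rbinom(n_experiments, size = n_spins, prob = 0.6)
type2_rate <- mean(!rejects(reds_unfair))

c(type1 = type1_rate, type2 = type2_rate)
```

Under these assumptions a fair spinner is still rejected in a small share of experiments, and a mildly unfair spinner often goes undetected with only 50 spins.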

In second year Stats:

Determine the sampling distribution exactly by using:
Binomial probability model - used to count the number of “successes” in $n$ independent trials, where each trial has two possible outcomes: “success” with probability $p$ or “failure” with probability $(1 - p)$
- Probability of $k$ successes in $n$ trials is $\binom{n}{k} p^{k} (1 - p)^{n - k}$
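As a quick check of the formula, the sketch below applies it to the Wheel of Destiny numbers (50 spins, 32 reds, $p = 0.5$ under $H_{0}$ ) and compares it with base R's `dbinom()`. Summing the binomial probabilities over the extreme counts gives an exact two-sided p-value; treating the counts 0 to 18 and 32 to 50 as "extreme" follows the two-sided rule used earlier in these notes.

```r
n <- 50    # number of spins
k <- 32    # observed number of reds
p <- 0.5   # probability of red under H0 (fair spinner)

# Probability of exactly k reds: formula versus built-in dbinom()
manual  <- choose(n, k) * p^k * (1 - p)^(n - k)
builtin <- dbinom(k, size = n, prob = p)
all.equal(manual, builtin)

# Exact two-sided p-value: counts at least as far from 25 as 32 is
exact_p <- sum(dbinom(c(0:18, 32:50), size = n, prob = p))
exact_p
```

The exact value comes out at about 0.065, in line with the simulated p-value above.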
