Some information about the Z-test.
A Z-test is a statistical test used to determine whether two population means are different when their variances are known and the sample sizes are large. The test statistic is assumed to follow a normal distribution, and parameters such as the standard deviation should be known to perform an accurate Z-test.
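For reference, the statistic that the code in this section computes can be written as below. The notation is introduced here only for illustration: \bar{x}_1 and \bar{x}_2 are the two sample means, \sigma_1 and \sigma_2 the standard deviations, and n_1 and n_2 the sample sizes.

```latex
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
```

If the two population means are equal, z is approximately standard normal for large samples, which is why the p-value is read off the standard normal curve.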
TRAFFIC<-read.csv('https://raw.githubusercontent.com/kunal0895/RDatasets/master/TRAFFIC.csv')
summary(TRAFFIC) #gives us the statistics
# split the data into one subset per tunnel
lincoln.data <- subset(TRAFFIC, TRAFFIC$TUNNEL == "Lincoln")
holland.data <- subset(TRAFFIC, TRAFFIC$TUNNEL == "Holland")
#traffic at lincoln
#This variable is a column of 1401 rows.
lincoln.traffic <- lincoln.data$VOLUME_PER_MINUTE
#traffic at holland
#This variable is a column of 1401 rows.
holland.traffic <- holland.data$VOLUME_PER_MINUTE
# standard deviations of the two samples, in the units of VOLUME_PER_MINUTE
# (with samples this large, the sample SDs stand in for the population SDs)
sd.lincoln <- sd(lincoln.traffic)
sd.holland <- sd(holland.traffic)
# means of two samples
mean.lincoln <- mean(lincoln.traffic)
mean.holland <- mean(holland.traffic)
#length of lincoln and holland
len_lincoln <- length(lincoln.traffic)
len_holland <- length(holland.traffic)
# standard error of the difference between the two sample means
sd.lin.hol <- sqrt(sd.lincoln^2/len_lincoln + sd.holland^2/len_holland)
#z score
zeta <- (mean.lincoln - mean.holland)/sd.lin.hol
zeta
# plot the standard normal density and mark the observed z score with a red line
plot(x=seq(from = -5, to= 5, by=0.1),y=dnorm(seq(from = -5, to= 5, by=0.1),mean=0),type='l',xlab = 'z score', ylab='density')
abline(v=zeta, col='red')
# one-sided (upper-tail) p-value: P(Z > zeta)
p <- 1-pnorm(zeta)
p
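The steps above can be wrapped into a small function. The sketch below is not part of the original walkthrough: the helper name z_test_two_sample and the simulated inputs are hypothetical, and it assumes samples large enough for the sample variances to stand in for the known variances. It returns the upper-tail p-value used above together with a two-sided p-value, which is the one that matches the question "are the two means different?".

```r
# Hypothetical helper: two-sample z-test from two numeric vectors.
# Assumes large samples, so sample variances approximate the true ones.
z_test_two_sample <- function(x, y) {
  # standard error of the difference between the two sample means
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  z <- (mean(x) - mean(y)) / se
  list(
    z = z,
    p_upper = pnorm(z, lower.tail = FALSE),              # P(Z > z)
    p_two_sided = 2 * pnorm(abs(z), lower.tail = FALSE)  # P(|Z| > |z|)
  )
}

# Simulated data for illustration only (not the TRAFFIC data set)
set.seed(1)
a <- rnorm(1000, mean = 10.5, sd = 2)
b <- rnorm(1000, mean = 10, sd = 2)
res <- z_test_two_sample(a, b)
print(res)
```

With the variables from the walkthrough, the call would be z_test_two_sample(lincoln.traffic, holland.traffic).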
The seq function used in the plot call takes the starting value (from), the ending value (to), and the step size (by). R provides four functions for working with the normal distribution:
1) dnorm
2) pnorm
3) rnorm
4) qnorm
Let's look at them one by one. Each function name has two parts: a first letter (d, p, r or q) that says what the function computes, and the word 'norm', which stands for the normal distribution. R has the same four prefixes for many other distributions, but we are interested in the normal distribution.
By prefixing a "d" to the distribution name (norm in our case), you get probability density values (PDF). By prefixing a "p", you get cumulative probabilities (CDF). By prefixing a "q", you get quantile values. By prefixing an "r", you get random numbers drawn from the distribution.
dnorm(x, mean = 0, sd = 1, log = FALSE)
With the defaults mean = 0 and sd = 1, x is interpreted as a z-score.
The function dnorm returns the value of the probability density function, known as the PDF, at x (d stands for density, hence dnorm). What is a PDF?
It is a function of a continuous random variable whose integral over an interval gives the probability that the variable lies in that interval. So for each value in the sequence, dnorm gives the height of the density curve on the Y-axis. This height is a density, not a probability; only areas under the curve are probabilities. If log = TRUE, the output is the log of the density values instead of the density values themselves. The default value of log is FALSE.
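A short sketch of these properties, using only base R. The variable names are chosen here for illustration.

```r
x <- c(-1, 0, 1)
dens <- dnorm(x)                      # heights of the standard normal curve
log_dens <- dnorm(x, log = TRUE)      # same heights on the log scale
print(all.equal(log_dens, log(dens))) # TRUE

# The peak of the standard normal density is 1 / sqrt(2 * pi)
print(all.equal(dnorm(0), 1 / sqrt(2 * pi)))  # TRUE

# A density is not a probability: with a small sd it can exceed 1
narrow_peak <- dnorm(0, mean = 0, sd = 0.1)
print(narrow_peak)

# Integrating the density over an interval gives a probability
area <- integrate(dnorm, lower = -1, upper = 1)$value
print(area)  # about 0.68, the probability of falling within one SD of the mean
```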
2) The pnorm function is the cumulative distribution function, or CDF. It returns the area under the density curve to the left of the given value q, that is, P[X <= q]. Again, mean = 0 and sd = 1 by default.
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
log.p works like log in dnorm: if set to TRUE, the function returns log(p) instead of p.
lower.tail = TRUE (the default) means that the value returned is P[X <= x], the area below the given value of x. With lower.tail = FALSE, the function returns P[X > x], the area above it.
As a result, 1 - pnorm(1) is equivalent to pnorm(1, lower.tail = FALSE); both equal 0.1586553.
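The equivalence can be checked directly. The sketch below also shows one practical reason to prefer lower.tail = FALSE when the z score is very large: subtracting from 1 loses the tiny tail probability to floating-point rounding. The variable names are illustrative.

```r
upper_a <- 1 - pnorm(1)
upper_b <- pnorm(1, lower.tail = FALSE)
print(c(upper_a, upper_b))  # both about 0.1586553

# For an extreme z score the two forms differ numerically
far_a <- 1 - pnorm(10)                  # pnorm(10) rounds to 1, so this is 0
far_b <- pnorm(10, lower.tail = FALSE)  # tiny, but still greater than 0
print(c(far_a, far_b))
```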
3) qnorm takes parameters similar to those of pnorm, but it works in the opposite direction: it returns quantiles (critical values) rather than probabilities.
The idea behind qnorm is that you give it a probability, and it returns the number whose cumulative distribution matches the probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z-score.
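A minimal sketch of qnorm as the inverse of pnorm. The probabilities 0.975 and 0.05 are chosen here as common examples and do not come from the walkthrough.

```r
# z score below which 97.5% of the standard normal distribution lies
q975 <- qnorm(0.975)
print(q975)         # about 1.96
print(pnorm(q975))  # back to 0.975

# qnorm undoes pnorm (up to floating-point error)
z <- 1.5
round_trip <- qnorm(pnorm(z))
print(all.equal(round_trip, z))  # TRUE

# Upper-tail critical value: the z score with 5% of the area above it
crit_upper <- qnorm(0.05, lower.tail = FALSE)
print(crit_upper)   # about 1.645
```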
4) The last function we examine is rnorm, which generates random numbers drawn from a normal distribution. Its first argument is the number of random values you want, and it has optional arguments to specify the mean and standard deviation. If you specify a mean and SD, the numbers are drawn from a normal distribution with those parameters, so the sample mean and SD will be close to, but generally not exactly equal to, the values specified. Running rnorm(n) again with the same n will generate different values since the numbers are random, unless the random seed is fixed beforehand with set.seed.
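The behaviour of rnorm can be sketched as follows. The seed, sample size, mean and SD are arbitrary values picked for illustration.

```r
set.seed(42)                       # fix the seed so the draws are reproducible
draws <- rnorm(1000, mean = 50, sd = 10)
print(mean(draws))                 # close to 50, but not exactly 50
print(sd(draws))                   # close to 10, but not exactly 10

# Without resetting the seed, a second call gives different numbers
other <- rnorm(1000, mean = 50, sd = 10)
print(identical(draws, other))     # FALSE

# Resetting the seed reproduces the first draws exactly
set.seed(42)
again <- rnorm(1000, mean = 50, sd = 10)
print(identical(draws, again))     # TRUE
```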