Intro to Statistics
- Statistics - the art and science of learning from data. It's a way to quantify uncertainty
- Data Science - includes the computational aspects and acquiring, managing, and analyzing data
- Reasoning and computing are both important aspects in data science
Statistical Thinking
- Always be aware of the assumptions made: are they justified?
- Think about where the data came from
- Is the data a random sample (representative of the whole population)?
- Survivorship Bias - only the data that survived some selection process are left, so they are not a random sample
- Are the data and the assumptions correlated?
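A small simulation can make survivorship bias concrete. This is only a sketch: the scores, the cutoff of 60, and all variable names are made up for illustration and do not come from any real data set.

```r
# Hypothetical population: 1,000 made-up scores (assumed, not real data)
set.seed(42)  # make the random draw reproducible
population <- rnorm(1000, mean = 50, sd = 10)

# Only records above an arbitrary cutoff "survive" to be observed
survivors <- population[population > 60]

# A true random sample from the same population, for comparison
random_sample <- sample(population, size = 100)

mean(population)     # the quantity we actually want to learn about
mean(random_sample)  # typically close to the population mean
mean(survivors)      # above 60 by construction, so it overstates the population mean
```

The survivors are not a random sample, so a summary computed from them says little about the whole population.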
Introduction to R and RStudio
- Console - executes each line of code
- Source - open, view, and edit code and save for later
- Environment - list objects (variables, data frames)
- Assigning variables: use the `<-` operator
- Comment using the `#` symbol in R
- Element - a single piece of data; can be one of several data types:
Data Type | R Syntax | Description | Examples |
---|---|---|---|
Integer | int | Numbers (integers) | -4, -2, 1, 3 |
Double | dbl | Numbers (floats) | 2, 2.02, 22222 |
Logical | lgl | Booleans | TRUE, FALSE |
Character | chr | Strings | “I”, “I love stats” |
Factor | fct | Categories drawn from a prespecified list of levels | “China” from a list of Asian countries |
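A minimal sketch of assignment, comments, and the data types in the table. The variable names and the factor levels are made up for illustration; the short codes in the table (`int`, `dbl`, ...) are the abbreviations tibbles print, while `typeof()` and `class()` report the full names.

```r
# Anything after a # is a comment and is ignored by R
x_int <- 3L              # the L suffix makes an integer
x_dbl <- 2.02            # numbers are doubles by default
x_lgl <- TRUE
x_chr <- "I love stats"
x_fct <- factor("China", levels = c("China", "Japan", "India"))

typeof(x_int)   # "integer"
typeof(x_dbl)   # "double"
typeof(x_lgl)   # "logical"
typeof(x_chr)   # "character"
class(x_fct)    # "factor"
```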
- Vector - the simplest data structure; made by grouping values of the same data type
  - `c()` combines single elements into a vector
  - Use the `is.*()` functions to check the data type [`is.numeric()`, `is.character()`]
  - Similar to a list?
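A short sketch of building vectors and checking their type, with made-up values and names. On the question of lists: in R a list is a separate structure that can hold elements of different types, whereas a vector made with `c()` holds a single type.

```r
heights <- c(1.62, 1.75, 1.80)   # made-up numeric values
people  <- c("Ana", "Ben", "Cy") # made-up character values

is.numeric(heights)    # TRUE
is.character(heights)  # FALSE
is.character(people)   # TRUE
length(heights)        # 3

# A list, by contrast, can mix types without converting them
mixed <- list(1.62, "Ana", TRUE)
is.list(mixed)         # TRUE
is.list(heights)       # FALSE
```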
- R switches between data types automatically for some operations
  - Logical -> Numeric -> Character | Logical -> Character
  - `sum(c(TRUE, FALSE)) == sum(c(1, 0))` returns `TRUE`
  - `sum(c(TRUE, FALSE))` returns `1`
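The conversion order above can be seen by mixing types inside `c()`. A minimal sketch; the variable names are made up for illustration.

```r
lgl_to_num <- c(TRUE, 2)    # logical -> numeric: TRUE becomes 1
num_to_chr <- c(1, "a")     # numeric -> character: 1 becomes "1"
lgl_to_chr <- c(TRUE, "a")  # logical -> character: TRUE becomes "TRUE"

lgl_to_num          # 1 2
num_to_chr          # "1" "a"
lgl_to_chr          # "TRUE" "a"
typeof(lgl_to_num)  # "double"
```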
- Data Frame - stores data sets
  - Row: individual records; Column: variables
  - A data frame can contain multiple types of data, but within a column every cell must be the same type of data
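A small sketch of a data frame with one type per column. The records are made up for illustration.

```r
# Each row is one (made-up) student; each column is a variable
students <- data.frame(
  name   = c("Ana", "Ben", "Cy"),
  age    = c(20L, 22L, 21L),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep names as character in older R versions too
)

nrow(students)           # 3 records
ncol(students)           # 3 variables
sapply(students, class)  # one type per column: character, integer, logical
```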
- Packages in R provide collections of functions and data sets beyond what ‘base R’ can do
  - Function - a shortcut to run a bunch of code
    - Parameter - a named setting the function defines for its expected input
    - Argument - the input actually provided
  - tidyverse - needs to be loaded in every problem set using the `library()` function
    - `read_csv()` is a function that loads data in; the resulting object type is called a tibble
    - `glimpse()` prints the number of rows and columns and lists each column's name, data type, and first few values (a summary of the data)
    - `head()` shows the top rows of the data (first 6 by default)
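A sketch tying these pieces together. It assumes the tidyverse is already installed; the function `rounded_mean()`, the temporary CSV file, and its contents are all made up for illustration.

```r
library(tidyverse)  # assumes the tidyverse package is installed

# `x` and `digits` are parameters; `digits = 1` is a preset (default) value
rounded_mean <- function(x, digits = 1) {
  round(mean(x), digits)
}

values <- c(1.234, 2.345, 3.456)
rounded_mean(values)              # one argument given; digits falls back to 1
rounded_mean(values, digits = 0)  # both arguments given explicitly

# Write a tiny made-up CSV to a temporary file so read_csv() has something to load
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ana,90", "Ben,85", "Cy,70"), csv_path)

scores <- read_csv(csv_path)  # the result is a tibble
glimpse(scores)               # rows, columns, and each column's type and first values
head(scores, 2)               # first 2 rows instead of the default 6
```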
- Pipe (`%>%`) - a tool that makes applying functions step by step easy
  - Ex. The tibble is piped in as the input to the `glimpse()` function
  - Tip 1: read the pipe as “AND THEN” when reading the code
  - Tip 2: use CMD + SHIFT + M to insert a pipe
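A minimal sketch of the pipe, assuming the tidyverse is installed. The tibble `scores` and its values are made up for illustration.

```r
library(tidyverse)  # provides tibble(), glimpse(), and the %>% pipe

scores <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(90, 85, 70)
)

# Without the pipe: the tibble is written inside the function call
glimpse(scores)

# With the pipe: "take scores AND THEN glimpse it"
scores %>% glimpse()

# Steps can be chained: take scores AND THEN keep the first 2 rows AND THEN count them
n_top <- scores %>%
  head(2) %>%
  nrow()
n_top  # 2
```

The piped and unpiped calls do the same thing; the pipe only changes how the code reads.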