- Introduction
- Syllabus
- Assignments
- Homework
- Labs
- Data Project
- Final exam
- Meetup Presentation
- The DATA606 R Package
- Using R Markdown
- Intro to Data (Chapter 1)
January 31, 2018
A little about me:
Syllabus and course materials are here: http://data606.net
We will use Blackboard to submit assignments.
I would like to use Github's Issue tracker for course discussions (this is my first semester trying this, so tell me how it goes).
Please submit PDF files and, if you used R Markdown, the Rmd file too.
See http://data606.net/schedule/ for an up-to-date calendar.
Start | Due Date | Chapter | Topic |
---|---|---|---|
Jan-29 | Feb-11 | 1 | Intro to Data |
Feb-12 | Feb-18 | 2 | Probability |
Feb-19 | Mar-4 | 3 | Distributions |
Mar-5 | Mar-18 | 4 | Foundation for Inference |
Mar-19 | Mar-25 | 5 | Inference for Numerical Data |
Mar-26 | Apr-8 | 6 | Inference for Categorical Data |
Apr-9 | Apr-22 | 7 | Linear Regression |
Apr-23 | May-6 | 8 | Multiple & Logistic Regression |
May-7 | May-16 | Navarro | Introduction to Bayesian Analysis |
May-17 | May-21 | | Final Exam |
DATA606 package: https://github.com/jbryer/DATA606Spring2018/issues
The DATA606 R Package
The package can be installed from GitHub using the devtools package.
devtools::install_github('jbryer/DATA606')
- library('DATA606'): Load the package
- vignette(package='DATA606'): Lists vignettes in the DATA606 package
- vignette('os3'): Loads a PDF of the OpenIntro Statistics book
- data(package='DATA606'): Lists data available in the package
- getLabs(): Returns a list of the available labs
- viewLab('Lab0'): Opens Lab0 in the default web browser
- startLab('Lab0'): Starts Lab0 (copies to getwd()), opens the Rmd file
- shiny_demo(): Lists available Shiny apps
R Markdown files are provided for all the labs. You can start a lab using the DATA606::startLab function.
However, you can create new R Markdown files in RStudio by clicking File > New File > R Markdown.
When working with files in R, there are two ways to specify paths: absolute paths (i.e., starting with C:/ on Windows or / on Mac/Linux) or relative paths (possibly without any directory information). With relative paths, where R looks depends on the working directory. You can get the working directory with getwd() and set it with setwd(). In RStudio, you can also set the working directory on the Files tab by clicking More, then Set as Working Directory.
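As a rough sketch of how relative paths depend on the working directory, the snippet below builds a relative and an absolute path to the same file. The directory name 'data' and the file name 'legosets.csv' are hypothetical and only there for illustration; no such file is assumed to exist.

```r
# 'data' and 'legosets.csv' are hypothetical names used only for illustration.
wd <- getwd()                                       # current working directory
relative_path <- file.path('data', 'legosets.csv')  # no leading / or drive letter
absolute_path <- file.path(wd, relative_path)       # what the relative path resolves to

relative_path
absolute_path

# setwd() changes where relative paths are resolved.
# It returns the previous working directory, so we can restore it afterwards.
old_wd <- setwd(tempdir())
getwd()
setwd(old_wd)
```

Using file.path() rather than pasting strings keeps the separator consistent across operating systems.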
We will use the lego R package in this class, which contains information about Lego sets manufactured from 1970 onward. The version loaded here has 6172 sets, including sets from 2015.
devtools::install_github("seankross/lego")
library(lego)
data(legosets)
str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6172 obs. of  14 variables:
##  $ Item_Number : chr  "10246" "10247" "10248" "10249" ...
##  $ Name        : chr  "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int  2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int  6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr  "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num  132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num  159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num  200 230 120 NA 13 ...
##  $ EUR_MSRP    : num  149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...
Descriptive statistics:
Plot types:
table(legosets$Availability, useNA='ifany')
##
##        LEGO exclusive    LEGOLAND exclusive         Not specified
##                   695                     2                  1795
##           Promotional Promotional (Airline)                Retail
##                   141                    12                  3120
##      Retail - limited               Unknown
##                   403                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##
##                         Blister pack  Box Box with backing card Bucket
##   LEGO exclusive                  45  147                     0      1
##   LEGOLAND exclusive               0    2                     0      0
##   Not specified                    0   20                     0      0
##   Promotional                      0   44                     0      0
##   Promotional (Airline)            0   11                     0      0
##   Retail                          53 2575                    16     30
##   Retail - limited                 2  302                     1      5
##   Unknown                          0    1                     0      0
##
##                         Canister Foil pack Loose Parts Not specified Other
##   LEGO exclusive               0         0          71             7     5
##   LEGOLAND exclusive           0         0           0             0     0
##   Not specified                0         5           0          1739     0
##   Promotional                  0         0           1             0     3
##   Promotional (Airline)        0         0           0             1     0
##   Retail                      78       285           0             0    28
##   Retail - limited             0         1           0             0     0
##   Unknown                      0         0           0             0     0
##
##                         Plastic box Polybag Shrink-wrapped Tag Tub
##   LEGO exclusive                  1     412              0   6   0
##   LEGOLAND exclusive              0       0              0   0   0
##   Not specified                   6      24              0   0   1
##   Promotional                     2      90              0   0   1
##   Promotional (Airline)           0       0              0   0   0
##   Retail                          0       4             18   0  33
##   Retail - limited                1      86              0   0   5
##   Unknown                         0       3              0   0   0
prop.table(table(legosets$Availability))
##
##        LEGO exclusive    LEGOLAND exclusive         Not specified
##          0.1126053143          0.0003240441          0.2908295528
##           Promotional Promotional (Airline)                Retail
##          0.0228451069          0.0019442644          0.5055087492
##      Retail - limited               Unknown
##          0.0652948801          0.0006480881
barplot(table(legosets$Availability), las=3)
barplot(prop.table(table(legosets$Availability)), las=3)
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)
Descriptive statistics:
Plot types:
mean(legosets$Pieces, na.rm=TRUE)
## [1] 215.1686
median(legosets$Pieces, na.rm=TRUE)
## [1] 82
var(legosets$Pieces, na.rm=TRUE)
## [1] 126876.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 356.1976
sd(legosets$Pieces, na.rm=TRUE)
## [1] 356.1976
fivenum(legosets$Pieces, na.rm=TRUE)
## [1] 0.0 30.0 82.0 256.5 5922.0
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 226.25
summary Function
summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0    30.0    82.0   215.2   256.2  5922.0     112
psych Package
library(psych)
describe(legosets$Pieces, skew=FALSE)
##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##     item                group1 vars    n      mean        sd min  max
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922
## X18    8               Unknown    1    4  27.50000  15.96872   6   44
##     range         se
## X11  3427  17.255643
## X12   218 109.000000
## X13  5194   7.397620
## X14   999   9.163772
## X15   193  13.572384
## X16  3803   5.299546
## X17  5921  32.522014
## X18    38   7.984360
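Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, for skewed distributions it is often more helpful to describe the center and spread with the median and IQR, and for symmetric distributions with the mean and SD.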
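To see the robustness of the median and IQR in action, here is a small sketch with made-up numbers (not the Lego data): the same sample is summarized with and without one extreme value.

```r
x <- c(1, 2, 3, 4, 5)   # small symmetric sample (made-up values)
x_out <- c(x, 100)      # same sample plus one extreme value

# Mean and SD change a lot when the outlier is added...
c(mean = mean(x), sd = sd(x))
c(mean = mean(x_out), sd = sd(x_out))

# ...while the median and IQR move only slightly.
c(median = median(x), IQR = IQR(x))
c(median = median(x_out), IQR = IQR(x_out))
```

The Pieces variable behaves the same way: its mean (215.2) sits far above its median (82) because of a few very large sets.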
stripchart(legosets$Pieces)
par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)
par(par.orig)
hist(legosets$Pieces)
With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.
hist(log(legosets$Pieces))
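One way to check what the log transformation does to a skewed variable is to compare the mean and median before and after. The sketch below uses simulated log-normal data as a stand-in (an assumption for illustration, not the Lego data).

```r
# Simulated right-skewed data (hypothetical; not the Lego data).
set.seed(606)
x <- rlnorm(1000, meanlog = 4, sdlog = 1)

# On the raw scale the long right tail pulls the mean above the median.
c(mean = mean(x), median = median(x))

# On the log scale the distribution is roughly symmetric,
# so the mean and median are close together.
c(mean = mean(log(x)), median = median(log(x)))
```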
plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')
plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')
boxplot(legosets$Pieces)
boxplot(log(legosets$Pieces))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 1 is not drawn
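The -Inf in the warning comes from sets with zero pieces, since log(0) is -Inf (the minimum of Pieces is 0). Two possible ways to handle this are sketched below on a small made-up vector; which one is appropriate depends on the analysis.

```r
# Hypothetical piece counts, including a set with zero pieces.
pieces <- c(0, 30, 82, 256, 5922)

log(pieces)              # first value is -Inf because log(0) = -Inf
log(pieces[pieces > 0])  # option 1: drop the zero counts before transforming
log1p(pieces)            # option 2: log(1 + x), which is finite at zero
```

With the real data, the first option would look like boxplot(log(legosets$Pieces[which(legosets$Pieces > 0)])).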
plot(legosets$Pieces, legosets$USD_MSRP)
legosets[which(legosets$USD_MSRP >= 400),]
## # A tibble: 4 x 14
##   Item_Number                                   Name  Year        Theme
##         <chr>                                  <chr> <int>        <chr>
## 1     2000430             Identity and Landscape Kit  2013 Serious Play
## 2     2000431                        Connections Kit  2013 Serious Play
## 3     2000409                 Window Exploration Bag  2010 Serious Play
## 4       10179 Ultimate Collector's Millennium Falcon  2007    Star Wars
## # ... with 10 more variables: Subtheme <chr>, Pieces <int>,
## #   Minifigures <int>, Image_URL <chr>, GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>
legosets[which(legosets$Pieces >= 4000),]
## # A tibble: 4 x 14
##   Item_Number                                   Name  Year           Theme
##         <chr>                                  <chr> <int>           <chr>
## 1       10214                           Tower Bridge  2010 Advanced Models
## 2     2000409                 Window Exploration Bag  2010    Serious Play
## 3       10189                              Taj Mahal  2008 Advanced Models
## 4       10179 Ultimate Collector's Millennium Falcon  2007       Star Wars
## # ... with 10 more variables: Subtheme <chr>, Pieces <int>,
## #   Minifigures <int>, Image_URL <chr>, GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>
plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)
There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably a difficult question to answer.
"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"
John Tukey
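A rough way to try this comparison yourself is to draw the same values as pie charts and as bar plots side by side. The counts below are made up for illustration (they are not the values behind the charts on the slide), and the group and color names are hypothetical.

```r
# Hypothetical preference counts for five colors in three groups (made-up values).
prefs <- rbind(
  A = c(17, 18, 20, 22, 23),
  B = c(20, 20, 19, 21, 20),
  C = c(23, 22, 20, 18, 17)
)
colnames(prefs) <- c('Red', 'Blue', 'Green', 'Yellow', 'Purple')

par.orig <- par(mfrow = c(2, 3))
# Top row: pie charts, where similar slice sizes are hard to compare.
for (g in rownames(prefs)) pie(prefs[g, ], main = paste('Pie', g))
# Bottom row: bar plots of the same values, where the patterns are easier to see.
for (g in rownames(prefs)) barplot(prefs[g, ], main = paste('Bar', g), las = 3)
par(par.orig)
```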