% Rev.: 99d97bde6db7c8945ce8b535b76b07d3a321c020
% Size: 27,803 bytes
% Time: 2018-04-26 05:38:19
% Author: Lorenzo Isella
% Log message: I did some extra work on the presentation.
\documentclass[12pt]{beamer}
\usepackage{graphicx}
% \usepackage[T1]{fontenc}
\usepackage{emerald}
\usepackage{tikz}
% \usepackage{cprotect}
\usepackage{listings}
\lstset{breaklines=true}
% <<setup, include=FALSE>>=
% library(knitr)
% render_listings()
% @
\usetheme{default}
\beamertemplatenavigationsymbolsempty
\hypersetup{pdfpagemode=UseNone} % don't show bookmarks on initial view
% named colors
\definecolor{offwhite}{RGB}{249,242,215}
\definecolor{foreground}{RGB}{255,255,255}
\definecolor{background}{RGB}{24,24,24}
\definecolor{title}{RGB}{107,174,214}
\definecolor{gray}{RGB}{155,155,155}
\definecolor{subtitle}{RGB}{102,255,204}
\definecolor{hilight}{RGB}{102,255,204}
\definecolor{vhilight}{RGB}{255,111,207}
\definecolor{lolight}{RGB}{155,155,155}
%\definecolor{green}{RGB}{125,250,125}
% use those colors
\setbeamercolor{titlelike}{fg=title}
\setbeamercolor{subtitle}{fg=subtitle}
\setbeamercolor{institute}{fg=gray}
\setbeamercolor{normal text}{fg=foreground,bg=background}
\setbeamercolor{item}{fg=foreground} % color of bullets
\setbeamercolor{subitem}{fg=gray}
\setbeamercolor{itemize/enumerate subbody}{fg=gray}
\setbeamertemplate{itemize subitem}{{\textendash}}
\setbeamerfont{itemize/enumerate subbody}{size=\footnotesize}
\setbeamerfont{itemize/enumerate subitem}{size=\footnotesize}
%% Grey (gray) Background Colour
\setbeamercolor{background canvas}{bg=gray!30!black}
%%%%Uncomment the part in the frame if you want the blackboard effect
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Random Dust Trails
%\pgfmathsetseed{\number\pdfrandomseed} % seed for random generator
% \setbeamertemplate{background}{
% \begin{tikzpicture}
% \useasboundingbox (0,0) rectangle (\the\paperwidth, \the\paperheight);
% \foreach \i in {1,...,30} {
% \pgfmathsetmacro{\x}{random(0,10000)/5000-1}%
% \pgfmathsetmacro{\y}{random(0,10000)/10000-0.1}%
% \pgfmathsetmacro{\r}{random(0,10000)/1000-5}%
% \rotatebox{\r}{
% \pgftext[at=\pgfpoint{\x\paperwidth}{\y\paperheight}, left, base]{\includegraphics[width=\textwidth]{paintstroke.png}}
% }
% };
% \end{tikzpicture}
% }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%% Uncomment the part in the frame if you want to use the Augie font
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Now try to set the Augie font everywhere
% \setbeamerfont{framesubtitle}{series=\ECFAugie}
% \setbeamerfont{title}{series=\ECFAugie}
% \setbeamerfont{caption}{series=\ECFAugie}
% \setbeamerfont{author}{series=\ECFAugie}
% \setbeamerfont{institute}{series=\ECFAugie}
% \setbeamerfont{date}{series=\ECFAugie}
% \setbeamerfont{frametitle}{series=\ECFAugie}
% \setbeamerfont{item}{series=\ECFAugie}
% %% use a small dash ('-') for a bulletpoint list
% \setbeamertemplate{itemize item}{\usebeamercolor[fg]{item}\small\ECFAugie{-}}
% see https://tex.stackexchange.com/questions/320223/how-to-enforce-a-font-series-in-beamer-for-normal-default-text/320244
% \setbeamerfont{normal text}{series= \ECFAugie}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% page number
\setbeamertemplate{footline}{%
\raisebox{5pt}{\makebox[\paperwidth]{\hfill\makebox[20pt]{\color{gray}
\scriptsize\insertframenumber}}}\hspace*{5pt}}
% add a bit of space at the top of the notes page
\addtobeamertemplate{note page}{\setlength{\parskip}{12pt}}
% a few macros
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\ig}{\includegraphics}
\newcommand{\subt}[1]{{\footnotesize \color{subtitle} {#1}}}
% Compile with Rscript -e "library(knitr); knit('./R-intro-code.Rnw')"
\title{Introduction to R (with Hands-on Applications!)}
\subtitle{A researcher's perspective}
\author{ {Lorenzo Isella}}
\institute{DG TRADE, G2, Chief Economist Team}
\AtBeginDocument{\usebeamerfont{normal text}}
\begin{document}
\frame{
\titlepage}
\begin{frame}
\frametitle{Harsh Reality}
% \framesubtitle{Test Frame}
% \subt{An optional subtitle}
By the end of this training you will \underline{not}
\begin{itemize}
\item be a statistician/data analyst
\item be an extremely proficient R user
\item dump Excel for good.
\end{itemize}
You do not become an expert at using any non-trivial tool in 10
hours.
So what can you expect to get from this training?
\end{frame}
\begin{frame}
\frametitle{What to Expect from this Training}
% \framesubtitle{Test Frame}
% \subt{An optional subtitle}
On the other hand, by the end of this training you will
\begin{itemize}
\item know that there is a tool that makes it easier to repeat simple, tedious and error-prone data tasks
\item know that data analytics is not about typing a handful of fancy
Excel commands
\item know that you are not alone in your data struggle. Someone else
most likely had the same issue tormenting you. With a bit of luck,
she has already coded in R the solution you need!
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Overview of the Training}
\begin{itemize}
\item Philosophy of the training: your goal is to get better,
faster and more productive at data analysis.
\item you are not interested in the 6 different kinds of atomic
vectors in R,
\item so we will rush through the basics and
\item plunge into the tidyverse. The tidyverse is a collection of tools
for powerful and expressive data analysis and visualisation.
\item we will barely scratch the surface of many topics, but you
will have an idea of state-of-the-art R for data mining.
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{R and Statistical Computing}
R is a statistical environment bringing you
\begin{itemize}
\item an effective data handling and storage facility,
\item a suite of operators for calculations on arrays, in particular matrices,
\item a large, coherent, integrated collection of intermediate tools for data analysis,
\item graphical facilities for data analysis and display either directly at the computer or on
hardcopy, and
\item a well developed, simple and effective programming language
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Applications of R to your Daily Workflow}
\begin{itemize}
\item R is free and cross-platform. It runs on your Windows, Mac and
Linux machine. No fees, no trial periods. Visit {\url{https://www.r-project.org/}}
\item R can be used to analyse your data and
produce \underline{publication-quality} visualisations.
\item R is extended by hundreds of high-quality packages often
developed by leading specialists in their field.
\item R has a large user base ($>$ 1,000,000 users) and it is
\emph{de facto} a lingua franca for computational statistics.
% \item R runs on Windows, Linux and MAC computers and...it is all
% free!
\end{itemize}
\end{frame}
\begin{frame}[fragile]
\frametitle{Basic Operations in R}
At the very least, you can use R as a calculator.
% <<foo, fig.height=4>>=
<<highlight=T>>=
1+1
2/3
@
but there is much more to it.
\end{frame}
\begin{frame}[fragile]
\frametitle{Basic Plotting in R}
One of the strengths of R is the ease of generating good-looking plots
<<my-label, fig.height=3.4, eval=TRUE, dev='png'>>=
set.seed(1213) # for reproducibility
x <- cumsum(rnorm(100))
plot(x, type = 'l') # Brownian motion
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Basic Statistics in R}
You have plenty of built-in functions to calculate your statistics
<<my-label2 , highlight=T, eval=TRUE>>=
set.seed(1213) # for reproducibility
x <- cumsum(rnorm(100))
mean(x)
median(x)
sum(x)
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Advanced Tools for Data Analysis}
We group the data set of flights departing from the New York City airports
by individual plane (tailnum) and then
summarise each plane by counting the number of flights (count = n())
and computing the average distance (dist = mean(distance, na.rm =
TRUE)) and arrival delay (delay = mean(arr{\verb|_|}delay, na.rm = TRUE)).
<< highlight=T, eval=TRUE,message=F >>=
library(nycflights13)
library(tidyverse)
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
@
This will be made clear later on. Just notice this is almost human-readable.
\end{frame}
\begin{frame}[fragile]
\frametitle{Data Types in R}
The \underline{basic} data types in R are
\begin{itemize}
\item character: \verb|"a"|, \verb|"swc"|
\item numeric: 2, 15.5
\item integer: 2L (the L tells R to store this as an integer)
\item logical: TRUE, FALSE
\item complex: 1+4i (complex numbers with real and imaginary parts)
\end{itemize}
You can also create your own data types, but we will not discuss this
in these notes. Later on, we will meet tibbles -- the tidyverse
reinterpretation of the basic R data frames.
\end{frame}
\begin{frame}[fragile]
\frametitle{How Data is Structured in R}
R operates on named data structures
\begin{itemize}
\item vectors
\item lists
\item matrices
\item arrays
\item data frames
\end{itemize}
and in R you can write \emph{functions} to powerfully extend the language.
\end{frame}
\begin{frame}[fragile]
\frametitle{Vectors 1/2}
A vector is a sequence of data elements of the same basic type.
<< eval=TRUE>>=
v1 <- c(2, 3, 5) # numeric values
v2 <- c(TRUE, FALSE, TRUE) # logical values
v3 <- c("aa", "bb", "cc", "dd", "ee") ## strings
@
You can do e.g. arithmetic on numeric vectors
<< eval=TRUE, highlight=FALSE>>=
a <- c(2, 3, 5) # numeric values
b <- c(5, -1, 6) # numeric values
a+2
a+b
length(a) # number of elements of a
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Vectors 2/2}
You can join and/or subset vectors and you have facilities to easily
generate some sequences
<< eval=TRUE, highlight=T>>=
a <- c(2, 3, 5)
b <- c(5, -1, 6)
c(a,b)
a[2:3]
seq(2, 8, by= 2)
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Lists 1/3}
A list generalises the idea of a vector. It can hold items of
different types. The name tag is optional
\vspace*{-0.2cm}
<< eval=TRUE, highlight=F>>=
Lst <- list(name="Fred", wife="Mary",
no.children=3,child.ages=c(4,7,9))
Lst
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Lists 2/3}
List size can be increased on the fly. List contents can be accessed either by index or by name
<< eval=TRUE, highlight=F>>=
Lst$name
Lst[[1]]
Lst[1]
@
Note the difference between $[[\cdots]]$ (extracts an element
from a list, drops the name tag) and $[\cdots]$ (creates a sublist, keeps
the name tag).
\end{frame}
\begin{frame}[fragile]
\frametitle{Lists 3/3}
Lists can be concatenated and increased on the fly
\vspace*{-0.3cm}
<< eval=TRUE, highlight=F>>=
ls1 <- list("aa", 2.3)
ls2 <- list("bb", 4.5)
ls1 <- c(ls1, ls2)
ls1
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Matrices and Arrays 1/3}
An array can be considered as a multiply subscripted collection of data entries, for example
numeric. A matrix is a 2-dimensional array, but it is such an important
special case that R contains many operators and functions that are
available only for matrices.
<< eval=TRUE, highlight=F>>=
x <- array(1:4, dim=c(2,2))
x
y <- matrix(1:4, 2,2)
y
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Matrices and Arrays 2/3}
We can slice a matrix by selecting its columns/rows or a single entry
<< eval=TRUE, highlight=F>>=
z <- matrix(5:8, 2,2)
z
z[2,]
z[ , 1]
z[2,1]
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Matrices and Arrays 3/3}
We can join matrices by rows and columns
<< eval=TRUE, highlight=F>>=
cbind(y,z)
rbind(y,z)
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Data Frames 1/2}
Data frames are similar to tables in databases. All values within a column have the
same type, and the columns can have header names. A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
<< eval=TRUE, highlight=T>>=
people = c("Alex", "Barb", "Carl") # col 1
ages = c(19, 29, 39) # col 2
df = data.frame(people, ages) # create
names(df) = c("NAME", "AGE") # headers
df
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Data Frames 2/2}
We can slice a data frame like a matrix or also select its columns by name
<< eval=TRUE, highlight=T>>=
df[ ,1]
df$NAME
@
Internally, R sees a data frame as a list with class ``data.frame''.
\end{frame}
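\begin{frame}[fragile]
\frametitle{Data Frames as Lists: a Quick Check}
The list nature of a data frame can be checked directly. A minimal
sketch: people\verb|_|df is a hypothetical name, chosen here so that
the objects from the previous slides are left untouched.
<<eval=TRUE, highlight=TRUE>>=
# rebuild the small table from the previous slides
people_df <- data.frame(NAME = c("Alex", "Barb", "Carl"),
                        AGE = c(19, 29, 39))
is.list(people_df)  # a data frame is a list ...
class(people_df)    # ... with class "data.frame"
length(people_df)   # one list element per column
people_df[["AGE"]]  # list-style access to a column
@
\end{frame}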
\begin{frame}[fragile]
\frametitle{Mutability of Data Structures}
Of course all the data structures in R can be altered. We use ``='' or
``\verb|<-|'' to assign values.
See for instance
<< eval=TRUE, highlight=T>>=
x <- c(1,2,3)
x[2] <- -4
x
#and sometimes the puzzling
y =2
y= y +7 # new y = old y +7
y
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Mutability of Data Structures -- Small Caveat}
We saw that ``='' can be used to assign a value. Instead, ``==''
is a \underline{comparison} operator that checks, element by element,
whether two values are equal.
See for instance
<< eval=TRUE, highlight=T>>=
x = 2
x
x == 2
x == 3
@
\end{frame}
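\begin{frame}[fragile]
\frametitle{Comparing Whole Vectors}
On vectors, ``=='' compares element by element and returns one
logical value per position. A small sketch with made-up vectors (u and
v are hypothetical names); identical() and all() give a single answer
for the whole object.
<<eval=TRUE, highlight=TRUE>>=
u <- c(1, 2, 3)
v <- c(1, 5, 3)
u == v           # one comparison per element
all(u == v)      # TRUE only if every element matches
identical(u, v)  # compares the two objects as a whole
@
\end{frame}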
\begin{frame}[fragile]
\frametitle{Functions in R 1/2}
A function is defined by an assignment of the form
<< eval=F, highlight=T >>=
name <- function(arg_1, arg_2, ...) expression
@
The expression is an R expression that uses the arguments, arg\verb|_|i, to calculate a value. The value of the expression is the value returned by the function.
mean(), sum(), cumsum() and c() are examples of built-in R functions we have
already met.
\end{frame}
\begin{frame}[fragile]
\frametitle{Functions in R 2/2}
Example functions of one and two variables.
<< eval=T, highlight=T >>=
double <- function(x){ x*2}
double_and_triple <- function(x,y) {c(x*2, y*3) }
a <-7
b <- 5
double(a)
double_and_triple(a,b)
@
\end{frame}
% \begin{frame}[fragile]
% \frametitle{Functions in R 3/2}
% A technical remark: functions do \underline{not} modify their own arguments
% \end{frame}
\begin{frame}[fragile]
\frametitle{Data Input and Output in R}
\begin{itemize}
\item R provides a number of facilities to import external data in different
formats (csv file, Excel workbook, SQL database, Stata dta file, etc.).
\item I personally work most of the time with csv files, which can be
read and written by Excel. For importing and manipulating data, I recommend the
tidyverse collection of packages.
\end{itemize}
<< eval=F, highlight=T>>=
library(tidyverse)
# read data
mydata<-read_csv("filename.csv")
# write data
write_csv(mydata, "my_output_data.csv")
@
\end{frame}
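\begin{frame}[fragile]
\frametitle{Data Input and Output: a Round Trip}
A self-contained sketch of writing a csv file and reading it back. The
table toy and its values are made up for illustration, and the file
goes to a temporary location rather than to a real project folder.
<<eval=TRUE, highlight=TRUE, message=FALSE>>=
library(tidyverse)
# hypothetical toy data, for illustration only
toy <- tibble(id = 1:3, score = c(2.5, 3.0, 4.5))
csv_path <- tempfile(fileext = ".csv") # temporary file
write_csv(toy, csv_path)               # write data
toy_back <- read_csv(csv_path)         # read it back
toy_back
@
\end{frame}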
\begin{frame}[fragile]
\frametitle{Long Computations in R}
R is a functional language, which means that your code often contains a lot of parentheses, ( and ). Complex code often requires nesting those parentheses, which makes your R code hard to read and understand.
<< eval=T, highlight=T>>=
## generate some arbitrary data
x<-c(1e4, 1.1e4, 2.3e4, 1.8e4,7e4,4.1e4)
# Compute the logarithm of `x`, return suitably
# lagged and iterated differences,
# compute the exponential function
# and round the result
round(exp(diff(log(x))), 1)
@
\end{frame}
% \begin{frame}[fragile]
% \frametitle{Long Computations in R}
% Computations can often result in expressions which are hard to read.
% << eval=T, highlight=T>>=
% ## generate some arbitrary data
% x<-c(1e4, 1.1e4, 2.3e4, 1.8e4,7e4,4.1e4)
% # Compute the logarithm of `x`, return suitably
% # lagged and iterated differences,
% # compute the exponential function
% # and round the result
% round(exp(diff(log(x))), 1)
% @
% Wouldn't it be nice to have a way to express these operations which is
% easy to read and understand?
% \end{frame}
\begin{frame}[fragile]
\frametitle{Enter the Pipe Operator}
The pipe operator \verb|%>%| has two fundamental properties
\begin{enumerate}
\item Function $f(x)$ can be rewritten as $x$ \verb|%>%| $f$
<< eval=T, highlight=F >>=
x <- 10
# Compute the logarithm of `x`
log(x)
x %>% log()
@
\item Function $f(x, y)$ can be rewritten as $x$ \verb|%>%| $f(y)$
<< eval=T, highlight=F >>=
# Round pi
round(pi, 6)
pi %>% round(6)
@
\end{enumerate}
\end{frame}
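\begin{frame}[fragile]
\frametitle{Piping into Another Argument}
By default \verb|%>%| feeds the left-hand side into the \emph{first}
argument of the next function. A small sketch, assuming the magrittr
pipe that the tidyverse loads: the placeholder \verb|.| sends the value
to a different argument. The name n\verb|_|digits is hypothetical.
<<eval=TRUE, highlight=TRUE>>=
library(tidyverse)
# default: the left-hand side becomes the first argument
c(1, 4, 9) %>% sqrt() %>% sum()
# placeholder: the left-hand side is used where the dot appears
n_digits <- 2
n_digits %>% round(pi, digits = .)
@
\end{frame}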
\begin{frame}[fragile]
\frametitle{Why was This Invented at All?}
The pipe operator \verb|%>%| provides you with a number of benefits
\begin{enumerate}
\item You'll structure the sequence of your data operations from left to right, as opposed to from the inside out;
\item You'll avoid nested function calls;
\item You'll minimize the need for local variables and function definitions; and
\item You'll make it easy to add steps anywhere in the sequence of operations.
\end{enumerate}
<< eval=F, highlight=T >>=
log(sin(sqrt(x))) # becomes
x %>% sqrt() %>%
sin() %>%
log() #much easier to follow!
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Application to the Previous Example}
This sounds very abstract, but let us see \verb|%>%| in action
<< eval=T, highlight=T>>=
library(tidyverse)
x<-c(1e4, 1.1e4, 2.3e4, 1.8e4,7e4,4.1e4)
x %>% log() %>%
diff() %>%
exp() %>%
round(1)
@
Now you finally understand what is going on. Cleaner code is easier to
share and extend.
\end{frame}
\begin{frame}[fragile]
\frametitle{Modify a Sequence of Computations}
Now that the operations are laid out as a sequence, it is much easier to modify them whenever we need to. For instance
<< eval=T, highlight=T>>=
# Compute the logarithm of `x`, return suitably
# lagged and iterated differences,
# compute the mean
# and round the result with two digits
library(tidyverse)
x %>% log() %>%
diff() %>%
mean() %>%
round(2)
@
\end{frame}
\begin{frame}[fragile]
\frametitle{Tidyverse and R}
\begin{itemize}
\item R is extended by packages, i.e. collections of tools/functions
for a variety of purposes
\item The tidyverse (\url{https://www.tidyverse.org/}) is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
\item Personal opinion: it will take you some time to understand
the tidyverse, but then you will never look back.
\end{itemize}
\end{frame}
\begin{frame}[fragile]
\frametitle{dplyr -- Data Manipulation 1/2}
dplyr (part of the tidyverse family) is a \underline{grammar of data manipulation}.
When working with data you must
\begin{itemize}
\item Figure out what you want to do.
\item Describe those tasks in the form of a computer program.
\item Execute the program.
\end{itemize}
The dplyr package makes these steps fast and easy
\begin{itemize}
\item By constraining your options, it helps you think about your data manipulation challenges.
\item It provides simple ``verbs'', functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
\item It uses efficient backends, so you spend less time waiting for the computer.
\end{itemize}
% filter() to select cases based on their values.
% arrange() to reorder the cases.
% select() and rename() to select variables based on their names.
% mutate() and transmute() to add new variables that are functions of existing variables.
% summarise() to condense multiple values to a single value.
% sample_n() and sample_frac() to take random samples.
\end{frame}
\begin{frame}[fragile]
\frametitle{dplyr -- Data Manipulation 2/2}
dplyr is a grammar because it provides verbs that help you solve the most common data manipulation challenges:
\begin{itemize}
\item mutate() adds new variables that are functions of existing variables
\item select() picks variables based on their names.
\item filter() picks cases based on their values.
\item summarise() reduces multiple values down to a single summary.
\item arrange() changes the ordering of the rows.
\item group\verb|_|by() allows you to perform any operation ``by group''.
\end{itemize}
This works beautifully with the pipe operator.
\end{frame}
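\begin{frame}[fragile]
\frametitle{dplyr Verbs: a First Taste}
A short sketch of mutate() and arrange() chained with the pipe, using
the flights table from nycflights13 (distance is in miles, air\verb|_|time
in minutes). The names fastest and speed\verb|_|mph are made up for this
illustration.
<<eval=TRUE, highlight=TRUE, message=FALSE>>=
library(nycflights13)
library(tidyverse)
fastest <- flights %>%
  select(carrier, distance, air_time) %>%        # keep 3 columns
  mutate(speed_mph = distance / air_time * 60) %>% # new variable
  arrange(desc(speed_mph)) %>%                   # fastest first
  head(3)
fastest
@
\end{frame}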
\begin{frame}[fragile]
\frametitle{Example with Balance of Payment Data}
<< highlight=T, eval=TRUE,message=F, warning=F >>=
library(tidyverse)
df <- read_csv("bop_flow2.csv") %>%
  mutate(Value = as.numeric(Value)) # non-numeric entries become NA
@
Let us take a glimpse at the resulting table (only a few rows are shown)
% df<-read_csv("bop_flow2.csv",col_types = cols(Value = "i"))
\begin{table}[ht]
\centering
\scalebox{0.7}{
\begin{tabular}{rlllll}
\hline
TIME & GEO & CURRENCY & NACE\_R2 & STK\_FLOW & STK\_FLOW\_LABEL \\
\hline
2016 & EU28 & Million euro & TOTAL & ASS & Assets \\
2016 & EU28 & Million euro & TOTAL & ASS & Assets \\
2016 & EU28 & Million euro & TOTAL & ASS & Assets \\
\hline
\end{tabular}
}
\end{table}
\begin{table}[ht]
\centering
\scalebox{0.7}{
\begin{tabular}{rllllr}
\hline
TIME & ENTITY & FDI\_ITEM & FDI\_ITEM\_LABEL & PARTNER & Value \\
\hline
2016 & TOTAL & DO\_\_D\_\_F & Direct investment abroad (DIA) & CH & NA \\
2016 & TOTAL & DO\_\_D\_\_F & Direct investment abroad (DIA) & TR & NA \\
2016 & TOTAL & DO\_\_D\_\_F & Direct investment abroad (DIA) & RU & NA \\
\hline
\end{tabular}
}
\end{table}
\end{frame}
\begin{frame}[fragile]
\frametitle{dplyr Verbs in Action 1/4}
In 2015, how many million euros did the EU28 (GEO) invest
(FDI\verb|_|ITEM is DO\verb|_|\verb|_|D\verb|_|\verb|_|F; ENTITY is TOTAL) in manufacturing
(NACE\verb|_|R2 is C) in Japan (PARTNER is JP) as outward net foreign
direct investment (STK\verb|_|FLOW is NO)?
<< highlight=T, eval=TRUE,message=F >>=
library(tidyverse)
manu_JP <- df %>%filter(TIME==2015, GEO=="EU28",
STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
ENTITY=="TOTAL",PARTNER=="JP", NACE_R2=="C") %>%
select(TIME, GEO, PARTNER, NACE_R2, Value)
manu_JP
@
\end{frame}
\begin{frame}[fragile]
\frametitle{dplyr Verbs in Action 2/4}
And the total FDI to the US for all years
<< highlight=T, eval=TRUE,message=F >>=
library(tidyverse)
FDI_US <- df %>%filter( GEO=="EU28",
STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
ENTITY=="TOTAL",PARTNER =="US",NACE_R2=="FDI") %>%
select(TIME, GEO, PARTNER, NACE_R2, Value)
FDI_US
@
\end{frame}
\begin{frame}[fragile]
\frametitle{dplyr Verbs in Action 3/4}
And if you want the average FDI to the US along the years
<< highlight=T, eval=TRUE,message=F >>=
library(tidyverse)
FDI_US_mean <- df %>%filter( GEO=="EU28",
STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
ENTITY=="TOTAL",PARTNER =="US", NACE_R2=="FDI")%>%
select(TIME, GEO, PARTNER, NACE_R2, Value) %>%
summarise(mean_FDI_to_US=mean(Value))
FDI_US_mean
@
\end{frame}
\begin{frame}[fragile]
\frametitle{dplyr Verbs in Action 4/4}
Now you want to do the same for US and India in one go
\vspace*{-0.2cm}
<< highlight=T, eval=TRUE,message=F >>=
library(tidyverse)
FDI_US_IN <- df %>%filter( GEO=="EU28",
STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
ENTITY=="TOTAL",PARTNER %in% c("US", "IN"),
NACE_R2=="FDI")%>%
select(TIME, GEO, PARTNER, NACE_R2, Value) %>%
group_by(PARTNER) %>%
summarise(mean_FDI=mean(Value))
FDI_US_IN
@
\end{frame}
\begin{frame}[fragile]
\frametitle{dplyr -- Final Thoughts}
\begin{itemize}
\item we barely scratched the surface of dplyr
\item but we have already seen how to filter rows, select columns and
compute statistics by group
\item thanks to the pipe operator, most of the code that you write
is reusable and readable
\item you do not worry about cells, indexes, etc., but you think
more about the questions you want to pose to your data.
\end{itemize}
\end{frame}
\begin{frame}[fragile]
\frametitle{Tidy Data}
The tidyverse is named after the tidy data format. In tidy data
\begin{enumerate}
\item Each variable forms a column.
\item Each observation forms a row.
\item Each type of observational unit forms a table.
\end{enumerate}
Tidy data makes it easy for an analyst or a computer to extract needed
variables because it provides a standard way of structuring a
dataset. You do not need different strategies to extract different variables.
The FDI flow data set was cast in a tidy format.
Whenever a data set has the years spread across the column headers,
you can be sure that the data set is messy (not tidy).
\end{frame}
\begin{frame}[fragile]
\frametitle{Tidying Messy Datasets}
Real data sets are often messy in every conceivable way, e.g.
\begin{itemize}
\item Column headers are values, not variable names.
\item Multiple variables are stored in one column.
\item Variables are stored in both rows and columns.
\item Multiple types of observational units are stored in the same table.
\item A single observational unit is stored in multiple tables.
\end{itemize}
Tidying messy data sets is in itself a large topic; we'll focus only
on one example in the following.
\end{frame}
\begin{frame}[fragile]
\frametitle{Column headers are values, not variable names}
This is one of the most common cases. See for instance some data about
income and religion in the US
<< highlight=T, eval=TRUE,message=F >>=
library(tidyverse)
pew <-read_csv("income_religion.csv")
@
\begin{table}[ht]
\centering
\scalebox{0.7}{
\begin{tabular}{lrrrrrr}
\hline
religion & $<$\$10k & \$10-20k & \$20-30k & \$30-40k & \$40-50k & $>$\$50k \\
\hline
Agnostic & 27 & 34 & 60 & 81 & 76 & 137 \\
Atheist & 12 & 27 & 37 & 52 & 35 & 70 \\
Buddhist & 27 & 21 & 30 & 34 & 33 & 58 \\
Catholic & 418 & 617 & 732 & 670 & 638 & 1116 \\
Don't know & 15 & 14 & 15 & 11 & 10 & 35 \\
Evangelical & 575 & 869 & 1064 & 982 & 881 & 1486 \\
Hindu & 1 & 9 & 7 & 9 & 11 & 34 \\
Historically black & 228 & 244 & 236 & 238 & 197 & 223 \\
Jehovah's Witnesses & 20 & 27 & 24 & 24 & 21 & 30 \\
Jewish & 19 & 19 & 25 & 25 & 30 & 95 \\
\hline
\end{tabular}
}
\end{table}
This data set has three variables: religion, income class and frequency.
Only religion is stored in a proper column. The income classes are spread
across the column headers (those headers are values, not variable names)
and the frequencies fill the cells.
\end{frame}
\begin{frame}[fragile]
\frametitle{Tidying the Data Set}
To tidy the pew data set, we need to \underline{gather} the non-variable columns into a two-column key-value pair.
\end{frame}
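\begin{frame}[fragile]
\frametitle{Gathering Columns: a Sketch}
A minimal sketch of the gathering step with tidyr's gather(). To stay
self-contained it rebuilds by hand two rows and three income columns of
the table shown earlier; pew\verb|_|small and pew\verb|_|tidy are
hypothetical names. In recent tidyr versions pivot\verb|_|longer() is
the recommended replacement for gather().
<<eval=TRUE, highlight=TRUE, message=FALSE>>=
library(tidyverse)
# a hand-typed excerpt of the income and religion table
pew_small <- tibble(religion = c("Agnostic", "Atheist"),
                    `<$10k` = c(27, 12),
                    `$10-20k` = c(34, 27),
                    `$20-30k` = c(60, 37))
# every column except religion goes into a key-value pair
pew_tidy <- pew_small %>%
  gather(income, frequency, -religion)
pew_tidy
@
\end{frame}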
\begin{frame}[fragile]
\frametitle{Linear Models in R}
\begin{itemize}
\item R has everything you need for sophisticated statistical models (linear
models, random forests, kernel methods, etc.)
\item however, linear models are (ab)used almost everywhere due to
their ease of implementation
\item in R their implementation is as simple as
<< eval=F, highlight=T>>=
my_lin_model<-lm(y~x1+x2+x3, data=mydata)
@
\end{itemize}
\end{frame}
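\begin{frame}[fragile]
\frametitle{Linear Models in R: a Worked Sketch}
A runnable sketch of the lm() call on mtcars, a data set shipped with R
(it is not part of the training data and is used here only as an
assumption-free example). The name fit is hypothetical.
<<eval=TRUE, highlight=TRUE>>=
# fuel consumption explained by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)               # intercept and the two slopes
summary(fit)$r.squared  # share of variance explained
@
The formula reads ``mpg as a function of wt and hp''; summary(fit)
prints the full regression table.
\end{frame}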
\end{document}