Ch1: General Background in R

1. Introduction to R objects and indexing techniques

1.1. Variable assignment and basic types

R has two assignment operators: 1. <- (arrow) and 2. = (equals sign). Google’s R Style Guide recommends <- for assignment. Comments in R start with # (hash sign).
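As a quick check of the two assignment forms, the following minimal sketch assigns the same value both ways. The variable names a and b are chosen only for illustration.

```r
a <- 3   # assignment with the arrow (the recommended style)
b = 3    # assignment with the equals sign
identical(a, b)  # TRUE: both forms create the same object
```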

The most basic unit of an R object is the vector, created with c() (c stands for "combine"), with elements separated by commas. Vectors come in three basic classes: 1. numeric vectors, 2. character vectors, and 3. logical vectors.

# numeric vector
x <- c(4.39, 2.11, 3.17)
x

[1] 4.39 2.11 3.17

class(x) # class(x) reports the class of an object

[1] "numeric"

# character vector, delimited by double quotes " " or single quotes ' '
y <- c("apple", "book", "cat")
y

[1] "apple" "book"  "cat"

class(y)

[1] "character"

# logical vector, written TRUE / FALSE, which can be abbreviated to T / F
z <- c(TRUE, FALSE, TRUE)
z

[1]  TRUE FALSE  TRUE

class(z)

[1] "logical"

1.2. Converting vector classes

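A vector can hold only one class at a time. If elements of several classes are given in one vector, R coerces them automatically in the following order of precedence: character > numeric > logical. You can also convert the class yourself with as.character, as.numeric, and as.logical.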

# a vector allows only one class (character > numeric > logical)
c(1, 2, "three") # the numbers are converted to strings

[1] "1"     "2"     "three"

c(1, 2, TRUE, FALSE) # TRUE is converted to 1 and FALSE to 0

[1] 1 2 1 0

c(1.1, 2.4, TRUE, FALSE)

[1] 1.1 2.4 1.0 0.0

c("one", 2.4, TRUE) # every element is converted to a string

[1] "one"  "2.4"  "TRUE"

# character to numeric
a1 <- c("89", "91", "102")
as.numeric(a1)

[1]  89  91 102

# logical to numeric
a2 <- c(TRUE, TRUE, FALSE)
as.numeric(a2)

[1] 1 1 0

# numeric to logical (0 becomes FALSE, any other number TRUE)
a3 <- c(-2, -1, 0, 1, 2)
as.logical(a3)

[1]  TRUE  TRUE FALSE  TRUE  TRUE

# numeric to character
as.character(a3)

[1] "-2" "-1" "0"  "1"  "2"
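One case the conversions above do not show is a string that cannot be read as a number. In the minimal sketch below, the vector a4 is made up for illustration. as.numeric returns NA for such an element and raises a warning, which is suppressed here to keep the output short.

```r
a4 <- c("89", "ninety-one", "102")
a4_num <- suppressWarnings(as.numeric(a4))  # "ninety-one" cannot be parsed
a4_num         # 89 NA 102
is.na(a4_num)  # FALSE TRUE FALSE
```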

1.3. Vector shorthand and recycling properties

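A vector of consecutive integers can be abbreviated with : (colon); for example, the sequence 1, 2, 3 can be written as 1:3 in R. Vectors in R also have recycling properties, which make element-wise arithmetic convenient: when two vectors differ in length, the shorter one is repeated to match the longer one.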

# basic expression of integer vector
c(1, 2, 3)

[1] 1 2 3

# simple expression
1:3

[1] 1 2 3

3:1

[1] 3 2 1

# shorter arguments are recycled
1:3 * 2

[1] 2 4 6

1:4 + 1:2

[1] 2 4 4 6

c(0.5, 1.5, 2.5, 3.5) * c(2, 1)

[1] 1.0 1.5 5.0 3.5

# warning (why?)
1:3 * 1:2

Warning: longer object length is not a multiple of shorter object length

[1] 1 4 3
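To see where the warning above comes from, the recycling can be written out by hand. The sketch below uses rep with length.out to stretch the shorter vector; the names long_v, short_v and recycled are illustrative only.

```r
long_v  <- 1:3
short_v <- 1:2
# repeat the shorter vector until it is as long as the longer one
recycled <- rep(short_v, length.out = length(long_v))
recycled                           # 1 2 1
long_v * recycled                  # 1 4 3, the same values as 1:3 * 1:2
length(long_v) %% length(short_v)  # 1, a nonzero remainder
```

Because 3 is not a multiple of 2, the last copy of the shorter vector is cut off partway, and that is the situation R warns about.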

1.4. Naming vector elements

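In R, each element of a vector can be given a name, either when the vector is created or afterwards with the names function. Names make a vector easier to interpret.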

y <- c("apple", "book", "cat")
y

[1] "apple" "book"  "cat"

y1 <- c(A="apple", B="book", C="cat")
# equivalent to
# y1 <- y
# names(y1) <- c("A", "B", "C")
y1

      A       B       C 
"apple"  "book"   "cat"

names(y1)

[1] "A" "B" "C"
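The equivalence noted in the comments above can be checked directly. This is a small sketch; y2 is a helper name introduced here only for the comparison.

```r
y  <- c("apple", "book", "cat")
y1 <- c(A = "apple", B = "book", C = "cat")  # names given at creation
y2 <- y
names(y2) <- c("A", "B", "C")                # names assigned afterwards
identical(y1, y2)  # TRUE
unname(y1)         # the same values with the names removed
```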

1.5. Selecting and sorting vector elements

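Elements are selected with [ ] (square brackets) combined with comparison operators (>, <, >=, <=, ==, !=), logical operators (&, |), and the minus sign (-). R also supports selecting elements by name (names).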

# 1st and 3rd elements of vector
x <- c(4.39, 2.11, 3.17)
x[c(1,3)]

[1] 4.39 3.17

x[c(2,3,1)]

[1] 2.11 3.17 4.39

order(x) # indices that sort the elements of x (ascending)

[1] 2 3 1

x[order(x)]

[1] 2.11 3.17 4.39

# remove the 1st element of the vector
y <- c("apple", "book", "cat")
y[c(-1)]

[1] "book" "cat"

# using comparison and logical operators
x > 3

[1]  TRUE FALSE  TRUE

which(x>3) # which indices are TRUE

[1] 1 3

x[which(x>3)]

[1] 4.39 3.17

x[x > 3] # simpler equivalent expression

[1] 4.39 3.17

y[y!="apple"]

[1] "book" "cat"

y1["A"]

      A 
"apple"

y1[y1=="apple"]

      A 
"apple"

names(y1)[y1 == "apple"]

[1] "A"
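The logical operators & and | are listed above but not used in the examples. The following sketch combines conditions with them and also sorts in descending order through the decreasing argument of order. The threshold values are arbitrary.

```r
x <- c(4.39, 2.11, 3.17)
x[x > 2.5 & x < 4]              # both conditions hold: 3.17
x[x < 2.5 | x > 4]              # either condition holds: 4.39 2.11
x[order(x, decreasing = TRUE)]  # largest first: 4.39 3.17 2.11
```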

1.6. Replacing and adding vector elements

Use [ ] to replace and add elements.

y <- c("apple", "book", "cat")
y[3] <- "car" # replace 3rd element
y

[1] "apple" "book"  "car"

x <- c(4.39, 2.11, 3.17)
x[c(1,3)] <- 0 # replace the 1st and 3rd elements with 0
x[4] <- 1.19 # add 1.19 as the 4th element
# equivalent to x <- c(x, 1.19)
x

[1] 0.00 2.11 0.00 1.19
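Replacement also works with a logical condition inside [ ], and assigning past the end of a vector fills the gap with NA. This is a minimal sketch; the value 9.9 is arbitrary.

```r
x <- c(4.39, 2.11, 3.17)
x[x > 3] <- 0   # replace every element greater than 3
x               # 0.00 2.11 0.00
x[5] <- 9.9     # position 4 does not exist yet, so it becomes NA
x               # 0.00 2.11 0.00 NA 9.90
```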

1.7. Introduction to data.frame objects

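A data frame (data.frame) is a generalization of the vector: it combines several vectors of the same length (not necessarily of the same class) column by column. It can be created with the data.frame function: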

x <- c(4.39, 2.11, 3.17)
y <- c("apple", "book", "cat")
z <- c(TRUE, FALSE, TRUE)
df <- data.frame(v1 = x, v2 = y, v3 = z)
df

    v1    v2    v3
1 4.39 apple  TRUE
2 2.11  book FALSE
3 3.17   cat  TRUE

str(df) # show the structure of each column of the object

'data.frame':   3 obs. of  3 variables:
 $ v1: num  4.39 2.11 3.17
 $ v2: Factor w/ 3 levels "apple","book",..: 1 2 3
 $ v3: logi  TRUE FALSE TRUE

head(df) # show the first 6 rows of the object

    v1    v2    v3
1 4.39 apple  TRUE
2 2.11  book FALSE
3 3.17   cat  TRUE

colnames(df) # show the column names of the object

[1] "v1" "v2" "v3"

rownames(df) # show the row names of the object

[1] "1" "2" "3"
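The str(df) output above lists v2 as a Factor, which is what data.frame produced by default before R 4.0.0. From R 4.0.0 on, strings stay character unless stringsAsFactors = TRUE is given. The sketch below sets the argument explicitly so that the result does not depend on the R version; df_chr and df_fct are names made up for this comparison.

```r
x <- c(4.39, 2.11, 3.17)
y <- c("apple", "book", "cat")
z <- c(TRUE, FALSE, TRUE)
df_chr <- data.frame(v1 = x, v2 = y, v3 = z, stringsAsFactors = FALSE)
df_fct <- data.frame(v1 = x, v2 = y, v3 = z, stringsAsFactors = TRUE)
class(df_chr$v2)  # "character"
class(df_fct$v2)  # "factor"
dim(df_chr)       # 3 3 (rows, columns)
```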

1.8. Selecting from a data.frame

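Use [ , ] to extract the contents of the object. The basic form is x[i, j], the value in the ith row and jth column of x; x[i, ] gives the ith row and x[, j] gives the jth column. Conditions can be used inside the brackets. In addition, $ (dollar sign) extracts a particular column of the object; try pressing Tab (auto-completion) after typing df$.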

df[1] # select the 1st column; the result is still a data.frame

    v1
1 4.39
2 2.11
3 3.17

df[, 1] # select the values of the 1st column as a vector

[1] 4.39 2.11 3.17

df[, "v1"]

[1] 4.39 2.11 3.17

df$v1

[1] 4.39 2.11 3.17

df[c("v2", "v3")]

     v2    v3
1 apple  TRUE
2  book FALSE
3   cat  TRUE

df[2, ] # select 2nd row

    v1   v2    v3
2 2.11 book FALSE

df[df$v1 > 3 & df$v3, "v2"]

[1] apple cat  
Levels: apple book cat
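Two further forms are often useful with [ , ]: selecting several columns together with a row condition, and keeping a single column as a data.frame with drop = FALSE. The sketch below rebuilds df so that it runs on its own; stringsAsFactors = FALSE and the names sub and one_col are choices made here, not part of the examples above.

```r
df <- data.frame(v1 = c(4.39, 2.11, 3.17),
                 v2 = c("apple", "book", "cat"),
                 v3 = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)
sub <- df[df$v1 > 3, c("v1", "v3")]  # rows with v1 > 3, columns v1 and v3
sub
one_col <- df[, "v1", drop = FALSE]  # drop = FALSE keeps the data.frame structure
class(one_col)                       # "data.frame"
df[["v2"]]                           # [[ ]] extracts a column as a vector, like $
```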

1.9. Combining data.frames

Use rbind (stack vertically) and cbind (join side by side) to combine data.frames.

x <- data.frame(Drama=c("我的自由年代", "回到愛以前"), 
                TV=c("三立", "台視"))

y <- data.frame(Drama=c("我的自由年代", "回到愛以前"),
                Date=c("2014-02-07", "2014-01-05"),
                Vol=c(12, NA),
                Rating=c(2.67, 2.58))

z <- data.frame(Drama=c("16個夏天", "妹妹"), 
                TV=c("公視", "台視"),
                Date=c("2014-11-01", "2014-10-10"),
                Vol=c(16, 7),
                Rating=c(2.30, 1.30))
x

         Drama   TV
1 我的自由年代 三立
2   回到愛以前 台視

y

         Drama       Date Vol Rating
1 我的自由年代 2014-02-07  12   2.67
2   回到愛以前 2014-01-05  NA   2.58

z

     Drama   TV       Date Vol Rating
1 16個夏天 公視 2014-11-01  16    2.3
2     妹妹 台視 2014-10-10   7    1.3

xy <- cbind(x, y[,-1])
rbind(xy, z)

         Drama   TV       Date Vol Rating
1 我的自由年代 三立 2014-02-07  12   2.67
2   回到愛以前 台視 2014-01-05  NA   2.58
3     16個夏天 公視 2014-11-01  16   2.30
4         妹妹 台視 2014-10-10   7   1.30

# the same in a single expression: rbind(cbind(x, y[,-1]),z)
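cbind pairs rows purely by position, so the combination above is only correct because x and y list the dramas in the same order. When that is not guaranteed, base R's merge matches rows on a key column instead. In the minimal sketch below, y_shuffled is a made-up reordering of two columns of y, used only to show the matching.

```r
x <- data.frame(Drama = c("我的自由年代", "回到愛以前"),
                TV = c("三立", "台視"),
                stringsAsFactors = FALSE)
# same dramas as in x, but listed in the opposite order
y_shuffled <- data.frame(Drama = c("回到愛以前", "我的自由年代"),
                         Rating = c(2.58, 2.67),
                         stringsAsFactors = FALSE)
xy_merged <- merge(x, y_shuffled, by = "Drama")  # match rows on the Drama column
xy_merged$Rating[xy_merged$Drama == "回到愛以前"]  # 2.58
```

Note that merge may return the rows in a different order from the input, because by default it sorts on the key column.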

1.10. Introduction to factor objects

When a vector holds a categorical variable (categorical data, for example gender or education level), it is defined as a factor in R.

# variable gender with 2 "male" entries and 3 "female" entries 
gender <- c(rep("male",2), rep("female", 3)) 
gender

[1] "male"   "male"   "female" "female" "female"

gender <- factor(gender)
gender

[1] male   male   female female female
Levels: female male

levels(gender)

[1] "female" "male"

as.numeric(gender) # 1=female, 2=male internally (alphabetically)

[1] 2 2 1 1 1

# reorder the levels and assign new labels to them
factor(gender, levels=c("male", "female"), labels=c("M", "F"))

[1] M M F F F
Levels: M F
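Two operations that often follow the creation of a factor are counting the observations per level and converting back to plain strings. This is a short sketch using base R's table, as.character and nlevels; counts and gender_chr are illustrative names.

```r
gender <- factor(c(rep("male", 2), rep("female", 3)))
counts <- table(gender)             # number of observations per level
counts
gender_chr <- as.character(gender)  # back to a plain character vector
class(gender_chr)                   # "character"
nlevels(gender)                     # 2
```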

1.11. Introduction to list objects

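The list is the most general object in R and can hold all of the objects above in a single object. A list is written much like a vector, except that each element can be an object of any type (vector, data.frame, list, ...). The basic way to extract a value is [[ ]] (double square brackets): x[[i]] is the ith element of the list x. If the elements of the list are named, they can also be extracted with $ (dollar sign).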

L <- list(x = c(1:5), y = c("a", "b", "c"), z = df)
L

$x
[1] 1 2 3 4 5

$y
[1] "a" "b" "c"

$z
    v1    v2    v3
1 4.39 apple  TRUE
2 2.11  book FALSE
3 3.17   cat  TRUE

# the dollar operator $ or [[ ]] can be used to retrieve a single element
L[[2]]

[1] "a" "b" "c"

L$y

[1] "a" "b" "c"

L[["z"]]

    v1    v2    v3
1 4.39 apple  TRUE
2 2.11  book FALSE
3 3.17   cat  TRUE

L[3]

$z
    v1    v2    v3
1 4.39 apple  TRUE
2 2.11  book FALSE
3 3.17   cat  TRUE

L[c(1, 3)]

$x
[1] 1 2 3 4 5

$z
    v1    v2    v3
1 4.39 apple  TRUE
2 2.11  book FALSE
3 3.17   cat  TRUE

L[c("x", "y")]

$x
[1] 1 2 3 4 5

$y
[1] "a" "b" "c"
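The difference between single and double brackets in the examples above is that [ ] returns a (smaller) list while [[ ]] returns the element itself. The sketch below makes this visible with class and also adds an element with $. The list L2 and the element name w are made up for illustration.

```r
L2 <- list(x = 1:5, y = c("a", "b", "c"))
class(L2["x"])    # "list": single brackets keep the list wrapper
class(L2[["x"]])  # "integer": double brackets return the element itself
L2[["y"]][2]      # "b": index into the extracted vector
L2$w <- TRUE      # add a new named element
names(L2)         # "x" "y" "w"
```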

# convert a list to a vector
unlist(L)

     x1      x2      x3      x4      x5      y1      y2      y3   z.v11 
    "1"     "2"     "3"     "4"     "5"     "a"     "b"     "c"  "4.39" 
  z.v12   z.v13   z.v21   z.v22   z.v23   z.v31   z.v32   z.v33 
 "2.11"  "3.17"     "1"     "2"     "3"  "TRUE" "FALSE"  "TRUE"

1.12. Special values

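NA, not available, usually denotes a missing value; it can be detected with the is.na() function
numeric(0), length(numeric(0)) = 0, a numeric object of length 0
Inf, infinity
NaN, not a number; it can be detected with is.nan() (is.na() also returns TRUE for NaN)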

NA # NA

[1] NA

c(1, NA, 4) + 1

[1]  2 NA  5

x <- c(4.39, 2.11, 3.17)
x[x>5] # numeric(0)

numeric(0)

100/0 # Inf

[1] Inf

-pi/0 #-Inf

[1] -Inf

0/0 # NaN

[1] NaN

Inf-Inf # NaN

[1] NaN
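The detection functions mentioned in this section can be compared side by side on one vector. This is a minimal sketch: the vector v is made up for illustration, and the last lines show the na.rm argument of mean, which skips missing values.

```r
v <- c(1, NA, NaN, Inf)
is.na(v)      # FALSE  TRUE  TRUE FALSE (NaN also counts as missing)
is.nan(v)     # FALSE FALSE  TRUE FALSE
is.finite(v)  #  TRUE FALSE FALSE FALSE
mean(c(1, NA, 4))                # NA: a missing value propagates
mean(c(1, NA, 4), na.rm = TRUE)  # 2.5
length(numeric(0))               # 0
```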

Challenge

Using R's replacement and indexing techniques, reshape the objects x and y into the table shown at the end of this section.

text.x <- c("Drama,        TV,   Date,       Vol,  Rating",
            "我的自由年代,  三立, 2014-02-07, 12,   2.67",
            "16個夏天,     公視,      11/01, 16,   2.30",
            "妹妹,         台視, 2014-10-10, 7,    1.30",
            "回到愛以前,    台視, 2014-01-05, NA,   2.58")
x <- read.table(text=text.x,  header = TRUE, sep=",", strip.white=TRUE)

text.y <- c("Drama,       TV,   Date,       variable, value",
            "喜歡.一個人, 三立, 2014-07-04, Vol,      7",
            "喜歡.一個人, 三立, 2014-07-04, Rating,   2.03",
            "徵婚啟事,    台視, 2014-11-28, Vol,      4",
            "徵婚啟事,    台視, 2014-11-28, Rating,   0.96")
y <- read.table(text=text.y,  header = TRUE, sep=",", strip.white=TRUE)

x

         Drama   TV       Date Vol Rating
1 我的自由年代 三立 2014-02-07  12   2.67
2     16個夏天 公視      11/01  16   2.30
3         妹妹 台視 2014-10-10   7   1.30
4   回到愛以前 台視 2014-01-05  NA   2.58

y

        Drama   TV       Date variable value
1 喜歡.一個人 三立 2014-07-04      Vol  7.00
2 喜歡.一個人 三立 2014-07-04   Rating  2.03
3    徵婚啟事 台視 2014-11-28      Vol  4.00
4    徵婚啟事 台視 2014-11-28   Rating  0.96

Drama	TV	Date	Vol	Rating
我的自由年代	三立	2014-02-07	12	2.67
回到愛以前	台視	2014-01-05	NA	2.58
16個夏天	公視	2014-11-01	16	2.30
喜歡.一個人	三立	2014-07-04	7	2.03
妹妹	台視	2014-10-10	7	1.30
徵婚啟事	台視	2014-11-28	4	0.96