R包基础实操—tidyverse包

核心软件包是ggplot2、dplyr、tidyr、readr、purrr、tibble、stringr和forcats,它们提供了建模、转换和可视化数据的功能。
其中,readr包用于读取数据,tidyr包用于整理数据,dplyr包用于数据转换,ggplot2包用于数据可视化,purrr包用于函数式编程。
1 readr包:快速读写
1-1 readr包提供了几个新函数,能够更快的读取文件
readr包中的主要的函数有:
read_csv,read_tsv,read_table,read_delim,write_csv,write_tsv,write_excel_csv,write_delim
library(tidyverse)
library(readr)
library(feather)
library(Seurat)
metadata <- pbmc_small@meta.data
# If you add an extension to the file name, write_()* will automatically compress the output.
write_csv(metadata, 'metadata.csv')
write_tsv(metadata, 'metadata2.csv')
write_tsv(metadata, 'metadata.tsv.gz')
# write_rds()联合read_rds()使用,write_rds()将数据保存为自定义的二进制形式(RDS格式)
write_rds(metadata,"challenge.rds")
head(read_rds("challenge.rds"))
## orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
## ATGCCAGAACGACT SeuratProject 70 47 0
## CATGGCCTGTGCAT SeuratProject 85 52 0
## GAACCTGATGAACC SeuratProject 87 50 1
## TGACTGGATTCTCA SeuratProject 127 56 0
## AGTCAGACTGCACA SeuratProject 173 53 0
## TCTGATACACGTGT SeuratProject 70 48 0
## letter.idents groups RNA_snn_res.1
## ATGCCAGAACGACT A g2 0
## CATGGCCTGTGCAT A g1 0
## GAACCTGATGAACC B g2 0
## TGACTGGATTCTCA A g2 0
## AGTCAGACTGCACA A g2 0
## TCTGATACACGTGT A g1 0
# saveRDS()联合readRDS()使用,saveRDS()将数据保存为自定义的压缩后的二进制形式(RDS格式)
saveRDS(metadata, "challenge.rds")
metadata <- readRDS("challenge.rds")
# write_rds(metadata, "compressed_mtc.rds", "xz", compression = 9L)
# write_rds(metadata,"challenge.rds", compress = 'gz') = saveRDS(metadata, "challenge.rds", compress = TRUE)
# feather包也是实现一种二进制形式,可以在多个编程语言之间共享;相比于RDS,速度更快。
write_feather(metadata,"challenge.feather")
read_feather("challenge.feather")
## # A tibble: 80 x 7
## orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
## <fct> <dbl> <int> <fct> <fct> <chr>
## 1 SeuratProject 70 47 0 A g2
## 2 SeuratProject 85 52 0 A g1
## 3 SeuratProject 87 50 1 B g2
## 4 SeuratProject 127 56 0 A g2
## 5 SeuratProject 173 53 0 A g2
## 6 SeuratProject 70 48 0 A g1
## 7 SeuratProject 64 36 0 A g1
## 8 SeuratProject 72 45 0 A g1
## 9 SeuratProject 52 36 0 A g1
## 10 SeuratProject 100 41 0 A g1
## # ... with 70 more rows, and 1 more variable: RNA_snn_res.1 <fct>
1-2 将R数据写入Excel
write.xlsx(x, file, sheetName = “Sheet1”, col.names = TRUE, row.names = TRUE, append = FALSE)write.xlsx2(x, file, sheetName = “Sheet1”, col.names = TRUE, row.names = TRUE, append = FALSE)
library("xlsx")
# Write the first data set in a new workbook
write.xlsx(USArrests, file = "myworkbook.xlsx", sheetName = "USA-ARRESTS", append = FALSE)
# Add a second data set in a new worksheet
write.xlsx(mtcars, file = "myworkbook.xlsx", sheetName="MTCARS", append=TRUE)
# Add a third data set
write.xlsx(iris, file = "myworkbook.xlsx", sheetName="IRIS", append=TRUE)
2 tidyr包:整理数据
2-1 tidyr包提供了几个新函数,能够有效整理数据
gather(): makes “wide” data longerspread(): makes “long” data widerseparate(): splits a single column into multiple columnsunite(): combines multiple columns into a single column
library(tidyr)
library(dplyr)
DF <- data.frame(Group=rep(1:3, each=4), Year=rep(2006:2009, times=3),
Qtr.1 = rep(seq(14, 20, 2), 3), Qtr.2 = rep(seq(12, 18, 2), 3),
Qtr.3 = rep(seq(16, 22, 2), 3), Qtr.4 = rep(seq(18, 24, 2), 3))
long_DF <- DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
head(long_DF)
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 14
## 2 1 2007 Qtr.1 16
## 3 1 2008 Qtr.1 18
## 4 1 2009 Qtr.1 20
## 5 2 2006 Qtr.1 14
## 6 2 2007 Qtr.1 16
wide_DF <- long_DF %>% spread(Quarter, Revenue)
head(wide_DF, 10)
## Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1 1 2006 14 12 16 18
## 2 1 2007 16 14 18 20
## 3 1 2008 18 16 20 22
## 4 1 2009 20 18 22 24
## 5 2 2006 14 12 16 18
## 6 2 2007 16 14 18 20
## 7 2 2008 18 16 20 22
## 8 2 2009 20 18 22 24
## 9 3 2006 14 12 16 18
## 10 3 2007 16 14 18 20
separate_DF <- long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
head(separate_DF, 10)
## Group Year Time_Interval Interval_ID Revenue
## 1 1 2006 Qtr 1 14
## 2 1 2007 Qtr 1 16
## 3 1 2008 Qtr 1 18
## 4 1 2009 Qtr 1 20
## 5 2 2006 Qtr 1 14
## 6 2 2007 Qtr 1 16
## 7 2 2008 Qtr 1 18
## 8 2 2009 Qtr 1 20
## 9 3 2006 Qtr 1 14
## 10 3 2007 Qtr 1 16
unite_DF <- separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")
head(unite_DF, 10)
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 14
## 2 1 2007 Qtr.1 16
## 3 1 2008 Qtr.1 18
## 4 1 2009 Qtr.1 20
## 5 2 2006 Qtr.1 14
## 6 2 2007 Qtr.1 16
## 7 2 2008 Qtr.1 18
## 8 2 2009 Qtr.1 20
## 9 3 2006 Qtr.1 14
## 10 3 2007 Qtr.1 16
DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4) %>% head(2)
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 14
## 2 1 2007 Qtr.1 16
DF %>% gather(Quarter, Revenue, -Group, -Year) %>% head(2)
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 14
## 2 1 2007 Qtr.1 16
DF %>% gather(Quarter, Revenue, 3:6) %>% head(2)
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 14
## 2 1 2007 Qtr.1 16
DF %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4) %>% head(2)
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 14
## 2 1 2007 Qtr.1 16
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID")) %>% head(2)
## Group Year Time_Interval Interval_ID Revenue
## 1 1 2006 Qtr 1 14
## 2 1 2007 Qtr 1 16
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.") %>% head(2)
## Group Year Time_Interval Interval_ID Revenue
## 1 1 2006 Qtr 1 14
## 2 1 2007 Qtr 1 16
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = "_") %>% head(2)
## Group Year Quarter Revenue
## 1 1 2006 Qtr_1 14
## 2 1 2007 Qtr_1 16
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID) %>% head(2)
## Group Year Quarter Revenue
## 1 1 2006 Qtr_1 14
## 2 1 2007 Qtr_1 16
3 dplyr包:有效数据操作
3-1 tidyr包提供了几个新函数,能够有效操作数据
filter()picks cases based on their values.select()picks variables based on their names.arrange()changes the ordering of the rows.mutate()adds new variables that are functions of existing variables.summarise()reduces multiple values down to a single summary.
Rows:
filter()chooses rows based on column values.arrange()changes the order of the rows.slice()chooses rows based on location.
Columns:
select()changes whether or not a column is included.rename()changes the name of columns.mutate()changes the values of columns and creates new columns.relocate()changes the order of the columns.
Groups of rows:
summarise()collapses a group into a single row.
library(dplyr)
head(starwars)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, grey light blue 52 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
starwars %>%
dplyr::select(name, ends_with("color"))
## # A tibble: 87 x 4
## name hair_color skin_color eye_color
## <chr> <chr> <chr> <chr>
## 1 Luke Skywalker blond fair blue
## 2 C-3PO <NA> gold yellow
## 3 R2-D2 <NA> white, blue red
## 4 Darth Vader none white yellow
## 5 Leia Organa brown light brown
## 6 Owen Lars brown, grey light blue
## 7 Beru Whitesun lars brown light blue
## 8 R5-D4 <NA> white, red red
## 9 Biggs Darklighter black light brown
## 10 Obi-Wan Kenobi auburn, white fair blue-gray
## # ... with 77 more rows
starwars %>%
mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%
dplyr::filter(species == "Droid") %>%
dplyr::select(name:mass, bmi) %>%
arrange(desc(mass))
## # A tibble: 6 x 4
## name height mass bmi
## <chr> <int> <dbl> <dbl>
## 1 IG-88 200 140 35
## 2 C-3PO 167 75 26.9
## 3 R2-D2 96 32 34.7
## 4 R5-D4 97 32 34.0
## 5 R4-P17 96 NA NA
## 6 BB8 NA NA NA
starwars %>%
group_by(species) %>%
summarise(n = n(), mass = mean(mass, na.rm = TRUE)) %>%
filter(n > 1, mass > 50)
## # A tibble: 8 x 3
## species n mass
## <chr> <int> <dbl>
## 1 Droid 6 69.8
## 2 Gungan 3 74
## 3 Human 35 82.8
## 4 Kaminoan 2 88
## 5 Mirialan 2 53.1
## 6 Twi'lek 2 55
## 7 Wookiee 2 124
## 8 Zabrak 2 80
4 purrr包:函数式编程
用R写循环从低到高有三种境界:手动 for 循环,apply 函数族,purrr 包泛函式编程。
map(.x, .f, …): Apply a function to each element of a list or vector. map(x, is.logical)map2(.x, .y, .f, …): Apply a function to pairs of elements from two lists, vectors. map2(x, y, sum)pmap(.l, .f, …): Apply a function to groups of elements from list of lists, vectors. pmap(list(x, y, z), sum, na.rm = TRUE)

map系列函数的返回值如下:
map_chr(.x, .f): 返回字符型向量map_lgl(.x, .f): 返回逻辑型向量map_dbl(.x, .f): 返回实数型向量map_int(.x, .f): 返回整数型向量map_dfr(.x, .f): 返回数据框列表,再 bind_rows 按行合并为一个数据框map_dfc(.x, .f): 返回数据框列表,再 bind_cols 按列合并为一个数据框
library(purrr)
infos <- tibble(
born=c(1990, 1992, 2000, 1985),
family=c("张", "李", "王", "赵"),
name=c("三", "四", "五", "六"))
infos
## # A tibble: 4 x 3
## born family name
## <dbl> <chr> <chr>
## 1 1990 张 三
## 2 1992 李 四
## 3 2000 王 五
## 4 1985 赵 六
infos2 <- infos
age <- purrr::map(infos$born, function(x){2020-x}) %>% unlist()
infos$age <- age
infos
## # A tibble: 4 x 4
## born family name age
## <dbl> <chr> <chr> <dbl>
## 1 1990 张 三 30
## 2 1992 李 四 28
## 3 2000 王 五 20
## 4 1985 赵 六 35
fullname <- purrr::map2(infos$family, infos$name, function(x, y){paste0(x,y)}) %>% unlist()
infos$fullname <- fullname
infos
## # A tibble: 4 x 5
## born family name age fullname
## <dbl> <chr> <chr> <dbl> <chr>
## 1 1990 张 三 30 张三
## 2 1992 李 四 28 李四
## 3 2000 王 五 20 王五
## 4 1985 赵 六 35 赵六
fullname <- purrr::pmap(list(x=infos$family, y=infos$name, z=infos$born), function(x, y, z) paste0(x, y, z)) %>% unlist()
infos$fullname2 <- fullname
infos
## # A tibble: 4 x 6
## born family name age fullname fullname2
## <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 1990 张 三 30 张三 张三1990
## 2 1992 李 四 28 李四 李四1992
## 3 2000 王 五 20 王五 王五2000
## 4 1985 赵 六 35 赵六 赵六1985
#
purrr::pmap(list(x=infos$born), function(x){2020-x}) %>% unlist()
## [1] 30 28 20 35
purrr::pmap(list(x=infos$born, y=infos$name), function(x, y){paste0(x,y)}) %>% unlist()
## [1] "1990三" "1992四" "2000五" "1985六"
purrr::pmap(list(x=infos$family, y=infos$name, z=infos$born), function(x, y, z) paste0(x, y, z)) %>% unlist()
## [1] "张三1990" "李四1992" "王五2000" "赵六1985"
names(infos2) <- c('x', 'y', 'z')
infos2
## # A tibble: 4 x 3
## x y z
## <dbl> <chr> <chr>
## 1 1990 张 三
## 2 1992 李 四
## 3 2000 王 五
## 4 1985 赵 六
purrr::pmap(infos2['x'], function(x){2020-x}) %>% unlist()
## [1] 30 28 20 35
purrr::pmap(infos2[c('x', 'y')], function(x, y){paste0(x,y)}) %>% unlist()
## [1] "1990张" "1992李" "2000王" "1985赵"
purrr::pmap(infos2, function(x, y, z){paste0(x,y, z)}) %>% unlist()
## [1] "1990张三" "1992李四" "2000王五" "1985赵六"
比较匿名函数和公式
df <- mtcars %>%
dplyr::select(mpg, cyl, wt) %>%
group_nest(cyl)
# formula
df %>% mutate(model = map(data, ~ lm(mpg ~ wt, data = .x) ))
## # A tibble: 3 x 3
## cyl data model
## <dbl> <list<tibble[,2]>> <list>
## 1 4 [11 x 2] <lm>
## 2 6 [7 x 2] <lm>
## 3 8 [14 x 2] <lm>
map_dbl(mtcars, ~ length(unique(.x)))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 25 3 27 22 22 29 30 2 2 3 6
# anonymous function
df %>% mutate(model = map(data, function(x) {lm(mpg ~ wt, x)} ))
## # A tibble: 3 x 3
## cyl data model
## <dbl> <list<tibble[,2]>> <list>
## 1 4 [11 x 2] <lm>
## 2 6 [7 x 2] <lm>
## 3 8 [14 x 2] <lm>
map_dbl(mtcars, function(x) length(unique(x)))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 25 3 27 22 22 29 30 2 2 3 6
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map("r.squared")
## $`4`
## [1] 0.5086326
##
## $`6`
## [1] 0.4645102
##
## $`8`
## [1] 0.4229655
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
## 4 6 8
## 0.5086326 0.4645102 0.4229655
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_dfr("r.squared")
## # A tibble: 1 x 3
## `4` `6` `8`
## <dbl> <dbl> <dbl>
## 1 0.509 0.465 0.423
mtcars %>%
mutate(cyl = factor(cyl),
am = factor(am)) %>%
dplyr::select(mpg, disp, hp) %>%
map(~ aov(.x ~ cyl * am, data = mtcars)) %>%
map_dfr(~ broom::tidy(.), .id = 'source') %>%
mutate(p.value = round(p.value, 5))
## # A tibble: 12 x 7
## source term df sumsq meansq statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mpg cyl 1 818. 818. 94.6 0
## 2 mpg am 1 37.0 37.0 4.28 0.0479
## 3 mpg cyl:am 1 29.4 29.4 3.41 0.0755
## 4 mpg Residuals 28 242. 8.64 NA NA
## 5 disp cyl 1 387454. 387454. 138. 0
## 6 disp am 1 9405. 9405. 3.35 0.0779
## 7 disp cyl:am 1 688. 688. 0.245 0.624
## 8 disp Residuals 28 78637. 2808. NA NA
## 9 hp cyl 1 100984. 100984. 91.3 0
## 10 hp am 1 7378. 7378. 6.67 0.0153
## 11 hp cyl:am 1 6403. 6403. 5.79 0.0230
## 12 hp Residuals 28 30961. 1106. NA NA
参考资料
R如何实现更快读取数据——使用redr包: https://www.jianshu.com/p/71b4fd0f0a19
[2]
Writing Data From R to Excel Files (xls|xlsx): http://www.sthda.com/english/wiki/writing-data-from-r-to-excel-files-xls-xlsx#writing-excel-files-using-xlsx-package
[3]
Reshaping Your Dat with tidyr: https://uc-r.github.io/tidyr
[4]
数据重塑之tidyr包: https://zhuanlan.zhihu.com/p/22265154
[5]
Introduction to dplyr: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
[6]
dplyr包: https://www.jianshu.com/p/f8b9e6bd52a2
[7]
dplyr新功能解读: https://zhuanlan.zhihu.com/p/145839517
[8]
优雅的循环迭代:purrr包: https://zhuanlan.zhihu.com/p/168772624
[9]
R语言| 向量化操作purrr包: https://www.huaweicloud.com/articles/f522c9f56cf2d8cca5f7b390aa3f2d7c.html
[10]
tidyverse简介与管道: https://zhuanlan.zhihu.com/p/243376822
[11]
R语言编程——基于tidyverse: https://zhuanlan.zhihu.com/p/198185888

