Data Analysis: Introduction to R

This section introduces the R programming language. R can be downloaded from the CRAN website. Windows users should also install Rtools and the RStudio IDE.

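The general idea behind R is to serve as an interface to other software developed in compiled languages such as C, C++, and Fortran, and to give the user an interactive tool for analyzing data.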

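Navigate to the folder bda/part2/R_introduction of the book's zip file and open the R_introduction.Rproj file. This opens an RStudio session. Then open the 01_vectors.R file. Run the script line by line and follow the comments in the code. Another useful way to learn is to type the code yourself, which helps you get used to R syntax. In R, comments are written with the # symbol.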

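To show the results of running R code, the book writes the output that R returns as a comment right after the code that produced it. This way you can copy and paste the code from the book and try parts of it directly in R.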

# Create a vector of numbers 
numbers = c(1, 2, 3, 4, 5) 
print(numbers) 
# [1] 1 2 3 4 5  
# Create a vector of letters
ltrs = c('a', 'b', 'c', 'd', 'e')
print(ltrs)
# [1] "a" "b" "c" "d" "e"
# Concatenate both  
mixed_vec = c(numbers, ltrs) 
print(mixed_vec) 
# [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"

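Let's analyze what happened in the previous code. We can create vectors of numbers and of letters, and we did not need to tell R in advance which data type we wanted. Finally, we were able to create a vector containing both numbers and letters. The vector mixed_vec has coerced the numbers to character; we can see this because the values are printed inside quotes.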
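The coercion described above can be checked directly. The following is a minimal sketch; the variable name back is made up for illustration. It shows that c() converts its arguments to the most general type present, and that converting a character vector back to numbers only works for the elements that look like numbers.

```r
# c() coerces all elements to the most general type present
class(c(1, 2, 3))      # numbers only
# [1] "numeric"
class(c(1, "a"))       # number + character becomes character
# [1] "character"
class(c(1, TRUE))      # number + logical becomes numeric (TRUE is stored as 1)
# [1] "numeric"

# Converting back: digits convert cleanly, letters become NA
# (R emits a warning here, which is suppressed for the example)
mixed_vec = c(1:5, letters[1:5])
back = suppressWarnings(as.numeric(mixed_vec))
back
# [1]  1  2  3  4  5 NA NA NA NA NA
```
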

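The following code shows the data type of different vectors as returned by the function class. The class function is commonly used to "interrogate" an object, asking it what its class is.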

### Evaluate the data types using class
### One dimensional objects 
# Integer vector 
num = 1:10 
class(num) 
# [1] "integer"  
# Numeric vector, it has a float, 10.5 
num = c(1:10, 10.5) 
class(num) 
# [1] "numeric"  
# Character vector 
ltrs = letters[1:10] 
class(ltrs) 
# [1] "character"  
# Factor vector 
fac = as.factor(ltrs) 
class(fac) 
# [1] "factor"
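Besides class, base R offers is.* predicates that answer yes-or-no questions about an object. The sketch below reuses the vectors from the listing above and also looks inside a factor, which is stored as integer codes plus a set of labels called levels.

```r
num = c(1:10, 10.5)
ltrs = letters[1:10]
fac = as.factor(ltrs)

# is.* functions return TRUE or FALSE instead of a class name
is.numeric(num)
# [1] TRUE
is.character(ltrs)
# [1] TRUE
is.factor(fac)
# [1] TRUE

# A factor is stored as integer codes that point to its levels
typeof(fac)
# [1] "integer"
levels(fac)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
as.integer(fac)
# [1]  1  2  3  4  5  6  7  8  9 10
```
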

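R also supports two-dimensional objects. The following code shows examples of the two most popular data structures used in R: the matrix and the data.frame.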

# Matrix
M = matrix(1:12, ncol = 4) 
#      [,1] [,2] [,3] [,4] 
# [1,]    1    4    7   10 
# [2,]    2    5    8   11 
# [3,]    3    6    9   12 
lM = matrix(letters[1:12], ncol = 4) 
#     [,1] [,2] [,3] [,4] 
# [1,] "a"  "d"  "g"  "j"  
# [2,] "b"  "e"  "h"  "k"  
# [3,] "c"  "f"  "i"  "l"   
# Coerces the numbers to character 
# cbind concatenates two matrices (or vectors) in one matrix 
cbind(M, lM) 
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] 
# [1,] "1"  "4"  "7"  "10" "a"  "d"  "g"  "j"  
# [2,] "2"  "5"  "8"  "11" "b"  "e"  "h"  "k"  
# [3,] "3"  "6"  "9"  "12" "c"  "f"  "i"  "l"   
class(M)
# [1] "matrix" "array"   (R >= 4.0.0; older versions return only "matrix")
class(lM)
# [1] "matrix" "array"
# data.frame 
# One of the main objects of R, handles different data types in the same object.  
# It is possible to have numeric, character and factor vectors in the same data.frame  
df = data.frame(n = 1:5, l = letters[1:5]) 
df 
#   n l 
# 1 1 a 
# 2 2 b 
# 3 3 c 
# 4 4 d 
# 5 5 e
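To confirm that each column of a data.frame keeps its own type, while a matrix holds a single type, one can inspect the columns individually. This is a small sketch; stringsAsFactors = FALSE is set explicitly here (an addition to the listing above) so that the result does not depend on the R version.

```r
df = data.frame(n = 1:5, l = letters[1:5], stringsAsFactors = FALSE)

# Each column keeps its own class
class(df$n)
# [1] "integer"
class(df$l)
# [1] "character"

# Rows and columns of the data.frame
dim(df)
# [1] 5 2

# Converting to a matrix forces a single type for all cells
typeof(as.matrix(df))
# [1] "character"
```
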

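As the previous example shows, it is possible to use different data types in the same object. In general, this is how data is presented in databases and APIs: part of the data is text or character vectors, and other parts are numeric. The analyst's job is to determine which statistical data type to assign and then use the correct R data type for it. In statistics, we normally consider variables to be of the following types:

- Numeric
- Nominal or categorical
- Ordinal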

In R, a vector can be of the following classes:

- Numeric or integer
- Factor
- Ordered factor

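R provides a data type for each statistical type of variable. Ordered factors are rarely used in practice, but they can be created with the function factor (with ordered = TRUE) or with the function ordered.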
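The difference between a nominal and an ordinal variable can be sketched as follows. The sizes vector is hypothetical data invented for this illustration; factor and ordered are the base R functions mentioned above.

```r
# Hypothetical survey answers, used only for illustration
sizes = c("small", "large", "medium", "small", "large")

# Nominal: a plain factor; levels are sorted alphabetically, no order is implied
nominal = factor(sizes)
levels(nominal)
# [1] "large"  "medium" "small"

# Ordinal: state the levels explicitly and mark them as ordered
ordinal = factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
class(ordinal)
# [1] "ordered" "factor"
ordinal
# [1] small  large  medium small  large
# Levels: small < medium < large

# Comparisons are meaningful for ordered factors
ordinal < "large"
# [1]  TRUE FALSE  TRUE  TRUE FALSE

# ordered() is a shortcut for factor(..., ordered = TRUE)
identical(ordinal, ordered(sizes, levels = c("small", "medium", "large")))
# [1] TRUE
```
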

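The following section covers the concept of indexing. This is a very common operation that deals with selecting parts of an object and transforming them.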

# Let's create a data.frame
# stringsAsFactors = TRUE was the default before R 4.0.0; it is set explicitly
# here so that the output below is the same in every R version
df = data.frame(numbers = 1:26, letters, stringsAsFactors = TRUE)
head(df)
#   numbers letters
# 1       1       a
# 2       2       b
# 3       3       c
# 4       4       d
# 5       5       e
# 6       6       f
# str gives the structure of a data.frame; it's a good summary to inspect an object
str(df)
# 'data.frame': 26 obs. of  2 variables:
#  $ numbers: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# The latter shows the letters character vector was coerced to a factor.
# This is explained by the stringsAsFactors = TRUE argument in data.frame;
# read ?data.frame for more information
class(df) 
# [1] "data.frame"  
### Indexing
# Get the first row 
df[1, ] 
#   numbers letters
# 1       1       a
# Normally used in programming: returns the output as a list
df[1, , drop = TRUE]
# $numbers 
# [1] 1 
#  
# $letters 
# [1] a 
# Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z  
# Get several rows of the data.frame 
df[5:7, ] 
#   numbers letters
# 5       5       e 
# 6       6       f 
# 7       7       g  
### Add one column that mixes the numeric column with the factor column 
df$mixed = paste(df$numbers, df$letters, sep = '')
str(df) 
# 'data.frame': 26 obs. of  3 variables: 
# $ numbers: int  1 2 3 4 5 6 7 8 9 10 ...
# $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ... 
# $ mixed  : chr  "1a" "2b" "3c" "4d" ...  
### Get columns 
# Get the first column 
df[, 1]  
# It returns a one dimensional vector with that column  
# Get two columns 
df2 = df[, 1:2] 
head(df2)  
#      numbers  letters 
# 1       1       a 
# 2       2       b 
# 3       3       c 
# 4       4       d 
# 5       5       e 
# 6       6       f  
# Get the first and third columns 
df3 = df[, c(1, 3)] 
df3[1:3, ]  
#   numbers mixed
# 1       1    1a
# 2       2    2b
# 3       3    3c
### Index columns from their names 
names(df) 
# [1] "numbers" "letters" "mixed"   
# This is the best practice in programming, as indices often change, but
# variable names don't
# We create a variable with the names we want to subset 
keep_vars = c("numbers", "mixed") 
df4 = df[, keep_vars]  
head(df4) 
#   numbers mixed
# 1       1    1a
# 2       2    2b
# 3       3    3c
# 4       4    4d
# 5       5    5e
# 6       6    6f
### subset rows and columns 
# Keep the first five rows 
df5 = df[1:5, keep_vars] 
df5 
#      numbers  mixed 
# 1       1     1a 
# 2       2     2b
# 3       3     3c 
# 4       4     4d 
# 5       5     5e  
# subset rows using a logical condition 
df6 = df[df$numbers < 10, keep_vars] 
df6 
#      numbers  mixed 
# 1       1     1a 
# 2       2     2b 
# 3       3     3c 
# 4       4     4d 
# 5       5     5e 
# 6       6     6f 
# 7       7     7g 
# 8       8     8h 
# 9       9     9i
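
A few other indexing forms are common in everyday R code and may be worth trying alongside the listing above. The following is a sketch rather than part of the original script; it rebuilds a small data.frame (with stringsAsFactors = FALSE, an assumption made to keep the letters as plain character values) and shows the $ and [[ ]] extractors, negative indices, and combined logical conditions.

```r
df = data.frame(numbers = 1:26, letters = letters, stringsAsFactors = FALSE)

# $ and [[ ]] both extract a single column as a vector
identical(df$numbers, df[["numbers"]])
# [1] TRUE

# Single-bracket indexing by name keeps the data.frame structure
class(df["numbers"])
# [1] "data.frame"

# Negative indices drop rows (or columns) instead of keeping them
nrow(df[-(1:20), ])
# [1] 6

# Combine logical conditions with & (and) or | (or)
sel = df[df$numbers > 3 & df$letters %in% c("a", "e", "z"), ]
sel
#    numbers letters
# 5        5       e
# 26      26       z
```
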
