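Sometimes we want to create an extra variable that adds information to the data we already have, because a derived variable can add value to an analysis. This is especially useful in feature engineering: if we know something that may affect the response, we prefer to include it as a variable, so we build it from the existing data. For example, we can create a new variable by applying a condition to another one, such as a binary variable that marks a row as good when its frequency meets a certain criterion.
Consider the following data frame: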
set.seed(100)
Group <- rep(c("A", "B", "C", "D", "E"), times = 4)
Frequency <- sample(20:30, 20, replace = TRUE)
df1 <- data.frame(Group, Frequency)
df1
Output
   Group Frequency
1      A        29
2      B        26
3      C        25
4      D        22
5      E        28
6      A        29
7      B        26
8      C        25
9      D        25
10     E        23
11     A        26
12     B        25
13     C        21
14     D        26
15     E        26
16     A        26
17     B        30
18     C        27
19     D        21
20     E        22
Create a column Category with two values, Good and Bad, where Good is assigned to the rows whose Frequency is greater than 25:
df1$Category <- ifelse(df1$Frequency > 25, "Good", "Bad")
df1
Output
   Group Frequency Category
1      A        29     Good
2      B        26     Good
3      C        25      Bad
4      D        22      Bad
5      E        28     Good
6      A        29     Good
7      B        26     Good
8      C        25      Bad
9      D        25      Bad
10     E        23      Bad
11     A        26     Good
12     B        25      Bad
13     C        21      Bad
14     D        26     Good
15     E        26     Good
16     A        26     Good
17     B        30     Good
18     C        27     Good
19     D        21      Bad
20     E        22      Bad
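To see what ifelse() does element by element, the sketch below applies the same condition to a small hand-written vector. The names freq, category and category2 are hypothetical and used only for illustration. It also shows one possible alternative, logical indexing, which should give the same labels.

```r
# Hypothetical small vector, written by hand so the result is easy to check
freq <- c(29, 25, 22, 26)

# ifelse() tests every element and returns "Good" where the test is TRUE,
# "Bad" where it is FALSE
category <- ifelse(freq > 25, "Good", "Bad")

# Alternative sketch: start with the default label everywhere,
# then overwrite the positions where the condition holds
category2 <- rep("Bad", length(freq))
category2[freq > 25] <- "Good"

print(category)
print(identical(category, category2))
```

Note that 25 itself is labelled Bad, because the comparison is strictly greater than; use >= if the boundary value should count as Good.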
Let's look at another example:
Class <- rep(c("Lower", "Middle", "Upper Middle", "Higher"), times = 5)
Ratings <- sample(1:10, 20, replace = TRUE)
df2 <- data.frame(Class, Ratings)
df2
Output
          Class Ratings
1         Lower       3
2        Middle       8
3  Upper Middle       2
4        Higher       9
5         Lower       2
6        Middle       3
7  Upper Middle       4
8        Higher       4
9         Lower       4
10       Middle       5
11 Upper Middle       7
12       Higher       9
13        Lower       4
14       Middle       2
15 Upper Middle       6
16       Higher       7
17        Lower       1
18       Middle       6
19 Upper Middle       9
20       Higher       9
df2$Group <- ifelse(df2$Ratings > 5, "Royal", "Standard")
df2
Output
          Class Ratings    Group
1         Lower       3 Standard
2        Middle       8    Royal
3  Upper Middle       2 Standard
4        Higher       9    Royal
5         Lower       2 Standard
6        Middle       3 Standard
7  Upper Middle       4 Standard
8        Higher       4 Standard
9         Lower       4 Standard
10       Middle       5 Standard
11 Upper Middle       7    Royal
12       Higher       9    Royal
13        Lower       4 Standard
14       Middle       2 Standard
15 Upper Middle       6    Royal
16       Higher       7    Royal
17        Lower       1 Standard
18       Middle       6    Royal
19 Upper Middle       9    Royal
20       Higher       9    Royal
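Both examples follow the same pattern: compare one column against a threshold and store a label in a new column. As a rough sketch, that pattern could be wrapped in a small helper. The function add_condition_column, its arguments and the demo data frame are assumptions made for illustration and are not part of the examples above.

```r
# Hypothetical helper: adds a column `new_col` to `df`, labelled `yes` where
# df[[source_col]] is greater than `threshold` and `no` otherwise
add_condition_column <- function(df, source_col, new_col, threshold, yes, no) {
  df[[new_col]] <- ifelse(df[[source_col]] > threshold, yes, no)
  df
}

# Small hand-written data frame, used only to demonstrate the helper
demo <- data.frame(Class = c("Lower", "Middle", "Higher"),
                   Ratings = c(3, 8, 5))
demo <- add_condition_column(demo, "Ratings", "Group", 5, "Royal", "Standard")
print(demo)
```

Using df[[...]] with column names as strings lets the same helper serve both df1 and df2, for example add_condition_column(df1, "Frequency", "Category", 25, "Good", "Bad").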