
Deduplicating and Sorting Text on Linux

baijin 2024-12-16 11:17:42

Day-to-day work often involves deduplicating or sorting text. This article covers the commands commonly used for that on Linux: sort, uniq, and awk.

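1. Sorting text

On Linux, text is usually sorted with the sort command, which can order whole lines or order by delimiter-separated fields. The examples use a file test.txt with the following contents: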

1 b
3 a
5 d
3 d
4 c
2 e
10 g
1 a
3 d
2 e

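That is 10 lines of text, 2 columns each.

1.1 Sorting by line

By default, sort compares whole lines using the collation rules of the current locale. The output below appears to come from a non-C locale such as en_US.UTF-8, which is why "10 g" comes before "1 a"; strict ASCII (byte) order applies only in the C locale, for example when sort is run with LC_ALL=C.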

[test@localhost ~]$ sort test.txt 
10 g
1 a
1 b
2 e
2 e
3 a
3 d
3 d
4 c
5 d
[test@localhost ~]$ sort -r test.txt 
5 d
4 c
3 d
3 d
3 a
2 e
2 e
1 b
1 a
10 g

-r sorts in descending order.
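Because the ordering depends on the locale, it can help to pin the locale when a reproducible byte-wise order is wanted. The following is a minimal sketch, assuming a POSIX shell with mktemp available; the scratch directory and the variable name c_sorted are illustrative and not part of the original article.

```shell
# Recreate the sample file in a scratch directory (illustrative path).
workdir=$(mktemp -d)
printf '%s\n' '1 b' '3 a' '5 d' '3 d' '4 c' '2 e' '10 g' '1 a' '3 d' '2 e' > "$workdir/test.txt"

# Force the C locale so that lines are compared byte by byte.
# A space (0x20) sorts before '0' (0x30), so "1 a" and "1 b" precede "10 g".
c_sorted=$(LC_ALL=C sort "$workdir/test.txt")
printf '%s\n' "$c_sorted"

# Clean up the scratch directory.
rm -r "$workdir"
```

With the C locale forced, the first three lines printed are 1 a, 1 b, and 10 g, in contrast to the transcript above where 10 g comes first.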

1.2 Sorting by column

To sort by column, set the field separator with -t and choose the sort key with -k.

[test@localhost ~]$ sort -t' ' -k1 test.txt 
10 g
1 a
1 b
2 e
2 e
3 a
3 d
3 d
4 c
5 d
[test@localhost ~]$ sort -t' ' -k1n test.txt 
1 a
1 b
2 e
2 e
3 a
3 d
3 d
4 c
5 d
10 g
[test@localhost ~]$ sort -t' ' -k1nr test.txt 
10 g
5 d
4 c
3 a
3 d
3 d
2 e
2 e
1 a
1 b
[test@localhost ~]$ sort -t' ' -k2r test.txt 
10 g
2 e
2 e
3 d
3 d
5 d
4 c
1 b
1 a
3 a

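Command breakdown: in -k1nr, k1 selects the key starting at column 1, n compares numerically, and r sorts in descending order. Note that -k1 defines a key that runs from column 1 to the end of the line; to restrict the key to column 1 alone, write -k1,1.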
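Several -k options can be combined when ties on one column should be broken by another. The sketch below assumes the same ten-line test.txt, recreated in a scratch directory; the variable name multi_sorted is illustrative, and LC_ALL=C is set only to make the result reproducible.

```shell
# Recreate the sample file in a scratch directory (illustrative path).
workdir=$(mktemp -d)
printf '%s\n' '1 b' '3 a' '5 d' '3 d' '4 c' '2 e' '10 g' '1 a' '3 d' '2 e' > "$workdir/test.txt"

# -k1,1n : the first key is column 1 only, compared numerically, ascending.
# -k2,2r : ties on column 1 are broken by column 2 only, in descending order.
multi_sorted=$(LC_ALL=C sort -t' ' -k1,1n -k2,2r "$workdir/test.txt")
printf '%s\n' "$multi_sorted"

# Clean up the scratch directory.
rm -r "$workdir"
```

Here the rows with 1 in the first column come out as 1 b then 1 a, and the rows with 3 as 3 d, 3 d, 3 a, while 10 g moves to the end because the first key is numeric.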

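2. Deduplicating lines

On Linux, duplicate lines are usually removed with the uniq command.

For example, take a file test2.txt with the following contents: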

1 b
3 a
5 d
5 d
4 c
2 e
4 c

Deduplicating with uniq gives:

[test@localhost ~]$ uniq test2.txt 
1 b
3 a
5 d
4 c
2 e
4 c

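The result shows that uniq removes only adjacent duplicate lines. To remove every duplicate, sort the input first so that identical lines end up next to each other.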

[test@localhost ~]$ sort test2.txt | uniq
1 b
2 e
3 a
4 c
5 d

After sorting, uniq removes all duplicate lines.
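Two closely related variants may be useful here: sort -u, which sorts and deduplicates in one step, and uniq -d, which lists only the lines that were repeated. This is a small sketch, assuming the seven-line test2.txt recreated in a scratch directory; the variable names are illustrative.

```shell
# Recreate the sample file in a scratch directory (illustrative path).
workdir=$(mktemp -d)
printf '%s\n' '1 b' '3 a' '5 d' '5 d' '4 c' '2 e' '4 c' > "$workdir/test2.txt"

# sort -u sorts the input and drops repeated lines in one step.
unique_lines=$(LC_ALL=C sort -u "$workdir/test2.txt")
printf '%s\n' "$unique_lines"

# uniq -d prints one copy of each line that occurs more than once.
repeated_lines=$(LC_ALL=C sort "$workdir/test2.txt" | uniq -d)
printf '%s\n' "$repeated_lines"

# Clean up the scratch directory.
rm -r "$workdir"
```

The first command prints the same five lines as sort piped into uniq; the second prints 4 c and 5 d, the two lines that appear twice in the file.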

To deduplicate without changing the order of the lines, use awk instead:

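[test@localhost ~]$ cat test2.txt | awk '{
  # lines[] is keyed by the whole line; an unset element compares equal to 0
  if (lines[$0] == 0)
  {
    lines[$0] = 1;   # mark this line as seen
    print($0);       # print only the first occurrence
  }
}'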
1 b
3 a
5 d
4 c
2 e

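Command breakdown:

awk is a programmable text-processing command on Linux. lines is an associative array and $0 is the whole current line. An element that has never been assigned compares equal to 0, so the first time a line appears the condition is true: the line is marked with 1 and printed. On later occurrences lines[$0] is already 1, the condition is false, and the line is skipped. print($0) outputs the line.

The awk deduplication can also be written more compactly: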

[test@localhost ~]$ cat test2.txt | awk '!lines[$0]++'
1 b
3 a
5 d
4 c
2 e

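This has the same meaning as the full awk version above. An awk program consists of patterns and actions in the form Pattern { Action }; when the Action is omitted, awk runs print($0) by default. The deduplication is done entirely by the pattern !lines[$0]++. In awk, an uninitialized array element evaluates to 0 in a numeric context, so lines[$0] is 0 the first time a line is seen. The postfix ++ operator yields the current value and then adds 1, so on first sight the pattern is !0; 0 is false and ! negates it, giving 1. That is equivalent to if(1): the pattern matches and the record is printed. On any later occurrence lines[$0] is already nonzero, the pattern evaluates to 0, and the line is not printed.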
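The evaluation described above can be made visible by printing the counter and the pattern value for every input line. The sketch below is illustrative only: it assumes the seven-line test2.txt recreated in a scratch directory, and the names before and trace are not from the original article.

```shell
# Recreate the sample file in a scratch directory (illustrative path).
workdir=$(mktemp -d)
printf '%s\n' '1 b' '3 a' '5 d' '5 d' '4 c' '2 e' '4 c' > "$workdir/test2.txt"

# For each line, record the counter value that ++ returns (the value
# before incrementing) and the result of negating it, which is what the
# pattern !lines[$0]++ evaluates to (1 = print the line, 0 = skip it).
trace=$(awk '{
  before = lines[$0]++
  printf "%s | before=%d | pattern=%d\n", $0, before, !before
}' "$workdir/test2.txt")
printf '%s\n' "$trace"

# Clean up the scratch directory.
rm -r "$workdir"
```

Only the second 5 d and the second 4 c show before=1 and pattern=0, which are exactly the two lines the one-liner drops.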

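3. Deduplicating by field

uniq has no option for comparing on a single chosen field, so field-based deduplication is done with awk. Here test.txt has the following contents (the ninth line is 3 f rather than the 3 d used in section 1):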

1 b
3 a
5 d
3 d
4 c
2 e
10 g
1 a
3 f
2 e

That is 10 lines of text, 2 columns each.

To deduplicate on column 1:

[test@localhost ~]$ awk -F' ' '!fields[$1]++' test.txt 
1 b
3 a
5 d
4 c
2 e
10 g

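Command breakdown: -F' ' sets the field separator to a space. A single space is also awk's default separator and splits on any run of blanks or tabs, so the option could be omitted here. Otherwise the command works the same way as the line-based awk deduplication above.

To deduplicate on column 2: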

[test@localhost ~]$ awk -F' ' '!fields[$2]++' test.txt 
1 b
3 a
5 d
4 c
2 e
10 g
3 f

In the commands above, $1 and $2 are the first and second fields after splitting.
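The same idiom extends to a key built from more than one field. The following sketch assumes the ten-line test.txt from this section, recreated in a scratch directory; the array name seen and the variable names are illustrative. It relies on awk's default whitespace splitting, so no -F option is given.

```shell
# Recreate the sample file in a scratch directory (illustrative path).
workdir=$(mktemp -d)
printf '%s\n' '1 b' '3 a' '5 d' '3 d' '4 c' '2 e' '10 g' '1 a' '3 f' '2 e' > "$workdir/test.txt"

# Keep a line only the first time its column 1 value appears.
by_first=$(awk '!seen[$1]++' "$workdir/test.txt")
printf '%s\n' "$by_first"

# Keep a line only the first time its (column 1, column 2) pair appears.
by_pair=$(awk '!seen[$1, $2]++' "$workdir/test.txt")
printf '%s\n' "$by_pair"

# Clean up the scratch directory.
rm -r "$workdir"
```

The first command reproduces the column 1 result shown earlier. The second keeps nine lines, dropping only the final 2 e, because that is the only line whose pair of fields has already been seen.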
