Count occurrences of a char in a text file

2022-12-24 00:56:32

Overview

We'll learn how to use Linux commands to get the number of occurrences of a specific character in an input file.

We're assuming that you know some common Linux commands, including grep, awk, and tr.

We'll also assume that our input file, tpoint.txt, contains some dummy data:

$ cat tpoint.txt
"I Love Tpoint!!!"
"Tpoint is great!!!"

For the rest of the tutorial, we’ll be using tpoint.txt for demonstration purposes.

Using the grep Command

The grep command looks for a specific string in an input file.

We’ll now look at the command that counts the occurrences of a character in a file:

$ grep -o 'e' tpoint.txt | wc -l
2

We're searching for occurrences of the letter 'e' in the file tpoint.txt. The -o option makes grep print only the matched part of each line, with every match on its own line.

We then use the "|" symbol to connect the output of grep to the input of wc. The -l option of wc counts the lines in its input, and because grep -o emits one line per match, that line count is the number of occurrences.
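To see why the pipeline counts occurrences rather than matching lines, it can help to look at each stage separately. The following is only an illustrative sketch: it recreates the sample file in a temporary location (the variable names are our own), and it uses the letter t instead of e because t appears twice on one line of the sample, which makes the contrast with grep -c visible.

```shell
# Recreate the two-line sample file in a temporary location.
sample=$(mktemp)
printf '%s\n' '"I Love Tpoint!!!"' '"Tpoint is great!!!"' > "$sample"

# grep -o prints every match on its own line ...
grep -o 't' "$sample"

# ... so counting those lines with wc -l counts the matches.
occurrences=$(grep -o 't' "$sample" | wc -l)

# grep -c counts matching lines, not matches; the two numbers differ
# whenever a line contains the character more than once.
matching_lines=$(grep -c 't' "$sample")

echo "occurrences: $occurrences, matching lines: $matching_lines"
```

This is why grep -c alone is not a substitute for grep -o piped into wc -l when a character can appear several times on a line.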

Case-Insensitive Searching

The grep command allows for the use of the -i option to conduct a case-insensitive search.

$ grep -o -i 'l' tpoint.txt | wc -l
1

Using Multiple Input Files

You can pass several input files to the grep command at once. For example, to count the occurrences of a character across two files, you could run the following commands:

$ cat > dummy.txt
This is dummy text.
$ grep -o -i 'e' tpoint.txt dummy.txt | wc -l
3

We've created a new file called dummy.txt and counted the occurrences of the character in both tpoint.txt and dummy.txt.

With more than one input file, grep -o still prints one line per match (prefixed with the file name), so the result is the total number of occurrences across both files.
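If a separate count per file is wanted instead of the combined total, one option is to run the pipeline once per file. The sketch below assumes a helper function named count_in_file and a temporary working directory, both of which are our own illustrative choices rather than part of the original commands.

```shell
# Hypothetical helper: count occurrences of a pattern, ignoring case, in one file.
count_in_file() {
    grep -o -i -- "$1" "$2" | wc -l
}

# Recreate both sample files in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' '"I Love Tpoint!!!"' '"Tpoint is great!!!"' > "$workdir/tpoint.txt"
printf '%s\n' 'This is dummy text.' > "$workdir/dummy.txt"

# Print one count per file instead of a single combined total.
for f in "$workdir"/tpoint.txt "$workdir"/dummy.txt; do
    printf '%s: %s\n' "$(basename "$f")" "$(count_in_file e "$f")"
done
```

The per-file counts add up to the combined figure that the single multi-file command reports.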

Using the tr Command

The tr command is a tool for performing character-based transformations.

We can combine two of its options, -c and -d, to count the occurrences of a character:

$ tr -c -d 'l' < tpoint.txt | wc -c
0

The sample file contains no lowercase l, so the count is 0. Let us now look at the options used in the above command.

-c: takes the complement of the set, that is, every character not in it
-d: deletes all the characters in the selected set

The set is the sequence of characters given as the argument. In our example, the set is just one letter, l.

When we combine the -c and -d options, tr deletes everything except the characters in the set, including the newlines.

The remaining characters are piped into the wc command using the pipe symbol (|). The -c option of wc prints the number of bytes in its input, which equals the number of occurrences as long as the character is a single byte.
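The intermediate output of tr can be inspected before it is counted. The sketch below is illustrative only: it recreates the sample file in a temporary location and uses the letter o, our own choice, because it appears several times in the sample.

```shell
# Recreate the two-line sample file in a temporary location.
sample=$(mktemp)
printf '%s\n' '"I Love Tpoint!!!"' '"Tpoint is great!!!"' > "$sample"

# After tr -cd 'o' only the o characters survive; even the newlines are deleted.
kept=$(tr -cd 'o' < "$sample")
echo "kept: $kept"

# wc -c counts the bytes that are left, one per occurrence.
count=$(tr -cd 'o' < "$sample" | wc -c)
echo "count: $count"
```

Because tr works on single bytes in many implementations, this approach is best suited to ASCII characters.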

Case-Insensitive Searching

tr has no case-insensitive option, but we can get the same effect by putting both the uppercase and the lowercase letter in the set.

$ tr -cd 'lL' < tpoint.txt | wc -c
1

Using the awk Command

Awk is a programming language for processing text. It reads its input line by line, performs some actions on each line, and writes any results to standard output.

Unlike the two approaches we've discussed so far, this one is a bit trickier to understand.

Let’s take a look at the command and see how it works.

$ awk -F 'e' '{s+=(NF-1)} END {print s}' tpoint.txt
2

By default, awk splits each line into fields on whitespace. Here we have replaced the default field separator with e using the -F option, so each line is split at every occurrence of "e".

A line that contains n occurrences of e is split into n + 1 fields, and awk stores the number of fields in NF, so NF-1 is the occurrence count for that line. We add up the counts for each line in s and finally, in the END block, print the overall occurrence count for the whole file.
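The per-line arithmetic can be made visible by printing NF for each line. The sketch below is illustrative and recreates the sample file in a temporary location. It also shows an alternative based on awk's gsub function, which returns the number of replacements it made. That alternative is our own addition, offered because NF is 0 on an empty line, so the NF-1 approach subtracts one for every blank line in the input.

```shell
# Recreate the two-line sample file in a temporary location.
sample=$(mktemp)
printf '%s\n' '"I Love Tpoint!!!"' '"Tpoint is great!!!"' > "$sample"

# Print the field count and the derived per-line occurrence count.
awk -F 'e' '{ printf "line %d: NF=%d, occurrences=%d\n", NR, NF, NF - 1 }' "$sample"

# Sum the per-line counts; s + 0 prints 0 rather than an empty line for empty input.
total=$(awk -F 'e' '{ s += NF - 1 } END { print s + 0 }' "$sample")

# Alternative sketch: gsub returns the number of replacements it made on the line.
total_gsub=$(awk '{ s += gsub(/e/, "") } END { print s + 0 }' "$sample")

echo "total: $total, via gsub: $total_gsub"
```

On the sample file both variants agree. They diverge only when the input contains empty lines.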

Performance Comparison

All three approaches we've discussed so far perform the same basic task. The difference between them lies in how they process the data.

For small strings or small files, the execution times of these commands are almost the same. The real differences only show up when the input file is large.

Let's run all three of these command lines on a 1.1 GB file and see which one takes less time.

$ ls -lah large.txt
-rw-r--r--. 1 root root 1.1G Jun 12 10:53 large.txt

$ time grep -o 'e' large.txt | wc -l
82256735

real 0m40.733s
user 0m39.649s
sys 0m0.714s

$ time tr -c -d 'e' < large.txt | wc -c
82256735

real 0m2.542s
user 0m1.892s
sys 0m0.433s

$ time awk -Fe '{s+=(NF-1)} END {print s}' large.txt
82256735

real 0m11.080s
user 0m9.589s
sys 0m0.933s

The tr command is the fastest of the three for counting characters in this large file, followed by awk, with grep -o the slowest by a wide margin.
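A comparison along these lines can be reproduced on a smaller, generated file. The sketch below is a rough illustration under our own assumptions: the file contents, the line count, and the helper function timed are all hypothetical, and the helper only measures whole seconds, so the line count may need to be raised considerably before any difference becomes visible. In bash, the time keyword used above gives finer resolution.

```shell
big=$(mktemp)
# Generate 200,000 identical lines; each contains the letter e three times.
awk 'BEGIN { for (i = 0; i < 200000; i++) print "The quick brown fox jumps over the lazy dog" }' > "$big"

# Hypothetical helper: run a pipeline on $big, then print its result and elapsed whole seconds.
timed() {
    start=$(date +%s)
    result=$(sh -c "$1" < "$big")
    end=$(date +%s)
    echo "$1 -> $result in $((end - start))s"
}

timed "grep -o e | wc -l"
timed "tr -cd e | wc -c"
timed "awk -Fe '{ s += NF - 1 } END { print s + 0 }'"
```

All three pipelines should report the same count, which is a useful sanity check before comparing their run times.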

Conclusion

We've learned about different ways to count the occurrences of a character in a text file. We've also covered some special cases, such as case-insensitive searches and searching multiple input files.

We've found that the tr command runs faster than either the awk or the grep command.