hpr4657 :: UNIX Curio #8 - Comparing Files

Finding out what's the diff, without using "diff"

Hosted by Vance on Tuesday, 2026-06-09. Flagged as Clean and released under a CC-BY-SA license.
Tags: unix curio, unix, diff, cmp, comm. Comments: 5.

Duration: 00:14:20

Part of the series: general.

This series is dedicated to exploring little-known—and occasionally useful—trinkets lurking in the dusty corners of UNIX-like operating systems.

Most users of UNIX-like systems are probably familiar with the diff utility. It is widely used with source code to compare two files and see what the differences are between them. Non-programmers, like me, also use it to examine what has changed in different versions of scripts or configuration files. Quite a few pieces of newer software can compare different versions of data and express changes in a format either identical to or similar to diff output.

However, there are two other long-standing tools for this purpose that are far less known and, in my view, deserve to be termed UNIX Curios. The first of these is cmp¹. While diff is primarily intended to be used on text files and compares them line by line, cmp compares files byte by byte. In my experience, its main use is to see whether two binary files are in fact identical: if they are, cmp outputs nothing and returns an exit status of 0. Back when methods of transferring files were not as reliable as they are today, this was a tool I would sometimes reach for. For example, you could use it to confirm that the data on a CD-ROM you burned was the same as the original.

If there is a difference between the files, cmp will return an exit status of 1. By default, it will also print the location (byte and line number) of the first differing byte. When used with the -l option, it will print the location and value of every byte that differs. There is one exception to these: if the files are the same except that one is shorter than the other, it will print a message to that effect. The exit status will still be 1 in that case.
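The behaviours described above can be tried out with a few throwaway files. This is only a minimal sketch: the file names and contents are made up for illustration, and it assumes mktemp is available. The exact wording of the messages varies between implementations, so only the exit statuses are relied on here.

```shell
# Scratch files for the demonstration (names and contents are arbitrary).
tmpdir=$(mktemp -d)
printf 'abcdef\n' > "$tmpdir/one"
printf 'abXdeY\n' > "$tmpdir/two"
printf 'abc' > "$tmpdir/short"

# Default: report only the first difference (byte 3, line 1 here).
first_status=0
cmp "$tmpdir/one" "$tmpdir/two" || first_status=$?

# -l: one line per differing byte (bytes 3 and 6 here), giving the
# byte number followed by the two byte values in octal.
byte_report=$(cmp -l "$tmpdir/one" "$tmpdir/two") || true
printf '%s\n' "$byte_report"

# "short" is a prefix of "one": an EOF message goes to standard error.
prefix_status=0
cmp "$tmpdir/one" "$tmpdir/short" || prefix_status=$?

echo "exit statuses: $first_status $prefix_status"
rm -r "$tmpdir"
```

Both comparisons end with an exit status of 1, since in each case the files differ.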

Using the -s option with cmp will cause it to be totally silent and output nothing. Only the exit status then tells you the result: 0 if the files are the same, 1 if they differ, and greater than 1 if an error occurred. This makes it useful for scripting, for example to confirm that a file copied to another location arrived fully intact.
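As a rough illustration of that scripting use, the sketch below wraps cmp -s in a small shell function. The name verify_copy and the sample files are hypothetical, chosen only for this example, and mktemp is assumed to be available.

```shell
# verify_copy is a hypothetical helper, not a standard command.
# It maps the three possible outcomes of "cmp -s" to a word.
verify_copy() {
    status=0
    cmp -s "$1" "$2" || status=$?
    case $status in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;
    esac
}

tmpdir=$(mktemp -d)
printf 'payload\n' > "$tmpdir/original"
cp "$tmpdir/original" "$tmpdir/copy"
printf 'pay' > "$tmpdir/truncated"

good=$(verify_copy "$tmpdir/original" "$tmpdir/copy")
bad=$(verify_copy "$tmpdir/original" "$tmpdir/truncated")
missing=$(verify_copy "$tmpdir/original" "$tmpdir/no-such-file")

echo "$good $bad $missing"
rm -r "$tmpdir"
```

Treating anything above 1 as an error, rather than testing for one specific number, follows the way the exit status is described above.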

It is worth noting that diff is also capable of comparing binary files; however, it is not required by POSIX to report what is actually different or where differences occur. The same exit status as in cmp is returned: 0 if the files are the same, 1 if they are different, or greater than 1 if an error occurred. While many implementations offer an option to suppress the output, this is not in the standard², so the most portable method is to redirect the output to /dev/null instead. On my system the diff utility is three times the size of cmp, so if you don't need its extra capabilities, it is a less efficient way of doing the job.
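The portable approach with diff might look like the following sketch. The binary test files are invented for the example and mktemp is assumed to be available; only standard output is discarded, so genuine error messages would still be visible.

```shell
tmpdir=$(mktemp -d)
# Three-byte binary files that differ only in their last byte.
printf '\000\001\002' > "$tmpdir/a.bin"
printf '\000\001\003' > "$tmpdir/b.bin"
cp "$tmpdir/a.bin" "$tmpdir/a-copy.bin"

# Discard whatever diff would print; keep only the exit status.
differ_status=0
diff "$tmpdir/a.bin" "$tmpdir/b.bin" > /dev/null || differ_status=$?

same_status=0
diff "$tmpdir/a.bin" "$tmpdir/a-copy.bin" > /dev/null || same_status=$?

echo "different pair: $differ_status, identical pair: $same_status"
rm -r "$tmpdir"
```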

The other UNIX Curio for today is comm³, which is also intended to compare two files, in this case to see what is common between them. Ken Fallon briefly talked about it a few years ago in HPR episode 3889. Compared to the others, it has a much more specific use case. The two files are expected to be text files that are already sorted. What comm does is print a tab-separated list of all the lines appearing in either or both files. Lines only in the first file appear in the first column, lines only in the second file in the second column, and lines in both files in the third column.

Any combination of the options -1, -2, and -3 can be used with comm to suppress printing of the first, second, or third column respectively. Using all three options at the same time is supported, but it results in no output, so that isn't very useful. Unlike the other utilities, the exit status of comm doesn't tell you anything about the two files. It will be 0 if the program ran successfully, and greater than 0 if it didn't.
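The column layout is easier to see with a small example. The sketch below uses two made-up, already sorted word lists and assumes mktemp is available. The "columns" are really just leading tabs: none for the first column, one for the second, two for the third.

```shell
tmpdir=$(mktemp -d)
# Two small lists, already in sorted order.
printf '%s\n' apple banana cherry > "$tmpdir/first"
printf '%s\n' banana cherry damson > "$tmpdir/second"

# Default output, all three columns:
#   apple              only in first  (no leading tab)
#   <TAB><TAB>banana   in both        (two leading tabs)
#   <TAB><TAB>cherry   in both
#   <TAB>damson        only in second (one leading tab)
default_out=$(comm "$tmpdir/first" "$tmpdir/second")
printf '%s\n' "$default_out"

# Suppress columns 2 and 3, leaving the lines unique to the first file.
only_first=$(comm -23 "$tmpdir/first" "$tmpdir/second")
printf '%s\n' "$only_first"

rm -r "$tmpdir"
```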

I'm not sure if I have ever actually used comm for anything practical. I find its default output a bit difficult to meaningfully interpret, plus you need to ensure the two files are already sorted. It seems to be best suited to comparing lists, and one use case that Ken Fallon mentioned would be comparing two lists of files to see if any are missing. The command comm -3 listA listB would print files that only appear in listA in the first column and those only in listB in the second column. This would let you ignore all the filenames that appear in both and focus on those that were absent from one or the other. If on the other hand you only wanted to see the filenames that are on both lists, comm -12 listA listB would give you that.
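The list comparison described above could be sketched as follows. The filenames in listA and listB are invented for illustration, and mktemp is assumed to be available. Each list is passed through sort first, because comm expects sorted input and both tools need to agree on the ordering.

```shell
tmpdir=$(mktemp -d)
# Hypothetical file listings, sorted so that comm can work with them.
printf '%s\n' notes.txt photo.jpg backup.tar | sort > "$tmpdir/listA"
printf '%s\n' photo.jpg report.pdf notes.txt | sort > "$tmpdir/listB"

# Names missing from one list or the other: listA-only names start at
# the left margin, listB-only names are preceded by one tab.
missing_somewhere=$(comm -3 "$tmpdir/listA" "$tmpdir/listB")
printf '%s\n' "$missing_somewhere"

# Names present on both lists.
on_both=$(comm -12 "$tmpdir/listA" "$tmpdir/listB")
printf '%s\n' "$on_both"

rm -r "$tmpdir"
```

Here backup.tar shows up as only on listA, report.pdf as only on listB, and notes.txt and photo.jpg as common to both.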

Some more frivolous potential uses also come to mind. If for some reason the cat utility is broken on your system, you could use comm listA /dev/null to print the file listA instead. If you want to insert tab characters before every line of a file but have an aversion to using sed or awk, then comm /dev/null listA would output listA with one tab before each line, and comm listA listA would insert two tabs. A bit silly, but it would work. The GNU implementation of comm even lets you choose something other than a tab to separate the columns⁴, so you could go wild with that.
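For the curious, these tricks can be checked with a short sketch. The contents of listA are made up and kept in sorted order, since that is what comm expects; mktemp is assumed to be available.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' alpha beta gamma > "$tmpdir/listA"

# Everything lands in column 1, so the file comes out unchanged.
as_cat=$(comm "$tmpdir/listA" /dev/null)

# Everything lands in column 2: one tab before each line.
one_tab=$(comm /dev/null "$tmpdir/listA")

# Everything lands in column 3: two tabs before each line.
two_tabs=$(comm "$tmpdir/listA" "$tmpdir/listA")

printf '%s\n' "$as_cat" "$one_tab" "$two_tabs"
rm -r "$tmpdir"
```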

According to the POSIX specifications for cmp and comm, one of the two filenames given as arguments, but not both, can be a "-", in which case standard input will be used for that "file" in the comparison. Also, the results are undefined if both arguments are the same FIFO special, character special, or block special file. Some implementations might not have these limitations, but you shouldn't rely on that everywhere.
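Reading one side from standard input is handy when a list is not sorted yet, because it can be sorted on the fly. A minimal sketch, with invented file names and contents and assuming mktemp is available:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmpdir/unsorted"
printf '%s\n' apple fig plum > "$tmpdir/sorted"

# Sort on the fly and hand the result to comm as its first "file".
shared=$(sort "$tmpdir/unsorted" | comm -12 - "$tmpdir/sorted")
printf '%s\n' "$shared"

# cmp accepts "-" the same way: compare a stream with a file on disk.
stream_status=0
printf '%s\n' apple fig plum | cmp -s - "$tmpdir/sorted" || stream_status=$?
echo "stream comparison exit status: $stream_status"

rm -r "$tmpdir"
```

Only one of the two arguments is "-" in each command, in line with the restriction described above.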

All three of these were developed quite early. The cmp utility appeared in 1971's First Edition UNIX⁵, while comm and diff seem to have made their debut in Fourth Edition UNIX⁶,⁷ from 1973. The original versions might not have behaved exactly like their modern counterparts, and newer implementations (especially of the diff utility) have acquired additional options and capabilities, but the basic operation of each has stayed the same.

The next time you need to compare files against each other, consider whether cmp or comm might be appropriate before automatically reaching for diff. They all have their uses in different situations.

References:

1. cmp specification: https://pubs.opengroup.org/onlinepubs/009695399/utilities/cmp.html
2. diff specification: https://pubs.opengroup.org/onlinepubs/009695399/utilities/diff.html
3. comm specification: https://pubs.opengroup.org/onlinepubs/009695399/utilities/comm.html
4. GNU coreutils manual, comm: https://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html
5. First Edition UNIX cmp manual page: http://man.cat-v.org/unix-1st/1/cmp
6. Fourth Edition UNIX comm manual page: https://www.tuhs.org/cgi-bin/utree.pl?file=V4/usr/man/man1/comm.1
7. Fourth Edition UNIX diff source: https://www.tuhs.org/cgi-bin/utree.pl?file=V4/usr/source/s1/diff1.c

Comments

Comment #1 posted on 2026-06-09 23:31:58 by xmanmonk

Great Show (again)

Another great show on these often forgotten commands. Glad to hear you have some more episodes in the works! Looking forward to them!

Comment #2 posted on 2026-06-11 16:57:57 by candycanearter07

comparisons

While cmp may not be as useful as it once was, it is still quite useful for testing whether two files are bit-for-bit copies of each other.

Comment #3 posted on 2026-06-13 11:47:53 by Whiskeyjack

HPR4657 - use of comm

comm is actually a very, very useful program in scripts if you know how to make good use of it.

For example, bash can be fairly slow when used in a classic looping algorithm over a large amount of data.

However, if you can reformat the data so that it can be compared with comm, then you can use comm as a filter without any loops.

As an example, in one application a simple loop took 3.7 seconds to work its way through the data, which was far too long.

However, by using awk and sort to reformat one file, and a combination of find, cut, sort, uniq, and awk on the directory structure to generate a second file and then comparing them with comm, I was able to filter the information down to just the records that had relevant changes, and then use the slower looping algorithm on those.

This cut that time down to 0.150 seconds, which was more or less instantaneous from a user perspective. Despite this method appearing to have a lot more transformations in it, it was 25 times faster. This is because there are actually far fewer calls to commands in the second algorithm, even though more different commands are involved.

So comm is a very useful command to know, and if you have a lot of information to process it should be one of the tools that you turn to when figuring out the best way to do it.

Comment #4 posted on 2026-06-15 03:16:05 by Vance

Appreciate the comments

When I was uploading this episode, I started to feel like maybe I was selling "comm" short a bit. Nice to hear that it's been useful for you, Whiskeyjack. While its functionality can't be easily replicated with "cut" or "awk", those utilities can certainly make good use of the output from "comm".

Interesting to read about your experience in terms of runtime. I have taken to using a single call to "awk" in situations where I might otherwise call on several utilities in a pipeline. Your comment is a good reminder that if something will be used repeatedly, it's best to measure its runtime instead of automatically assuming that one tool or set of tools will be faster than another.

Comment #5 posted on 2026-06-15 17:02:57 by Whiskeyjack

Reply to Vance on awk in HPR4657

Awk is another extremely useful command to know in terms of improving performance in cases where you might otherwise need to use a loop.

I don't have any relative numbers to hand in this instance, but I know that I have increased performance very significantly by structuring an algorithm to allow use of awk.

The history of awk may be a good subject for a Unix Curio episode.

"Expect" would be a good command to cover as well. I have used this for scripted log-ins to test VMs over SSH in cases where SSH keys wouldn't work for some reason.
