hpr4617 :: UNIX Curio #4 - Archiving Files

Utilities and file formats, especially "pax"

Hosted by Vance on Tuesday, 2026-04-14. Flagged as Clean and released under a CC-BY-SA license.
Tags: unix curio, unix, archive, tar, cpio, pax.

Duration: 00:15:57

This series is dedicated to exploring little-known—and occasionally useful—trinkets lurking in the dusty corners of UNIX-like operating systems.

When you think about creating and managing archives on a UNIX system, tar is probably the utility that comes to mind. But this was not the first archiving program; ar was in First Edition UNIX [1] and cpio also pre-dates it, sort of [2]. According to the NetBSD manual page, cpio was developed within AT&T before tar, but did not get widely released until System III UNIX, after tar was already well known from the earlier release of Seventh Edition UNIX (a.k.a. Version 7).

You might think that ar and cpio are old and irrelevant these days, but these formats do live on. Each Debian package file [3] is an ar archive which in turn contains two tar files. On Red Hat, Fedora, SUSE, and some other distributions, each .rpm package file [4] contains a cpio payload. So these may very well still be in use on your modern Linux system.

But let's get back to the subject of what you might want to use to create archives today. The tar utility has persisted in its popularity over the decades, and you most probably have a version installed on your UNIX-like systems. One of the problems with tar, however, is that it has not kept a consistent file format. Also, different implementations have used differing syntax at times.

There are excellent reasons for the file format changing [5]. The names people give files have gotten longer over time, and the original Seventh Edition tar format could only handle a total pathname length of 100 bytes for each archive member. In addition, filenames were in ASCII, and modern filesystems now accommodate richer encodings with characters that aren't in ASCII. The size of each archive member was limited to 8 gigabytes, which was unthinkably large back then but is not so big these days. User and group ownership could only be specified by numeric ID, which can vary from one system to another. Many other types of files and information simply couldn't be stored: block and character device nodes, FIFOs, sockets, extended attributes, access control lists, and SELinux contexts.
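The 100-byte and 8-gigabyte limits fall straight out of the fixed-width header fields. Here is a rough sketch of the arithmetic in Python, assuming the commonly documented layout of a 100-byte name field and a 12-byte size field holding 11 octal digits plus a terminator; the constant names are made up for this illustration.

```python
# Field widths as commonly documented for the Seventh Edition tar header.
# These are assumptions for illustration; check tar(5) or
# libarchive-formats(5) on your system for the authoritative layout.
NAME_FIELD_BYTES = 100   # the whole pathname has to fit in this field
SIZE_FIELD_BYTES = 12    # ASCII octal digits plus one terminator byte

octal_digits = SIZE_FIELD_BYTES - 1
max_member_size = 8 ** octal_digits - 1   # largest value 11 octal digits can hold

print(f"longest pathname: {NAME_FIELD_BYTES} bytes")
print(f"largest member:   {max_member_size} bytes "
      f"(one byte short of {(max_member_size + 1) // 2**30} GiB)")
```

Storing the size as text in octal is what makes the ceiling a power of eight rather than a round decimal number.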

As a result, the tar format had to evolve over the years. One important version was the ustar format, created for the 1988 POSIX standard. The POSIX committee wanted to standardize both the file format and the syntax of the tar command. While the ustar format addressed some shortcomings, progress marched on: filesystems started allowing filenames in different character sets and more types of information to be attached to files. Meanwhile, developers of different tar implementations had been reluctant to change away from their historical option syntax to the standard.

So for the 2001 revision of POSIX they gave up on standardizing the tar utility and came up with a new format and utility, which is our actual UNIX Curio for this episode: pax [6]. Since the pax program didn't have historical baggage, they could specify its options, behavior, and file format and be sure everyone's implementation would match. The pax utility was also an attempt to avoid taking sides between those who advocated for tar and fans of cpio.

The pax file format is an extension of ustar with the ability to attach arbitrary new attributes, encoded as UTF-8 Unicode, to each archive member. Some of these attribute names were standardized, but implementers could also define their own, making the format more future-proof. Older versions of tar that can handle the ustar format should still be able to process pax archives, but might not know what to do with the extra attributes.
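To make the extended attributes a little more concrete, here is a minimal sketch using Python's standard tarfile module, which can write the pax format. It is only one implementation, not the pax utility itself, and the "HPR.note" keyword is a made-up vendor attribute used purely for illustration.

```python
import io
import tarfile

# A name that is both non-ASCII and longer than 100 bytes, so it cannot
# fit in the fixed-width ustar header fields on its own.
long_name = "grüße/" + "x" * 120 + ".txt"
payload = b"hello\n"

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    info = tarfile.TarInfo(name=long_name)
    info.size = len(payload)
    # "HPR.note" is a hypothetical vendor keyword, not a standardized one.
    info.pax_headers = {"HPR.note": "any UTF-8 text: über"}
    tar.addfile(info, io.BytesIO(payload))

# Read the archive back and look at what travelled with the member.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmembers()[0]
    restored_name = member.name
    restored_headers = dict(member.pax_headers)

print(restored_name == long_name)
print(sorted(restored_headers))
```

The full pathname travels in a standardized "path" attribute, while the custom keyword simply rides along for any reader that cares about it.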

GNU tar developed its current archive format [7] alongside the standardization of the ustar format. The GNU format was based on an early draft which later underwent incompatible changes, so the two unfortunately are not interchangeable. Unlike ustar, the GNU format has no limits on the size of files or the length of their names. In addition to its own format, GNU tar is able to detect and correctly process both ustar and pax archives. In situations where its native format can't store necessary information about a file (such as POSIX access control lists or extended attributes), GNU tar will automatically output the pax format instead (called "posix" in documentation). However, it still uses the GNU format by default, though the documentation has been threatening to move to the POSIX format for at least 20 years [8].
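The difference in name-length limits is easy to see with a small experiment. This sketch uses the writers in Python's tarfile module as stand-ins for the three formats; it is not GNU tar, and other implementations may behave differently at the edges. The helper name try_write is invented for this example.

```python
import io
import tarfile

def try_write(fmt, name):
    """Return True if an empty member called `name` can be written in `fmt`."""
    buf = io.BytesIO()
    try:
        with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
            tar.addfile(tarfile.TarInfo(name=name))
    except ValueError:
        # Python's ustar writer raises ValueError when the name cannot fit.
        return False
    return True

# 300 characters with no "/" to split on, so ustar's separate prefix
# field cannot be used to stretch the limit.
long_name = "d" * 300

results = {
    "ustar": try_write(tarfile.USTAR_FORMAT, long_name),
    "gnu": try_write(tarfile.GNU_FORMAT, long_name),
    "pax": try_write(tarfile.PAX_FORMAT, long_name),
}
print(results)
```

With this module, the ustar writer refuses the name while the GNU and pax writers both accept it, each using its own mechanism for long names.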

The good news is that the ustar, pax, GNU tar, and Seventh Edition tar formats are well documented and utilities across many UNIX-like systems [2, 7, 9, 10, 11] are able to handle these, depending on which formats existed when the utility was developed. While your system may not have pax itself installed, there are other archiving utilities that can read the file format, including GNU tar. (Somewhat amusingly, Debian and some other Free Software operating systems package a pax utility developed by MirBSD [12] which largely follows the POSIX-specified interface, but doesn't support reading or writing archives in pax format!) Look at the manual page for the tar, cpio, or pax utilities on your system to see if they can handle pax archives.
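Because the formats are documented, you can often tell them apart by peeking at the first 512-byte header block. The following is a heuristic sketch, assuming the usual field offsets (type flag at byte 156, magic and version at bytes 257 through 264); the function names are invented here. It only inspects the first header, so a pax archive whose first member needs no extended header would be reported as ustar.

```python
import io
import tarfile

def sniff_tar_format(block: bytes) -> str:
    """Guess the archive flavor from the first 512-byte header block."""
    magic = block[257:265]       # "magic" and "version" fields together
    typeflag = block[156:157]
    if magic == b"ustar\x0000":
        # pax archives are ustar archives with extra 'x' or 'g' header members
        return "pax" if typeflag in (b"x", b"g") else "ustar"
    if magic == b"ustar  \x00":
        return "gnu"
    return "v7 or not a tar archive"

def first_block(fmt, name):
    """Write a one-member archive in memory and return its first block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(tarfile.TarInfo(name=name))
    return buf.getvalue()[:512]

print(sniff_tar_format(first_block(tarfile.USTAR_FORMAT, "a.txt")))
print(sniff_tar_format(first_block(tarfile.GNU_FORMAT, "a.txt")))
# A non-ASCII name forces an extended header in front of the member.
print(sniff_tar_format(first_block(tarfile.PAX_FORMAT, "grüße.txt")))
```

The test archives here come from Python's tarfile module, so archives produced by other tools could differ in small details.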

Perhaps one aspect that has worked in favor of tar and other UNIX archive formats is that they only concern themselves with storing files and make no attempt at compression. Instead, it is common for a complete archive file to be compressed after creation; many utilities can be told to do this step for you, but it is not typically the default behavior. Therefore, if a better compression method comes along, the archive format doesn't need to change. If you do use compression, be careful to choose a method that is available on the destination system. Compressing files is a big enough subject to deserve its own episode, so we won't talk more about it here.
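That separation between archiving and compression can be shown in a few lines. This sketch builds one archive in memory with Python's tarfile module and then wraps the very same bytes with two different compressors; the file name and contents are placeholders.

```python
import gzip
import io
import lzma
import tarfile

# Build a small uncompressed archive in memory.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    data = b"some file contents\n"
    info = tarfile.TarInfo(name="notes.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
archive = raw.getvalue()

# Compression is a separate layer wrapped around the finished archive.
gz = gzip.compress(archive)
xz = lzma.compress(archive)

# Either wrapper yields the same archive once it is peeled off again.
listings = []
for blob in (gz, xz):
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        listings.append(tar.getnames())
print(listings)
```

Swapping gzip for xz changed nothing about the archive inside, which is why a newer compressor can be adopted without touching the archive format.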

So which format should you use when creating an archive? Unfortunately, there is no single answer that applies in all circumstances. The pax format is supported among modern UNIX-like systems and can represent all types of files and metadata. While other systems, their filesystems, and archive utilities might not be able to properly make use of all the metadata, they should at least be able to extract the data contained in files and, if Unicode is supported, give them appropriate filenames. If you intend to unpack the archive on an older system, more research might be needed to figure out what formats it is able to handle. The Seventh Edition tar format (often called "v7") is widely supported, including by older systems, but has limitations in what it can contain as described earlier.

Moving beyond the UNIX world, things get even more complicated. Apple's macOS, with its FreeBSD underpinnings, easily handles tar files. However, when it comes to MS-DOS and Windows, it's a bit different. There, a multitude of archiving programs and formats arose, usually combining archiving with compression. PKZIP was probably the most popular of these and its .zip format became common in many places, helped by the fact that PKWARE openly published the specification. While there is only a single .zip format, it has many options, some proprietary, and different implementations have diverged in the way some aspects are handled (or not handled). An ISO/IEC standard for .zip [13] was published in 2015 giving a baseline profile, and sticking to it produces files that can be widely extracted successfully. Other file formats like OpenDocument use the .zip format and typically hew to the standardized profile.

Windows' File Explorer, starting with Windows XP, can natively extract .zip files [14]. The Info-ZIP program [15] is a Free Software implementation for a wide variety of systems (even rather obscure ones); while it might not be installed on yours, if you're copying the archive file over, you can probably copy over its unzip utility at the same time to unpack it. So .zip probably has the broadest support, although a tool for it might not already be present on every system. However, as Klaatu points out in Hacker Public Radio episode 4557 [16], .zip files and applications handling them aren't always great at maintaining metadata about files. The .zip format has no dedicated, portable field for UNIX file permissions (tools such as Info-ZIP store them in a host-specific attribute field that other implementations may ignore), and user/group ownership can only be included as numeric IDs. Other types of metadata on UNIX-like systems are not saved at all. This is probably not a problem in some cases, such as with a collection of photos, but for others it might be a concern.

While pax as a utility does not seem to have gained much popularity or support, except on commercial UNIX systems where including it was required to conform to the POSIX standard, its file format has persisted. Free Software systems have generally avoided the pax interface, preferring to stick with the tar utility on the command line, but usually have good support for archive files in the pax format. Outside of UNIX-like systems, .zip seems to have become the most common file format, and support for it is also good in the UNIX world, though it might not be built in.

References:

  1. Archive (library) file format https://man.cat-v.org/unix-1st/5/archive
  2. NetBSD 10.0 cpio manual page https://man.netbsd.org/NetBSD-10.0/cpio.1
  3. Debian binary package format https://manpages.debian.org/trixie/dpkg-dev/deb.5.en.html
  4. RPM V6 Package format https://rpm.org/docs/6.0.x/manual/format_v6.html
  5. NetBSD 10.0 libarchive-formats manual page https://man.netbsd.org/NetBSD-10.0/libarchive-formats.5
  6. Pax specification https://pubs.opengroup.org/onlinepubs/009695399/utilities/pax.html
  7. GNU tar manual https://www.gnu.org/software/tar/manual/tar.html
  8. GNU tar manual for version 1.15.90 https://web.cvs.savannah.gnu.org/viewvc/*checkout*/tar/tar/manual/tar.html?revision=1.3
  9. FreeBSD 15.0 libarchive-formats manual page https://man.freebsd.org/cgi/man.cgi?query=libarchive-formats&sektion=5&apropos=0&manpath=FreeBSD+15.0-RELEASE+and+Ports
  10. OpenBSD 7.8 tar manual page https://man.openbsd.org/OpenBSD-7.8/tar
  11. HP-UX Reference (11i v3 07/02) - 1 User Commands N-Z (vol 2) https://support.hpe.com/hpesc/public/docDisplay?docId=c01922474&docLocale=en_US
  12. MirBSD pax(1) manual page http://www.mirbsd.org/htman/i386/man1/pax.htm#Sh.STANDARDS
  13. ISO/IEC 21320-1:2015 Information technology - Document Container File Part 1: Core https://www.iso.org/standard/60101.html
  14. Mastering File Compression on Windows https://windowsforum.com/threads/mastering-file-compression-on-windows-how-to-zip-and-unzip-files-effortlessly.369235/
  15. About Info-ZIP https://infozip.sourceforge.net/
  16. HPR4557::Why I prefer tar to zip https://hackerpublicradio.org/eps/hpr4557/index.html

