Generate hashes of files with rhash for archival storage

Published: 11-12-2015 | Author: Remy van Elst

❗ This post is over eight years old. It may no longer be up to date. Opinions may have changed.

Recently I had to move a large number of files to archival storage. To save space and reduce the number of files, I decided to bundle them into archives with tar. The archives will be stored on tapes and DVDs and will always be restored in full, so random access times are not an issue, hence the choice of tar.gz.

I do want to make sure that the files are still correct when they need to be restored. I first dabbled with some long shell commands to create checksums and verify them, but then I found the rhash tool in the repositories. It can create checksums of files and folders, recursively, using many algorithms, such as CRC32, MD5 and SHA1. It also makes bulk validation very simple.

This small article shows you how to create an archive file with the checksums included and shows you how to validate these checksums later on.

The data in question consists of archived tapes, disk copies, source code and documentation for the PDP8 minicomputer. We also have these for the PDP11 and a few VAX machines. The archives contain about 5 million files and are about 700 GB in size. The company decided to phase out the online storage and move this data to tapes and DVDs, since it is not accessed more than once or twice a month.

Creating the hashes

The first archive contains PDP8 files located in the folder pdp8. This command creates the MD5SUMS file, which we place in the same folder:

rhash --recursive --md5 --output=pdp8/MD5SUMS pdp8/

The archive is then created with a simple tar -czf pdp8.tar.gz pdp8.
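
On a machine where rhash is not available, a similar manifest can be built with find and md5sum from coreutils. This is a swapped-in technique, not the rhash command used above, and make_manifest is a hypothetical helper name. It is a minimal sketch that assumes file names without newlines and records paths relative to the folder itself:

```shell
#!/bin/sh
# Sketch: build an MD5SUMS manifest with find + md5sum instead of rhash.
# make_manifest is a hypothetical helper, not part of any tool.
make_manifest() {
    dir=$1
    (
        cd "$dir" || exit 1
        # Hash every regular file except files named MD5SUMS, so the
        # manifest never lists itself. Paths are stored as ./sub/file,
        # relative to the folder that holds the manifest.
        find . -type f ! -name MD5SUMS -exec md5sum {} + | sort -k 2 > MD5SUMS
    )
}

# Example use (not executed here):
#   make_manifest pdp8 && tar -czf pdp8.tar.gz pdp8
```

Storing relative paths means the manifest stays valid no matter where the archive is extracted later.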

Verifying the hashes

Extract the archive to a folder and use the following command to verify all files:

rhash --skip-ok --check pdp8/MD5SUMS 

If all files match the output looks like this:

--( Verifying pdp8/MD5SUMS )----------------------------------------------------
Everything OK

If a file does not match the hash, the output will include it:

--( Verifying pdp8/MD5SUMS )----------------------------------------------------
pdp8/pdp8/readme.txt                                ERR
Errors Occurred: Errors:1   Miss:0   Success:3323 Total:3324

If you leave out the --skip-ok option, every checked file will be shown, which can result in very long output.
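
For an unattended restore it can help to act on an exit status rather than read the output. The sketch below uses md5sum -c instead of rhash (a substitute, since md5sum reads the same "hash, two spaces, path" format) and hides the OK lines itself. verify_manifest is a hypothetical helper, and it assumes the manifest stores paths relative to the folder that contains it:

```shell
#!/bin/sh
# Sketch: verify an extracted tree against its MD5SUMS using md5sum -c.
# verify_manifest is a hypothetical helper, not part of any tool.
verify_manifest() {
    dir=$1
    (
        cd "$dir" || exit 1
        # Capture the full report and the exit status of md5sum,
        # which is non-zero when any file fails or is missing.
        out=$(md5sum -c MD5SUMS 2>&1)
        status=$?
        # Show only the lines that are not OK, similar in spirit
        # to rhash --skip-ok.
        printf '%s\n' "$out" | grep -v ': OK$'
        exit "$status"
    )
}

# Example use (not executed here):
#   tar -xzf pdp8.tar.gz && verify_manifest pdp8 && echo "archive intact"
```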

To manually verify one file, first get the checksum:

$ grep 'pdp8/readme.txt' pdp8/MD5SUMS 
53a1aca1631d55de3feece9e1c4d900a  pdp8/pdp8/readme.txt

Then manually execute the correct checksum command to verify the match:

$ md5sum pdp8/pdp8/readme.txt 
53a1aca1631d55de3feece9e1c4d900a  pdp8/pdp8/readme.txt
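
The two manual steps above can be combined into one small function that compares the recorded and the computed hash. This is only a sketch: check_one is a hypothetical helper, and it assumes paths without spaces, since it reads the path from the second whitespace-separated column of the manifest:

```shell
#!/bin/sh
# Sketch: compare one file's recorded MD5 with a freshly computed one.
# check_one is a hypothetical helper. Arguments:
#   $1 manifest file, $2 path as written in the manifest, $3 file on disk
check_one() {
    manifest=$1
    recorded_path=$2
    file=$3
    # Exact match on the path column. Unlike grep, this does not treat
    # "." as a wildcard and does not match partial paths.
    recorded=$(awk -v p="$recorded_path" '$2 == p { print $1; exit }' "$manifest")
    actual=$(md5sum < "$file" | awk '{ print $1 }')
    # Succeeds only if the path was found and both hashes are equal.
    [ -n "$recorded" ] && [ "$recorded" = "$actual" ]
}

# Example use (not executed here):
#   check_one pdp8/MD5SUMS pdp8/readme.txt pdp8/pdp8/readme.txt && echo OK
```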