Testimonial: Isabel Meyer
Type: Software Workflow
Creating and verifying fixity for digital files using checksums is an essential aspect of digital preservation. In the 15-minute video presentation below, the Smithsonian's DAMS Branch Manager Isabel Meyer introduces audiences to core aspects and functions of checksum activities, and provides a useful demo of the Institution's free "md5er" fixity tool. Visit the Smithsonian's GitHub page to download the "md5er" tool.
[Isabel Meyer:] Hi, my name is Isabel Meyer and I manage the Smithsonian Institution’s Digital Asset Management System, which is also referred to as the DAMS. In this video I'm going to talk about checksums: what they are, how they're generated, how to use them, and why they are important to media art management and preservation. I will also demonstrate a checksum software tool that was developed by the Smithsonian’s Digitization Program Office and that is used throughout the Institution by our staff.
In a nutshell, a checksum is a unique digital signature that can be assigned to a file. In a similar way that a Social Security number is unique to each individual, a checksum is unique to each digital file. In order to understand the concept of a checksum, it is important to have a basic understanding of the composition of a digital file. At the most basic level, digital files, regardless of format, are composed of numerical patterns of ones and zeros, which are defined as a binary representation. There can only be one of two values: a one or a zero. Each numerical representation is stored in what is referred to as a bit. In order to express numbers, letters, words, colors, shapes, etc., the bits are organized into logically sized groupings known as bytes. These binary patterns are read and understood by computers and are referred to as machine language or machine code. Images and videos are organized and stored by picture elements known as “pixels,” which are grouped into bit patterns defined by a bitmap. Digital video is stored as a series of bitmaps called “I-Frames.” A bitmap is exactly what it implies. It's essentially a map that gives a computer machine-language directions on how to organize the patterns of bits and how to display them. It's really logically beautiful. Computer file design and organization truly is an art form in itself, but that's a different topic.
So taking a look at the slide, we'll see that the left-hand column contains the decimal representation of the digits 0 through 4, and on the right we have the binary representation of what that number looks like stored in bit format or in a byte. Here it is represented in four bits. So it's a pattern of ones and zeros that, you know, the machine can read and interpret. All right... so, oops. The key takeaway for this discussion is to understand that every media file, regardless of format, is stored as a complex grouping of ones and zeros that can be read by a computer.
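The slide described above can be reproduced with a short Python sketch. The function name `four_bit_table` is made up for this illustration; the only thing taken from the talk is the idea of showing the digits 0 through 4 next to their four-bit patterns.

```python
def four_bit_table(start=0, stop=4):
    """Return (decimal, four-bit binary string) pairs for start..stop inclusive."""
    # format(n, "04b") renders n in base 2, zero-padded to four digits.
    return [(n, format(n, "04b")) for n in range(start, stop + 1)]


# Print the two columns from the slide: decimal on the left, bits on the right.
for decimal, bits in four_bit_table():
    print(decimal, bits)
```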
So if you can imagine the number of patterns of ones and zeros it takes to represent a time-based media artwork, that's a boatload of bits. Mind-boggling! So just hold on to that image because we’re going to use that image to talk about checksums. What is a checksum?
Okay, so gather all those patterns of ones and zeros that represent a media file. You feed them through a calculation and it provides the resulting value. Basically, a checksum is a value that is generated by taking all the ones and zeros that make up a file, running them through a computational algorithm known as a cryptographic hash function, and outputting a value. The value returned by that algorithm is a checksum. So on the screen, this series of numbers and letters, which is actually a hexadecimal representation, is the output from a file being run through a checksum algorithm. This happens to be an MD5 output. The second one is also the value of a checksum, um, the result of the checksum algorithm. The format's a little bit different because it's a SHA-1 output checksum. Basically it's a longer variation, the result of a longer calculation.
The media file as it was originally created will always contain the same pattern of ones and zeros, so every time this cryptographic hash function is executed it will return exactly the same value. However, if the original file is edited or is in any way changed, the pattern of ones and zeros that represents the file will also change. So when you feed that modified file through that checksum algorithm again, since the sequence and the pattern have changed, the output value that it returns, like this checksum, will be different. And, you know, that could be just a single bit that's changed. So even if a file is corrupted by a disk error after it has been stored to disk or tape, or it encounters a file transfer error, the pattern of ones and zeros will change, and generating a checksum against the file will produce a different value.
There are different forms of algorithms, and those include MD5, SHA-1, SHA-256, and SHA-512. These different forms, you know, just execute stronger or more secure cryptographic hash functions. Which one is best to use is going to depend on your environment and your resources.
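As a minimal sketch of the behavior described here, the Python below runs the same bytes through MD5 and SHA-1 using the standard `hashlib` module, then flips a single bit and hashes again. The helper names `checksum` and `flip_bit` and the sample bytes are made up for this example; a real media file would be read from disk.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "md5") -> str:
    """Return the hexadecimal checksum of data for the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    changed = bytearray(data)
    changed[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(changed)


original = b"stand-in bytes for a media file"  # hypothetical file content
damaged = flip_bit(original, 0)                # simulate a one-bit error

print("MD5   original:", checksum(original, "md5"))   # 32 hex characters
print("SHA-1 original:", checksum(original, "sha1"))  # 40 hex characters
print("MD5   damaged: ", checksum(damaged, "md5"))    # differs from the first line
```

Running the same input twice always gives the same value, while the one-bit change produces a different checksum.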
However, the important thing to note is that each algorithm generates a different form of the checksum, as you can see here. So every time you do a validation, you must use the same algorithm that was used to generate the original checksum. So in other words, if you started out with MD5, the next time you run it, it has to be an MD5 checksum again. You can't mix formats.
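One way to avoid mixing formats is to store the algorithm name next to the checksum value. The sketch below assumes a simple dictionary record with hypothetical keys `algorithm` and `value`; it is an illustration, not how any particular repository stores this.

```python
import hashlib


def make_record(data: bytes, algorithm: str) -> dict:
    """Store the checksum together with the algorithm that produced it."""
    return {"algorithm": algorithm, "value": hashlib.new(algorithm, data).hexdigest()}


def validate(data: bytes, record: dict) -> bool:
    """Recompute with the recorded algorithm and compare to the stored value."""
    current = hashlib.new(record["algorithm"], data).hexdigest()
    return current == record["value"]


data = b"stand-in bytes for a media file"  # hypothetical file content
record = make_record(data, "md5")

print(validate(data, record))                            # True: same algorithm
print(hashlib.sha1(data).hexdigest() == record["value"])  # False: SHA-1 vs stored MD5
```

The second comparison fails even though the file is unchanged, because a 40-character SHA-1 value can never equal a 32-character MD5 value.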
So let's put this all together into a little demo using the tool, which I'm going to pull up. Okay, so this is the md5er tool and, as I said, it was developed by the Smithsonian Digitization Program Office, and it's a really handy little tool. I really quite like it because it's very, very easy to use. So I am going to open up just an image, and let's just highlight this and make it bigger so you can see what it looks like. It's just a picture of some yummy pastries, and we are going to go get a checksum on that file.
So the way this works, we are just gonna pull up the md5 tool again, and it says Drop a Folder to MD5. So you can either browse or you can just drag and drop, and I'm just gonna pick up this folder and drop it into the tool. It gives you a couple of options for the output file name. You can have the output file named md5.md5, or it uses the folder name and appends md5, or it uses the parent folder and subfolders and appends md5, because this utility executes an MD5 checksum. So now we can go back to our folder. Let me get back to that and open it up, and now you see that I have a file with the name of my folder directory, you know, checksum-demo, and the md5 suffix. So I'm gonna go ahead and open this. I can just use Microsoft Word, okay, to display it, and maybe we can make this a little bit bigger... well, maybe not, heh heh. So, uh, oh, there we go. Let's just use this little block at the bottom. There we go. And as you can see, this is the result of the checksum. It ran through, it inspected all the bits in that file, ran them against the algorithm that generates the MD5 checksum, and that's the output.
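The general idea of the demo, hashing every file in one folder and writing the results to a file named after the folder, can be sketched in Python as follows. This is not the md5er tool's code, and the output line layout (`checksum`, two spaces, `file name`) is an assumption borrowed from common checksum utilities; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def folder_manifest(folder):
    """Map file name to MD5 for the files directly inside folder (no subfolders)."""
    folder = Path(folder)
    return {
        path.name: md5_of_file(path)
        for path in sorted(folder.iterdir())
        # Skip earlier manifest files so they are not hashed themselves.
        if path.is_file() and path.suffix != ".md5"
    }


def write_manifest(folder):
    """Write '<folder name>.md5' inside the folder and return its path."""
    folder = Path(folder)
    out = folder / (folder.name + ".md5")
    lines = [f"{digest}  {name}" for name, digest in folder_manifest(folder).items()]
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Calling `write_manifest("checksum-demo")` on a folder of that name would produce `checksum-demo/checksum-demo.md5`. Unlike the tool in the demo, which added a new entry on the second run, this sketch overwrites the file each time.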
So now let's take that same file, I don't want to save it, and we are going to make an edit to the file. Just zoom this up a little bit. I'll just do something very, very basic here, and I'm just going to do a crop. So it's cropped. Saved it, closed that out, and now I have edited that file. I'm going to go back and open up the md5er tool and drop the folder into the tool and open the file up again. Okay, and now you can see this is the original checksum with the file name, and then the second time I executed it, it added another entry to the file, and that has changed because I modified the file. So it's a really handy tool. And close this out.
The Smithsonian makes it available. You can go to GitHub, so you can go to GitHub Smithsonian MD5 tool, and there it has all the documentation and you can download it. The version that I used is the older version. It's, uh, the MD5ER-2016; it's the older one. It's been updated, and the difference between the two versions is that the one I used just inspects a single folder and all the contents that are in that folder, and the newer one will execute on any subfolders and do a recursive generation. But, you know, they're both excellent! Very handy tools to have.
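The difference between the two behaviors, a single folder versus a recursive walk of its subfolders, can be illustrated with Python's `pathlib`. The function `files_to_hash` and its `recursive` flag are hypothetical names for this sketch and do not reflect either version's actual code or options.

```python
from pathlib import Path


def files_to_hash(folder, recursive=False):
    """List files to checksum, as paths relative to folder.

    recursive=False looks only at the folder's direct contents;
    recursive=True also descends into every subfolder.
    """
    folder = Path(folder)
    candidates = folder.rglob("*") if recursive else folder.iterdir()
    return sorted(p.relative_to(folder).as_posix() for p in candidates if p.is_file())
```

For a folder containing `a.txt` and `sub/b.txt`, the non-recursive call lists only `a.txt`, while the recursive call lists both.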
So when acquiring digital media artworks, it's important to also acquire a checksum for every component file. Ideally, if possible, checksums should be generated when the acquired artwork is created by the artist. But if that is not feasible, then a checksum should be generated as part of the formal acquisition process, before the artwork is accepted or ingested into your preservation archival repository, or what I like to refer to as the Digital Art Warehouse. If the artwork is being transferred over one or more networks, then you might want to consider generating a checksum and validating it at every transfer point. And the checksum value should be stored as part of the digital artwork documentation and preservation package.
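Validating at a transfer point can be sketched as: compute the checksum at the source, copy the file, recompute at the destination, and refuse to continue on a mismatch. In the Python below a local copy with `shutil.copyfile` stands in for a network transfer, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex checksum of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_and_validate(source, destination, algorithm="md5"):
    """Copy source to destination and confirm the checksum survived the move."""
    expected = file_checksum(source, algorithm)
    shutil.copyfile(source, destination)  # stand-in for the real transfer step
    actual = file_checksum(destination, algorithm)
    if actual != expected:
        raise ValueError(f"checksum mismatch: expected {expected}, got {actual}")
    return actual  # value to store with the artwork documentation
```

With several hops, the same check would be repeated after each one, always using the algorithm that produced the original value.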
This is from the Smithsonian Institution Digital Art Model, Preservation Section, in our DAMS. So in our DAMS we are storing born-digital, time-based media artworks, and we have a preservation section as part of our model, and this is the part that focuses on the checksums. So the preservation ID for this particular item that was ingested, this component, is the originally acquired checksum. When it gets ingested into the DAMS we do a checksum again, and we validate it, and, let's take a look here, then we also perform a yearly audit. So every year we go through all our digital artworks and we perform quite an extensive audit, one part of which is validating the checksum. This one was executed last year in October, and it gives what the result of that audit is.
Thank you for taking the time to listen to this video and this conversation on checksums. I hope the information is useful. Thank you very much!
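The checksum step of a periodic audit like the one described for the preservation section could look roughly like the Python sketch below. The record fields `file`, `date`, and `result` and the status words are invented for illustration and are not the DAMS data model; file contents are passed in as bytes to keep the example self-contained.

```python
import hashlib
from datetime import date


def audit(stored, current_files, algorithm="md5", audit_date=None):
    """Compare stored checksums with freshly computed ones.

    stored:        {file name: checksum recorded at acquisition or ingest}
    current_files: {file name: bytes currently held in the repository}
    Returns one dated result row per stored file.
    """
    audit_date = audit_date or date.today()
    results = []
    for name, expected in sorted(stored.items()):
        data = current_files.get(name)
        if data is None:
            status = "missing"
        elif hashlib.new(algorithm, data).hexdigest() == expected:
            status = "pass"
        else:
            status = "fail"
        results.append({"file": name, "date": audit_date.isoformat(), "result": status})
    return results
```

Each row records when the check ran and whether the file still matches its original checksum, which is the kind of result the audit entry in the talk reports.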