Checksums Flashcards
Learn a bit more about checksums in detail.
What is a…
Checksum
A string generated by a hash algorithm (hash function) that allows us to detect changes to a stream of data, e.g. by comparing the result of the hash algorithm after a data transfer with the one generated before the transfer.
In digital preservation we tend to use the term checksum interchangeably with the word hash – the fixed length string generated by something called a cryptographic hash function (MD5, SHA1, SHA256, etc.).
Checksum may also refer to the process of comparing two checksum values – checking the sum – for changes in the data stream.
A checksum will usually be made up of hexadecimal characters 0-9 and A-F, e.g.
d41d8cd98f00b204e9800998ecf8427e
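The before-and-after comparison described in this card can be sketched in a few lines of Python. This is a minimal illustration only: the payload, the helper name md5_hex and the simulated corruption are assumptions made for the example, not part of any particular preservation tool.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of a byte string as lowercase hexadecimal."""
    return hashlib.md5(data).hexdigest()


# Hypothetical payload standing in for the contents of a file.
original = b"digital preservation"
checksum_before = md5_hex(original)

# Simulate a faithful transfer and a corrupted one (final byte altered).
received_ok = bytes(original)
received_bad = b"digital preservatiom"

print(md5_hex(received_ok) == checksum_before)   # checksums match
print(md5_hex(received_bad) == checksum_before)  # checksums do not match
```

Only the stored checksum needs to travel with the data; the receiving side recomputes it and compares the two values.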
What is the…
Purpose of a Checksum
A checksum algorithm calculates a fixed length string based on the data in a file alone.
A file containing the letters USA will always have the MD5 checksum:
f75d91cdd36b85cc4a8dfeca4f24fa14
If a single bit changes, the checksum will be unrecognisably different.
A file containing the letters USB (USA to USB, a change of two bits) has the checksum:
7aca5ec618f7317328dcd7014cf9bdcf
Checksums are great for spotting data integrity errors – the key to digital preservation.
Bit level preservation is simply about checking the checksums – constantly.
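The USA and USB comparison can be tried out with Python's standard hashlib module. The sketch assumes each file holds exactly three ASCII bytes with no trailing newline; if the files on the card were created differently, the printed values will not match the ones quoted.

```python
import hashlib

a = b"USA"
b = b"USB"

# Count the bits that differ between the two inputs.
bits_changed = sum(bin(x ^ y).count("1") for x, y in zip(a, b))

md5_a = hashlib.md5(a).hexdigest()
md5_b = hashlib.md5(b).hexdigest()

# Count the hexadecimal positions at which the two checksums differ.
positions_changed = sum(1 for p, q in zip(md5_a, md5_b) if p != q)

print("bits changed in the input:", bits_changed)
print("MD5 of USA:", md5_a)
print("MD5 of USB:", md5_b)
print("hex positions changed in the checksum:", positions_changed)
```

A small change in the input should alter most positions of the checksum, which is what makes the change easy to spot.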
What is the meaning of…
Data vs. Filename
Checksums are calculated on the data inside a file. If a filename changes, the checksum of the file stays the same because the data inside has not changed. If a file is copied and given another filename, the checksums of the two files will be identical.
Checksums only operate on the data inside the file.
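A short experiment makes the point concrete. The sketch below writes a file into a temporary directory, copies it under a new name, renames the original, and checksums all three. The filenames and the helper file_md5 are hypothetical, chosen only for the illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Checksum the bytes inside the file; the filename plays no part."""
    return hashlib.md5(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.txt"
    original.write_bytes(b"USA")
    checksum_original = file_md5(original)

    # Copy the file under a different name.
    copy = Path(tmp) / "copy-of-report.txt"
    shutil.copyfile(original, copy)
    checksum_copy = file_md5(copy)

    # Rename the original file.
    renamed = Path(tmp) / "renamed.txt"
    original.rename(renamed)
    checksum_renamed = file_md5(renamed)

print(checksum_original == checksum_copy == checksum_renamed)
```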
What is a…
Hash Function
A mapping of data of arbitrary length to a fixed length string. The output of a hash function can be called a hash value, hash code, digest, or simply a hash. A checksum in digital preservation is a hash of the data inside a file.
What is a…
Digest
A fixed length string. The output of a hash function.
What is a…
Cryptographic Hash Function
A cryptographic hash function is a one way function: the original data cannot be determined from the hash value itself because it is infeasible to invert the function. Cryptographic hashes are considered quick to compute. The cryptographic hash functions employed in digital preservation are widely used elsewhere too, so they are well tested, and many tools support their use in our workflows.
What is a…
One Way Function
A transformation of data such that the result cannot be transformed back into the original.
What do checksums look like?
Fixed length strings. Hexadecimal characters 0-9, A-F.
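The fixed length property can be checked directly. As a minimal sketch, the code below hashes inputs of very different sizes with three common algorithms and records the length of each hexadecimal checksum; the sample inputs are arbitrary.

```python
import hashlib

# Inputs of very different sizes: empty, three bytes, one million bytes.
inputs = [b"", b"USA", b"x" * 1_000_000]

digest_lengths = {}
for name in ("md5", "sha1", "sha256"):
    # Collect the distinct hex-string lengths seen for this algorithm.
    lengths = {len(hashlib.new(name, data).hexdigest()) for data in inputs}
    digest_lengths[name] = sorted(lengths)

for name, lengths in digest_lengths.items():
    print(name, lengths)
```

Each algorithm reports a single length regardless of input size: 32 characters for MD5, 40 for SHA-1 and 64 for SHA-256.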
What is…
MD5
- Message Digest 5.
- 32 character string.
- Theoretically, about 21 quintillion files are needed before an accidental collision becomes likely.
What is…
SHA-1
- Secure Hash Algorithm 1.
- 40 character string.
- Theoretically, about 1 septillion files are needed before an accidental collision becomes likely.
What is…
SHA-256
- Secure Hash Algorithm 256.
- 64 character string.
- Theoretically, about 400 undecillion files are needed before an accidental collision becomes likely.
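The collision figures quoted for MD5, SHA-1 and SHA-256 appear consistent with the birthday bound: the number of random inputs needed for roughly a 50% chance that two of them share a checksum. That reading is an assumption, since the cards do not say how the figures were derived. Under it, the standard approximation n ≈ sqrt(2 ln 2) × 2^(bits/2) can be sketched as follows (the function name is hypothetical).

```python
import math


def files_for_even_odds(bits: int) -> float:
    """Approximate number of random inputs for a 50% collision chance.

    Uses the birthday-bound approximation n ~ sqrt(2 * ln 2 * 2**bits).
    """
    return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)


# Digest sizes in bits for the three algorithms on the cards.
digest_bits = {"MD5": 128, "SHA-1": 160, "SHA-256": 256}

estimates = {name: files_for_even_odds(bits) for name, bits in digest_bits.items()}

for name, estimate in estimates.items():
    print(f"{name}: about {estimate:.2e} files")
```

These estimates cover accidental collisions only; a deliberately engineered collision exploits weaknesses in the algorithm and can need far fewer computations.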
Other Cryptographic Hashes…
- BLAKE-256
- BLAKE-512
- MD5
- RadioGatún
- RIPEMD
- SHA-1
- SHA-256
- Spectral Hash
- Streebog
- SWIFFT
- Tiger
- Whirlpool
What is…
d41d8cd98f00b204e9800998ecf8427e
The MD5 checksum of a zero-byte file. Every hash algorithm produces a checksum for a zero-byte file, for example:
- MD5: d41d8cd98f00b204e9800998ecf8427e
- SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
- SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
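The zero-byte values are easy to reproduce, because hashing an empty byte string is equivalent to hashing an empty file. A minimal sketch using Python's standard hashlib module:

```python
import hashlib

# An empty byte string has the same content as a zero-byte file.
empty = b""

zero_byte_checksums = {
    name: hashlib.new(name, empty).hexdigest()
    for name in ("md5", "sha1", "sha256")
}

for name, checksum in zero_byte_checksums.items():
    print(f"{name}: {checksum}")
```

Recognising these values in a manifest is a quick way to spot files that were truncated to nothing or never written.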
What are…
Collisions
- A collision happens when two different data streams result in the same checksum value.
- This is a big concern when a checksum is used for security purposes (e.g. in password applications).
- A collision is computationally difficult to engineer but not impossible.
- Collisions can of course also occur by chance.
- An engineered collision for SHA1 recently took knowledge of the algorithm, plus 9,223,372,036,854,775,808 SHA-1 computations, 6,500 years of CPU (Central Processing Unit) time, and 110 years of GPU (Graphics Processing Unit) time, to create.
- Collisions are not a huge concern in digital preservation because multiple checksums may often be created for a single file to avoid such a situation.
- Archivists also have the concept of fixity.
- Collisions are a bigger concern when workflows rely on just a single checksum to align large amounts of data.
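Recording several checksums for one file, as mentioned above, does not require reading the file several times. The sketch below feeds each chunk of a stream to several hash objects in a single pass. The function name multi_checksum and the choice of algorithms are assumptions for illustration, not a prescribed workflow.

```python
import hashlib
import io


def multi_checksum(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=65536):
    """Read a binary stream once and return one checksum per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# In-memory streams stand in for two files opened in binary mode.
record = multi_checksum(io.BytesIO(b"USA"))
other = multi_checksum(io.BytesIO(b"USB"))

# Treat two files as identical only if every recorded checksum agrees.
same_file = all(record[name] == other[name] for name in record)
print(same_file)
```

Two different files would have to collide under every algorithm at once to be mistaken for each other, which is far less likely than a collision under one.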
What is…
SHA1DEEP
A useful tool, available for Linux and Windows, for generating checksums recursively for a directory or directories of files. SHA1DEEP has companion tools MD5DEEP and SHA256DEEP.
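The idea of recursive checksumming can be sketched in Python for readers who want to see what such a tool does in principle. This is not SHA1DEEP's implementation or output format: the function name sha1_tree and the demonstration files are hypothetical.

```python
import hashlib
import os
import tempfile


def sha1_tree(root):
    """Walk a directory tree and return {relative path: SHA-1 checksum}."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            hasher = hashlib.sha1()
            with open(path, "rb") as handle:
                # Read in chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(65536), b""):
                    hasher.update(chunk)
            results[os.path.relpath(path, root)] = hasher.hexdigest()
    return results


# Demonstration on a throwaway directory with one nested file.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "empty.bin"), "wb"):
        pass  # creates a zero-byte file
    with open(os.path.join(tmp, "sub", "usa.txt"), "wb") as handle:
        handle.write(b"USA")
    manifest = sha1_tree(tmp)

for relative_path, checksum in sorted(manifest.items()):
    print(checksum, relative_path)
```

Storing the resulting manifest and regenerating it later gives a simple fixity check across a whole directory.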