Checksums Flashcards

Learn a bit more about checksums in detail.

1
Q

What is a…

Checksum

A

A string generated by a hash algorithm (hash function) that lets us detect changes to a stream of data, e.g. by comparing the result of the hash algorithm after a data transfer with the one we generated before the transfer.

In digital preservation we tend to use the term checksum interchangeably with the word hash – the fixed length string generated by something called a cryptographic hash function (MD5, SHA1, SHA256, etc.).

Checksum may also refer to the process of comparing two checksum values – checking the sum – for changes in the data stream.

A checksum will usually be made up of hexadecimal characters 0-9 and A-F, e.g.

d41d8cd98f00b204e9800998ecf8427e
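The before-and-after comparison described above can be sketched in Python with the standard hashlib module. This is only an illustration: the function names, the file names, and the choice of SHA-256 are assumptions made for the example, not part of any particular preservation tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the data in a file, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_and_verify(source, destination):
    """Copy a file, then report whether its checksum survived the copy."""
    before = file_checksum(source)
    shutil.copyfile(source, destination)
    after = file_checksum(destination)
    return before == after


# Hypothetical file names, created in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "original.bin"
    copy = Path(folder) / "copy.bin"
    original.write_bytes(b"some example data")
    print(transfer_and_verify(original, copy))
```

Reading in chunks keeps memory use flat for large files; the digest is the same as hashing the whole file at once.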

2
Q

What is the…

Purpose of a Checksum

A

A checksum algorithm calculates a fixed length string based on the data in a file alone.

A file containing the letters USA has the MD5 checksum:

f75d91cdd36b85cc4a8dfeca4f24fa14

and will always have that checksum, however many times it is calculated.

If even a single bit changes, the checksum will be unrecognisably different.

A file with the letters USB (USA to USB, a change of two-bits) has checksum:

7aca5ec618f7317328dcd7014cf9bdcf

Checksums are great for spotting data integrity errors – the key to digital preservation.

Bit level preservation is simply about checking the checksums – constantly.
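The USA and USB comparison can be tried out with a short Python sketch. It counts how many bits differ between the two inputs and between the two MD5 digests. The helper name is made up for the example, and the script hashes exactly the three bytes shown; whether the digests match the values quoted above depends on the original files containing exactly those bytes (for example, no trailing newline).

```python
import hashlib


def differing_bits(left, right):
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(a ^ b).count("1") for a, b in zip(left, right))


usa, usb = b"USA", b"USB"
digest_a = hashlib.md5(usa).digest()
digest_b = hashlib.md5(usb).digest()

print("input bits changed: ", differing_bits(usa, usb))
print("MD5 of USA:", digest_a.hex())
print("MD5 of USB:", digest_b.hex())
print("output bits changed:", differing_bits(digest_a, digest_b), "of", len(digest_a) * 8)
```

A small change to the input should flip a large share of the 128 output bits, which is why a changed file is so easy to spot.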

3
Q

What is the meaning of…

Data vs. Filename

A

Checksums are calculated on the data inside a file. If a filename changes, the checksum of the file stays the same because the data inside hasn’t been changed. If a file is copied and given another filename, the checksums of the two files will be identical.

Checksums only operate on the data inside the file.
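A minimal Python sketch of this point, using invented file names in a temporary directory: the file is renamed and then copied under a new name, and the checksum is compared each time. The helper name is an assumption for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path):
    """Return the MD5 hex digest of a file's contents; the name is never hashed."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "report.txt"
    first.write_bytes(b"USA")
    original = md5_of_file(first)

    # Renaming changes the filename only, so the checksum is unchanged.
    renamed = Path(folder) / "renamed-report.txt"
    first.rename(renamed)
    print(md5_of_file(renamed) == original)

    # A copy under a different name holds the same data, so the same checksum.
    copy = Path(folder) / "copy-of-report.txt"
    copy.write_bytes(renamed.read_bytes())
    print(md5_of_file(copy) == original)
```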

4
Q

What is a…

Hash Function

A

A mapping of data of arbitrary length to a fixed length string. The output of a hash function can be called a hash value, hash code, digest, or simply a hash. A checksum in digital preservation is a hash of the data inside a file.

5
Q

What is a…

Digest

A

A fixed length string. The output of a hash function.

6
Q

What is a…

Cryptographic Hash Function

A

A cryptographic hash function is a one-way function: the original data cannot be determined from the hash value itself, because it is infeasible to invert the function. Cryptographic hashes are also considered quick to compute. The cryptographic hash functions employed in digital preservation are widely used elsewhere too, so they are well tested and many tools support their use in our workflows.

7
Q

What is a…

One Way Function

A

A transformation of data such that the result cannot be transformed back into the original.

8
Q

What do checksums look like?

A

Fixed length strings. Hexadecimal characters 0-9, A-F.
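To see the fixed length in practice, the following Python sketch hashes inputs of very different sizes and reports the length of each checksum. The function name is invented for the example. Note that Python's hashlib prints the letters a-f in lowercase; the value is the same either way.

```python
import hashlib
import string


def checksum_lengths(data):
    """Return the number of hex characters each algorithm produces for `data`."""
    lengths = {}
    for name in ("md5", "sha1", "sha256"):
        checksum = hashlib.new(name, data).hexdigest()
        # Every character is one of 0-9 or a-f.
        assert all(ch in string.hexdigits for ch in checksum)
        lengths[name] = len(checksum)
    return lengths


for data in (b"", b"USA", b"x" * 1_000_000):
    print(len(data), "bytes ->", checksum_lengths(data))
```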

9
Q

What is…

MD5

A
  • Message Digest 5.
  • 32 character string.
  • Theoretically, about 21 quintillion files are needed for an even chance of an accidental collision.
10
Q

What is…

SHA-1

A
  • Secure Hash Algorithm 1.
  • 40 character string.
  • Theoretically, about 1 septillion files are needed for an even chance of an accidental collision.
11
Q

What is…

SHA-256

A
  • Secure Hash Algorithm 256.
  • 64 character string.
  • Theoretically, about 400 undecillion files are needed for an even chance of an accidental collision.
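
The collision figures quoted on the MD5, SHA-1 and SHA-256 cards are close to what the standard birthday-bound approximation gives for a 50% chance of an accidental collision. That this is how the figures were derived is an assumption; the sketch below simply evaluates the approximation sqrt(2 ln 2) × 2^(bits/2) for each digest size. The function name is invented for the example.

```python
import math


def files_for_even_chance_of_collision(bits):
    """Birthday-bound estimate of inputs needed for a ~50% chance of any collision."""
    return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)


for name, bits in (("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)):
    print(f"{name}: about {files_for_even_chance_of_collision(bits):.1e} files")
```

The results are on the order of 2 × 10^19, 1.4 × 10^24 and 4 × 10^38, in line with 21 quintillion, 1 septillion and 400 undecillion. These estimates concern chance collisions only, not collisions engineered by exploiting weaknesses in an algorithm.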
12
Q

Other Cryptographic Hashes…

A
  • BLAKE-256
  • BLAKE-512
  • MD5
  • RadioGatún
  • RIPEMD
  • SHA-1
  • SHA-256
  • Spectral Hash
  • Streebog
  • SWIFFT
  • Tiger
  • Whirlpool
13
Q

What is…

d41d8cd98f00b204e9800998ecf8427e

A

The MD5 checksum of a zero-byte file. Every hash algorithm generates a fixed, well-known value for a zero-byte file, for example:

  • MD5: d41d8cd98f00b204e9800998ecf8427e
  • SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
  • SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
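
The zero-byte values can be reproduced with Python's hashlib by hashing an empty byte string, which is what a zero-byte file contains. The function name is an assumption for the example.

```python
import hashlib


def empty_checksum(algorithm):
    """Return the hex digest an algorithm produces for zero bytes of data."""
    return hashlib.new(algorithm, b"").hexdigest()


for name in ("md5", "sha1", "sha256"):
    print(name, empty_checksum(name))
```

Seeing one of these values in a manifest is a useful warning sign that a file may have been truncated to nothing.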
14
Q

What are…

Collisions

A
  • A collision happens when two different data streams result in the same checksum value.
  • This is a big concern when a checksum is used for security purposes (e.g. in password applications).
  • A collision is computationally difficult to engineer but not impossible.
  • Collisions could of course be incidental.
  • An engineered collision for SHA-1 (announced in 2017) took knowledge of the algorithm, plus 9,223,372,036,854,775,808 (2^63) SHA-1 computations, 6,500 years of CPU (Central Processing Unit) time, and 110 years of GPU (Graphics Processing Unit) time, to create.
  • Collisions are not a huge concern in digital preservation because multiple checksums may often be created for a single file to avoid such a situation.
  • Archivists also have the concept of fixity.
  • Collisions are a bigger concern when workflows rely on just a single checksum to align large amounts of data.
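
The idea of creating multiple checksums for a single file can be sketched as follows. The function names and the choice of three algorithms are assumptions for the example; the point is that data is treated as unchanged only if every recorded checksum still matches, so a collision in one algorithm alone is not enough to go unnoticed.

```python
import hashlib


def multiple_checksums(data, algorithms=("md5", "sha1", "sha256")):
    """Return a checksum per algorithm for the same data."""
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}


def same_content(recorded, current):
    """Treat data as unchanged only if every recorded checksum still matches."""
    return all(current.get(name) == value for name, value in recorded.items())


recorded = multiple_checksums(b"original record")
print(same_content(recorded, multiple_checksums(b"original record")))
print(same_content(recorded, multiple_checksums(b"altered record")))
```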
15
Q

What is…

SHA1DEEP

A

A useful tool, available for Linux and Windows, for generating checksums recursively for a directory or directories of files. SHA1DEEP has companion tools MD5DEEP and SHA256DEEP.
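
The idea of recursive checksum generation can be illustrated in Python. This is a sketch of the concept only: it is not how SHA1DEEP is implemented and it does not reproduce that tool's options or output format. The function name and the example file names are invented.

```python
import hashlib
import os
import tempfile


def sha1_tree(root):
    """Map each file's path (relative to root) to the SHA-1 of its contents."""
    results = {}
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(folder, filename)
            digest = hashlib.sha1()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            results[os.path.relpath(path, root)] = digest.hexdigest()
    return results


# Build a small throwaway directory tree and checksum everything in it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "subfolder"))
    with open(os.path.join(root, "top.txt"), "wb") as handle:
        handle.write(b"USA")
    with open(os.path.join(root, "subfolder", "nested.txt"), "wb") as handle:
        handle.write(b"USB")
    for relative_path, checksum in sorted(sha1_tree(root).items()):
        print(checksum, relative_path)
```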

16
Q

What checksums are…

Supported in Rosetta

A

Checksums supported in the Rosetta digital preservation system are CRC32, MD5, and SHA1.

17
Q

What checksums are…

Supported by DROID

A

Checksums supported in DROID are MD5, SHA1, and SHA256.

18
Q

What is…

AV Preserve Fixity

A

AV Preserve Fixity is a software agent for scheduling the scanning and checking of checksums for a given directory or directories of files. If a comparison fails, that is, a file that is expected to match doesn’t, an email is sent alerting users to the error so that they can initiate procedures to restore the original data from backups. The tool is maintained by AV Preserve.

19
Q

What is…

De-duplication

A

Because a given data input will always output the same checksum value, checksums are great for de-duplication, that is, the removal of duplicate files containing the same information.

In an archival context this may be more complicated where a duplicate record has multiple contexts.

In some storage systems, checksums can be used to store no more than one copy of an object that can then be referenced from multiple contexts.
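
A minimal sketch of checksum-based duplicate detection, assuming the files are already in memory as a mapping of name to bytes. The function name, the file names, and the choice of SHA-256 are illustrative. It only reports groups of duplicates; deciding what to remove is left to the archivist, since duplicates may have different contexts.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group file names by the SHA-256 of their data; `files` maps name -> bytes."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only checksums shared by more than one file.
    return [names for names in groups.values() if len(names) > 1]


example = {
    "minutes.doc": b"meeting minutes",
    "minutes-copy.doc": b"meeting minutes",
    "agenda.doc": b"meeting agenda",
}
print(find_duplicates(example))
```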

20
Q

What is…

Authenticity and Integrity

A

Checksums can prove data hasn’t changed which can help us to prove a record’s authenticity and integrity from the point of transfer.

In UNESCO Memory of the World terms, integrity is the quality of being ‘uncorrupted and free of unauthorized and undocumented changes’ (UNESCO 2003).

21
Q

What is the…

UNESCO definition of integrity

A

The state of being whole, uncorrupted and free of unauthorised and undocumented changes. (UNESCO, 2003)

22
Q

What is…

Automation

A

Checksums are, for practical purposes, unique to a data stream and thus can become unique, fixed-length identifiers for files. We can keep track of our files through various automated workflows through the use of checksums.

23
Q

They’re…

Just a large number!

A

Checksums are just really big numbers. Computers are really good at working with numbers, which is why checksums are good for automated processes and comparisons. If we convert hexadecimal:

d41d8cd98f00b204e9800998ecf8427e

to a decimal number, for example in Google, we get approximately 2.8194977e+38
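
The same conversion can be done exactly in Python, whose integers are not limited in size. The variable names are chosen for the example.

```python
checksum = "d41d8cd98f00b204e9800998ecf8427e"

# Interpret the 32 hexadecimal characters as one base-16 number.
as_number = int(checksum, 16)

print(as_number)                          # the full decimal value
print(f"{as_number:.7e}")                 # scientific notation, as a search engine shows it
print(as_number.bit_length())             # how many bits the number occupies
print(f"{as_number:032x}" == checksum)    # converting back gives the original string
```

Comparing two checksums is therefore just comparing two 128-bit numbers, which is a very cheap operation.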

24
Q

What is…

Hexadecimal

A

A number system of 16 characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Hexadecimal can represent all numbers. Its primary application is the representation of binary data, with each byte written as two hexadecimal digits. Hexadecimal makes binary easier to read; for example, the number 255 is 0b11111111 in binary and 0xFF in hexadecimal. A hexadecimal number is often prefixed with the number zero and the letter ‘x’ to signal that the following characters are hexadecimal.
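
Python uses the same 0b and 0x prefixes, so the example can be checked directly. Python prints hexadecimal letters in lowercase; 0xff and 0xFF are the same number.

```python
value = 255

print(bin(value))                  # '0b11111111'
print(hex(value))                  # '0xff'

# Both notations parse back to the same number.
print(int("0xFF", 16), int("0b11111111", 2))

# Each byte is written as exactly two hexadecimal digits.
print(bytes([0, 15, 255]).hex())   # '000fff'
```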

25
Q

What is the meaning of…

Checksums vs. Fixity

A

If a checksum should fail for any reason, archivists also have the concept of fixity: the concept of ‘remaining fixed in state’. We can observe file dates, e.g. the modified and creation dates. We can also look at the content, and clues in the content, for features that help us to prove a digital file is what it purports to be. There is only one Domesday Book – we have many ways of proving this is what it is without a checksum value per se.

26
Q

What is…

Deterministic but Unpredictable

A

Cryptographic hashes are deterministic, meaning that for a given piece of data the same output, that is, the same checksum value, will always be generated.

Output is, however, unpredictable between inputs, meaning that similar (not the same) input results in a radically different looking checksum value, so the original data cannot be predicted.

27
Q

What is…

Uniform Distribution

A

A feature of a cryptographic hash function that makes it difficult to reverse engineer. Across the range of possible inputs, outputs are uniformly distributed, meaning every possible output has an equal chance of occurring. You won’t see chunks of similar checksums output for similar (not the same) chunks of data.

28
Q

What is…

Infeasible to Invert

A

Means it is computationally difficult and time consuming to reverse engineer the output of a cryptographic hash function. The only general mechanism to do it would be to try all possible combinations of input, yet the original data size is not known, and there are no clues to the original data type or content.
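
Trying all possible inputs only works when the space of inputs is tiny and known in advance. The sketch below makes that assumption explicit: it is told that the original data was exactly three capital letters, which is information a real attacker would not normally have. The function name is invented for the example.

```python
import hashlib
import itertools
import string


def brute_force_md5(target_hex, alphabet=string.ascii_uppercase, length=3):
    """Try every string of the given length; return the match as bytes, or None."""
    for candidate in itertools.product(alphabet, repeat=length):
        guess = "".join(candidate).encode("ascii")
        if hashlib.md5(guess).hexdigest() == target_hex:
            return guess
    return None


target = hashlib.md5(b"USA").hexdigest()
print(brute_force_md5(target))
print(26 ** 3, "guesses at most for three capital letters")
```

Each extra character multiplies the number of guesses, and without knowing the length or the alphabet the search has no bound, which is what makes inversion infeasible in practice.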

29
Q

What are…

Fuzzy Hashes

A

Having understood checksums, one might also be interested in fuzzy hashes. These are used in an alternative way to the checksums discussed here.

Fuzzy hashes are used to determine the similarity of content – e.g. to determine when only small changes have been made to a data stream.

This property of fuzzy hashes can be exploited to perform content sentencing, or to point users to similar content if there is a record available.