Thu Aug 10

Checksum Algorithm: A Guide to Data Verification

Data is everywhere in the digital world. We use data to communicate, store, process, and share information. However, data is also vulnerable to errors and attacks that can compromise its quality and security. How can we ensure that the data we use is accurate and authentic? One of the solutions is to use a checksum algorithm.

What is a checksum?

A checksum is a short, fixed-size value computed from a piece of data and used to check that data for errors. It is usually written as a sequence of numbers and letters, and is sometimes called a hash sum, a hash value, or simply a hash. A checksum is the result of running an algorithm on a piece of data, usually a file. For file downloads that algorithm is typically a cryptographic hash function, while network protocols and storage devices often use simpler algorithms such as sums or CRCs. By comparing the checksum that you generate from your version of the file with the one provided by the source of the file, you can confirm that your copy matches the original.

Checksums are important for data integrity and security, as they can help you detect any accidental or malicious changes that may have occurred during transmission or storage. For example, if you download a software update from a website, you can use a checksum to verify that the file you downloaded is not corrupted or tampered with by hackers. Similarly, if you send an email attachment to someone, you can use a checksum to confirm that the file they received is identical to the one you sent.

How does a checksum algorithm work?

The basic idea is to apply some mathematical operation to the data and produce a smaller output that represents the data. The output is usually a fixed-length string of bits or characters that can be easily compared with another checksum.

There are many ways to design a checksum algorithm, depending on the type and size of the data, the desired level of error detection or correction, the computational complexity and speed, etc. However, most checksum algorithms share some common characteristics:

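  • They are deterministic, meaning that they always produce the same output for the same input.
  • Some are one-way, meaning that it is easy to compute the output from the input but hard to compute the input from the output. This applies to cryptographic hash functions, not to simple checksums such as parity, sums, or CRCs.
  • They are sensitive, meaning that small changes in the input change the output. With cryptographic hash functions, even a one-bit change results in a large, unpredictable change in the output.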
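
The determinism and sensitivity described above can be observed with SHA-256 from Python's standard hashlib module. This is a minimal sketch: the helper names sha256_hex and differing_hex_digits and the sample strings are illustrative assumptions, not part of any standard tool.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


first = sha256_hex(b"hello world")
again = sha256_hex(b"hello world")    # same input
changed = sha256_hex(b"hello worle")  # last character altered

# Deterministic: the same input always gives the same output.
print("same input, same checksum:", first == again)

# Sensitive: a one-character change alters most of the digest.
print(differing_hex_digits(first, changed), "of", len(first), "hex digits differ")
```

Simple checksums such as parity or sums are deterministic too, but a small input change only causes a small change in their output.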

The following are some examples of common checksum algorithms:

Parity

The simplest checksum algorithm is the so-called longitudinal parity check, which breaks the data into “words” with a fixed number n of bits, and then computes the bitwise exclusive or (XOR) of all those words. The result is appended to the message as an extra word. In simpler terms, for n = 1 this means adding a bit to the end of the data bits to guarantee that there is an even number of ‘1’s.

To check the integrity of a message, the receiver computes the bitwise exclusive or of all its words, including the checksum; if the result is not a word consisting of n zeros, the receiver knows a transmission error occurred.

With this checksum, any transmission error which flips a single bit of the message, or an odd number of bits, will be detected as an incorrect checksum. However, an error that affects two bits will not be detected if those bits lie at the same position in two distinct words. Also, swapping of two or more words will not be detected. If the affected bits are independently chosen at random, the probability of a two-bit error being undetected is 1/n.

Parity is a very simple and fast checksum algorithm, but it has low reliability: it is only guaranteed to detect errors that flip an odd number of bits. It is often used as a basic error detection method in serial communication protocols such as UART. Ethernet, by contrast, protects each frame with a CRC-32.
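
The longitudinal parity check described above can be sketched in a few lines of Python. This is an illustration under assumed values: the 8-bit sample words and the function names parity_word and is_intact are made up for the example.

```python
from functools import reduce


def parity_word(words: list[int]) -> int:
    """XOR all words together to get the longitudinal parity word."""
    return reduce(lambda acc, word: acc ^ word, words, 0)


def is_intact(words_with_check: list[int]) -> bool:
    """A message passes if XOR-ing every word, checksum included, gives zero."""
    return parity_word(words_with_check) == 0


message = [0b10110010, 0b01101100, 0b11110000]  # three 8-bit words (n = 8)
check = parity_word(message)
sent = message + [check]

# A single flipped bit is detected.
damaged = sent.copy()
damaged[1] ^= 0b00000100
single_flip_detected = not is_intact(damaged)

# Two flips at the same bit position in different words cancel out.
masked = sent.copy()
masked[0] ^= 0b00000100
masked[2] ^= 0b00000100
double_flip_missed = is_intact(masked)

print(f"checksum word: {check:08b}")
print("single flip detected:", single_flip_detected)
print("same-position double flip missed:", double_flip_missed)
```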

Sum

A variant of the previous algorithm is to add all the “words” as unsigned binary numbers, discarding any overflow bits. The two's complement of the total is then appended to the message as an extra word or stored in a field reserved for it. This is called a sum or arithmetic checksum.

To check the integrity of a message, the receiver adds all the words of the message, including the checksum, and verifies that the result is zero (or some other predefined value). If not, it means that there is an error in the message.

Sum checksums are slightly more reliable than parity checksums, as they can detect some errors that affect two or more bits or words. However, they are still vulnerable to errors that cancel each other out, such as adding a value to one word and subtracting the same value from another. Because addition is commutative, they are also insensitive to the order of the words, so swapping or reordering words goes undetected.
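
As a rough sketch of an arithmetic checksum, the Python below uses 8-bit words and assumes the sender appends the two's complement of the sum, so that the receiver's total comes out to zero. The function names and the sample message are illustrative.

```python
def sum_checksum(data: bytes) -> int:
    """Add all bytes, discard overflow beyond 8 bits, return the two's complement."""
    total = sum(data) & 0xFF
    return (-total) & 0xFF


def verify(data_with_check: bytes) -> bool:
    """The message passes if all bytes, checksum included, sum to zero modulo 256."""
    return (sum(data_with_check) & 0xFF) == 0


message = b"ABCD"
check = sum_checksum(message)
sent = message + bytes([check])

altered = b"ABCE" + bytes([check])  # one byte changed: detected
swapped = b"BACD" + bytes([check])  # two bytes swapped: not detected

print("original passes:", verify(sent))
print("altered passes:", verify(altered))
print("swapped passes:", verify(swapped))
```

The swapped message passes because addition does not depend on the order of its operands.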

Sum checksums are commonly used in Internet protocols such as TCP, IP, and ICMP, which use a 16-bit ones' complement sum.
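
For illustration, here is a sketch of the 16-bit ones' complement checksum used by these protocols, following the procedure in RFC 1071. The function name internet_checksum is an assumption, and real protocol stacks compute it over specific header and payload fields that are not modeled here.

```python
def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte

    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]

    # Fold any carry out of the low 16 bits back in (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)

    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
check = internet_checksum(payload)

# The receiver runs the same computation over payload plus checksum;
# an undamaged message yields zero.
received = payload + check.to_bytes(2, "big")
print(f"checksum: {check:04x}")
print("verifies:", internet_checksum(received) == 0)
```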

CRC

A more sophisticated checksum algorithm is based on polynomial division. The data is treated as a polynomial with coefficients of 0 and 1, and is divided by another polynomial of a fixed degree. The remainder of this division is the checksum, which is appended to the message.

To check the integrity of a message, the receiver performs the same polynomial division on the message and verifies that the remainder is zero. If not, it means that there is an error in the message.

This type of checksum is called a cyclic redundancy check (CRC), and it has many variants depending on the choice of the divisor polynomial. Some common CRC algorithms are CRC-8, CRC-16, and CRC-32.

CRC checksums have high reliability: a well-chosen n-bit CRC detects every burst error no longer than n bits and all but a small fraction of longer errors. CRCs are primarily error-detecting codes; correcting errors generally requires additional information or techniques. They are widely used in data storage and transmission devices such as hard disks, CD-ROMs, and USB flash drives.
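
The polynomial division behind a CRC can be sketched bit by bit. The example below is a minimal, unoptimized CRC-8 that assumes the divisor polynomial x^8 + x^2 + x + 1 (written 0x07), a zero initial value, and no bit reflection. Other CRC variants use different parameters, so their results will differ. For CRC-32, Python's standard zlib.crc32 function is the practical choice.

```python
import zlib


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Remainder of dividing the message (shifted left by 8 bits) by the polynomial."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                # Top bit set: shift it out and subtract (XOR) the divisor.
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


message = b"123456789"
check = crc8(message)

# Appending the CRC makes the whole message divisible by the polynomial,
# so the receiver's remainder is zero when nothing was damaged.
sent = message + bytes([check])
print(f"CRC-8: {check:#04x}")
print("receiver remainder is zero:", crc8(sent) == 0)

# Library route for CRC-32.
print(f"CRC-32: {zlib.crc32(message):#010x}")
```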

Hash

A hash function is a special type of checksum algorithm that produces a fixed-length output from a variable-length input. The output is usually much smaller than the input, and it is often called a digest or a fingerprint of the input.

Hash functions have many applications in cryptography, such as digital signatures and authentication. However, they can also be used as checksums for verifying data integrity and authenticity.

A hash function has two main properties:

  • It is collision-resistant, meaning that it is hard to find two different inputs that produce the same output.
  • It is preimage-resistant, meaning that it is hard to find an input that produces a given output.

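Some common hash functions are MD5, SHA-1, and SHA-256. MD5 and SHA-1 still catch accidental corruption, but practical collision attacks exist for both, so they should not be relied on to detect deliberate tampering.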

Hash functions have very high reliability and security, and can detect virtually any change in the input data. However, they are more computationally intensive than simple checksums such as sums or CRCs. A bare hash only proves integrity relative to a trusted reference value, so hashes are often used in combination with other methods, such as digital signatures or HMACs, to provide both data integrity and data authenticity.
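
To make the difference between a plain hash and an HMAC concrete, here is a small sketch using Python's standard hashlib and hmac modules. The key, the message, and the helper names digest, tag, and tag_is_valid are placeholders chosen for the example.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Plain SHA-256 digest: anyone can recompute it, so it shows integrity only."""
    return hashlib.sha256(data).hexdigest()


def tag(key: bytes, data: bytes) -> str:
    """HMAC-SHA256 tag: can only be produced by someone who knows the key."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def tag_is_valid(key: bytes, data: bytes, received_tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(tag(key, data), received_tag)


shared_key = b"example-shared-secret"  # placeholder key for illustration
message = b"release-1.0.tar.gz contents"

sent_tag = tag(shared_key, message)

print("digest:", digest(message))
print("genuine message accepted:", tag_is_valid(shared_key, message, sent_tag))
print("altered message accepted:", tag_is_valid(shared_key, message + b"!", sent_tag))
print("wrong key accepted:", tag_is_valid(b"other-key", message, sent_tag))
```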

How to calculate a checksum?

To calculate a checksum, you need to run a program that puts your file through a cryptographic hash function. The hash function takes an input file of any size and produces a string of a fixed length that represents that file. The input file can be a small 1 MB file or a massive 4 GB file, but either way, you will end up with a checksum of the same length.

The hash function works in such a way that even a small change in the file will produce a very different checksum, making it easy to spot any discrepancies between two files. For example, if you change just one letter or punctuation mark in your file, you will get a completely different checksum. This makes checksums very sensitive and accurate for error detection.
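
The same calculation can be scripted. The sketch below uses Python's standard hashlib module and reads the file in chunks, so a multi-gigabyte file does not need to fit in memory. The function name file_checksum, the 1 MiB chunk size, and the throwaway demo file are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file chunk by chunk and return the digest in hexadecimal."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo on a temporary file so the example does not depend on existing files.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello checksum\n")
    demo_path = tmp.name

original = file_checksum(demo_path)

# Append a single character and hash again.
with open(demo_path, "ab") as handle:
    handle.write(b"!")
modified = file_checksum(demo_path)

os.remove(demo_path)

print("same length:", len(original) == len(modified))
print("different value:", original != modified)
```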

There are different ways to calculate a checksum on different platforms, depending on the tools and commands available. Here are some examples of how to calculate a checksum on Windows, Mac, and Linux using built-in or third-party tools.

Windows

On Windows, you can use the built-in CertUtil command to calculate a checksum using various algorithms. To use this command, you need to open the Command Prompt and navigate to the folder that contains the file you want to check. Then, you need to type the following command and press Enter:

certutil -hashfile <file> <algorithm>

Replace <file> with the name of your file and <algorithm> with the name of the algorithm you want to use, such as MD5, SHA1, SHA256, or SHA512. For example, if you want to calculate the MD5 checksum of a file named test.txt, you would type:

certutil -hashfile test.txt MD5

The command will output the checksum of your file in hexadecimal format. You can copy and paste this value for later comparison or verification.

Alternatively, you can use a third-party tool such as GtkHash, a graphical application that lets you select your file and algorithm and generate a checksum with a few clicks. Microsoft’s FCIV (File Checksum Integrity Verifier) is another option, but it is a command-line utility rather than a graphical one, and it supports only MD5 and SHA-1. You can download these tools from their official websites and follow their instructions to install and use them.

Mac

On Mac, you can use the built-in shasum command to calculate a checksum using various algorithms. To use this command, you need to open the Terminal and navigate to the folder that contains the file you want to check. Then, you need to type the following command and press Enter:

shasum -a <algorithm> <file>

Replace <algorithm> with the number of the algorithm you want to use, such as 1 for SHA-1, 256 for SHA-256, or 512 for SHA-512. Replace <file> with the name of your file. For example, if you want to calculate the SHA-256 checksum of a file named test.txt, you would type:

shasum -a 256 test.txt

The command will output the checksum of your file in hexadecimal format. You can copy and paste this value for later comparison or verification.

Alternatively, you can use a third-party tool like Checksum+ or HashTab to calculate a checksum on Mac. These tools are graphical user interfaces that allow you to select your file and algorithm and generate a checksum with a few clicks. You can download these tools from their official websites or from the App Store and follow their instructions to install and use them.

Linux

On Linux, you can use the built-in md5sum, sha1sum, sha256sum, or sha512sum commands to calculate a checksum using various algorithms. To use these commands, you need to open the Terminal and navigate to the folder that contains the file you want to check. Then, you need to type the following command and press Enter:

<command> <file>

Replace <command> with the name of the command that corresponds to the algorithm you want to use, such as md5sum for MD5, sha1sum for SHA-1, sha256sum for SHA-256, or sha512sum for SHA-512. Replace <file> with the name of your file. For example, if you want to calculate the SHA-512 checksum of a file named test.txt, you would type:

sha512sum test.txt

The command will output the checksum of your file in hexadecimal format. You can copy and paste this value for later comparison or verification.

Alternatively, you can use a third-party tool like GtkHash or Hasher GUI to calculate a checksum on Linux. These tools are graphical user interfaces that allow you to select your file and algorithm and generate a checksum with a few clicks. You can download these tools from their official websites or from your distribution’s package manager and follow their instructions to install and use them.

How to verify a checksum?

To verify a checksum, you need to compare it with the expected value provided by the source of the file. The source of the file is usually the website or person that offers the file for download or transfer. The expected value is usually displayed on the website or sent along with the file as a separate text file.

To compare two checksums, you need to make sure that they are calculated using the same algorithm and that they are in the same format (usually hexadecimal, where letter case does not matter). Then, you need to check if they are identical in every character. If they are identical, your file matches the one the source published. If they are different, your file is corrupted or has been tampered with, and you should not use it.
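
Comparing long hexadecimal strings by eye is error-prone, so the comparison itself can be scripted. The sketch below is one possible approach: the helper names normalize and checksums_match are illustrative, and the normalization (dropping whitespace and ignoring letter case) is an assumption about how the two values may differ in formatting.

```python
import hmac


def normalize(checksum: str) -> str:
    """Remove all whitespace and lowercase the hex digits."""
    return "".join(checksum.split()).lower()


def checksums_match(expected: str, actual: str) -> bool:
    """Return True only if every character matches after normalization."""
    return hmac.compare_digest(normalize(expected), normalize(actual))


published = "BA7816BF 8F01CFEA 414140DE 5DAE2223 B00361A3 96177A9C B410FF61 F20015AD"
computed = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

print("match:", checksums_match(published, computed))
print("truncated value matches:", checksums_match(published, computed[:8]))
```

Comparing the full strings, rather than only the first or last few characters, avoids accepting a file whose checksum merely starts the same way.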

There are different ways to verify a checksum on different platforms, depending on the tools and methods available. Here are some examples of how to verify a checksum on Windows, Mac, and Linux using built-in or third-party tools.

Windows

On Windows, you can use the built-in CertUtil command to verify a checksum. CertUtil does not compare values for you; it recomputes the checksum so that you can compare it with the expected one. Open the Command Prompt, navigate to the folder that contains the file you want to check, then type the following command and press Enter:

certutil -hashfile <file> <algorithm>

Replace <file> with the name of your file and <algorithm> with the algorithm that was used to generate the expected checksum, such as MD5, SHA1, SHA256, or SHA512. For example, if you want to verify the MD5 checksum of a file named test.txt, you would type:

certutil -hashfile test.txt MD5

The command will display the checksum of your file in hexadecimal format. You can copy and paste this value and compare it with the expected value provided by the source of the file. If they match, it means that your file is verified. If they don’t match, it means that your file is not verified.

Alternatively, you can use a third-party tool such as GtkHash, a graphical application that lets you select your file and algorithm and compare your checksum with the expected value with a few clicks. Microsoft’s FCIV (File Checksum Integrity Verifier) is another option, but it is a command-line utility rather than a graphical one, and it supports only MD5 and SHA-1. You can download these tools from their official websites and follow their instructions to install and use them.

Mac

On Mac, you can use the built-in shasum command to verify a checksum. Open the Terminal, navigate to the folder that contains the file you want to check, then type the following command and press Enter:

shasum -a <algorithm> <file>

Replace <algorithm> with the number of the algorithm that was used to generate the expected checksum, such as 1 for SHA-1, 256 for SHA-256, or 512 for SHA-512. Replace <file> with the name of your file. For example, if you want to verify the SHA-256 checksum of a file named test.txt, you would type:

shasum -a 256 test.txt

The command will display the checksum of your file in hexadecimal format. You can copy and paste this value and compare it with the expected value provided by the source of the file. If they match, it means that your file is verified. If they don’t match, it means that your file is not verified.

Alternatively, you can use a third-party tool like Checksum+ or HashTab to verify a checksum on Mac. These tools are graphical user interfaces that allow you to select your file and algorithm and compare your checksum with the expected value with a few clicks. You can download these tools from their official websites or from the App Store and follow their instructions to install and use them.

Linux

On Linux, you can use the built-in md5sum, sha1sum, sha256sum, or sha512sum commands to verify a checksum. Open the Terminal, navigate to the folder that contains the file you want to check, then type the following command and press Enter:

<command> <file>

Replace <command> with the name of the command that corresponds to the algorithm that was used to generate the expected checksum, such as md5sum for MD5, sha1sum for SHA-1, sha256sum for SHA-256, or sha512sum for SHA-512. Replace <file> with the name of your file. For example, if you want to verify the SHA-512 checksum of a file named test.txt, you would type:

sha512sum test.txt

The command will display the checksum of your file in hexadecimal format. You can copy and paste this value and compare it with the expected value provided by the source of the file. If they match, it means that your file is verified. If they don’t match, it means that your file is not verified.

Alternatively, you can use a third-party tool like GtkHash or Hasher GUI to verify a checksum on Linux. These tools are graphical user interfaces that allow you to select your file and algorithm and compare your checksum with the expected value with a few clicks. You can download these tools from their official websites or from your distribution’s package manager and follow their instructions to install and use them.

Common problems and solutions

Checksums are very useful for data integrity and security, but they are not perfect. There are some problems and issues that can occur with checksums, such as collisions, false positives, false negatives, and corrupted files. Here are some explanations and solutions for these problems and issues.

Collisions

A collision occurs when two different files produce the same checksum. This can happen because there are more possible files than possible checksums. For example, if you use a 32-bit checksum such as CRC-32, there are only 2^32 (about 4.3 billion) possible checksums, but there are infinitely many possible files. Therefore, there is a chance that two different files will have the same checksum.

Collisions can compromise the reliability of checksum verification because they can make you think that two files are identical when they are not. For example, if a hacker manages to create a malicious file that has the same checksum as a legitimate file, they can trick you into downloading or installing their file instead of the original one.

To avoid collisions, you should use a collision-resistant hash function like SHA-256 or SHA-512. These produce much longer checksums (256 bits and 512 bits, respectively), which makes the probability of a collision negligible. You should also check the source of the file carefully and make sure that it is trustworthy and reputable.
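
The effect of checksum length on collisions can be demonstrated by deliberately shortening a strong hash. The sketch below keeps only the first 16 bits of a SHA-256 digest and searches for two inputs that share them. The 16-bit truncation, the counter-based inputs, and the function names are assumptions made so the search finishes quickly; this is not an attack on SHA-256 itself.

```python
import hashlib
from itertools import count


def truncated_digest(data: bytes, bits: int) -> int:
    """Keep only the first `bits` bits of the SHA-256 digest."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - bits)


def find_collision(bits: int) -> tuple[bytes, bytes]:
    """Try inputs b'0', b'1', b'2', ... until two share a truncated digest."""
    seen: dict[int, bytes] = {}
    for i in count():
        candidate = str(i).encode()
        short = truncated_digest(candidate, bits)
        if short in seen:
            return seen[short], candidate
        seen[short] = candidate


first, second = find_collision(16)

print("inputs:", first, second)
print("16-bit checksums equal:",
      truncated_digest(first, 16) == truncated_digest(second, 16))
print("full SHA-256 digests equal:",
      hashlib.sha256(first).digest() == hashlib.sha256(second).digest())
```

With only 65,536 possible 16-bit values, a repeat is guaranteed within 65,537 inputs and typically appears after a few hundred. At 256 bits the same brute-force search is far out of reach.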

False Positives

A false positive occurs when verification reports a match even though the two files have different contents, and the cause is not a genuine collision but a bug or error in the hash function implementation or in the way the checksum is generated or compared. For example, a mistake in the command that you use to generate or verify the checksum, such as hashing the wrong file or comparing only the first few characters of the checksum, can produce a misleading match.

False positives can compromise the accuracy of checksum verification because they can make you think that two files are identical when they are not. For example, if there is an error in the hash function that makes it ignore some parts of the file or add some extra characters to it, it can result in a false positive.

To avoid false positives, you should use a well-tested and reliable hash function and program that generate and verify checksums correctly. You should also double-check the commands and parameters that you use to generate and verify checksums and make sure that they are correct and consistent.

False Negatives

A false negative occurs when two files have the same apparent contents but produce different checksums. A checksum is computed over the exact bytes of a file, so any difference in format, encoding, or metadata stored inside the file changes it, even when the visible content looks the same. For example, if you convert a file from one format to another, such as from PDF to DOCX, or if you change the encoding of a file, such as from UTF-8 to UTF-16, or if you modify metadata embedded in the file, such as the author, the checksum will be different.

False negatives can compromise the validity of checksum verification because they can make you think that two files are different when they are not. For example, if you change the format or encoding of a file without changing its content, it can result in a false negative.

To avoid false negatives, compare checksums only for byte-for-byte copies of the same file, and make sure that no tool along the way rewrites it. You should also avoid changing the format, encoding, or embedded metadata of your file unless necessary. If you do change them, you should generate a new checksum for the modified file.
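
A short sketch shows why this happens: the checksum is computed over bytes, not over the text a person sees. The sample strings below are illustrative.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


text = "checksum"

# The same text stored with two different encodings.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")

# The same two lines stored with Unix and Windows line endings.
unix_lines = b"line one\nline two\n"
windows_lines = b"line one\r\nline two\r\n"

print("same text:", as_utf8.decode("utf-8") == as_utf16.decode("utf-16"))
print("same checksum across encodings:", sha256_hex(as_utf8) == sha256_hex(as_utf16))
print("same checksum across line endings:", sha256_hex(unix_lines) == sha256_hex(windows_lines))
```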

Corrupted Files

A corrupted file is a file that has been damaged or altered in some way that makes it unreadable or unusable. This can happen for various reasons, such as a power outage, a disk failure, a virus infection, or a human error. A corrupted file will almost always produce a different checksum than the original one.

Corrupted files can compromise the quality and functionality of your data because they can make it impossible or difficult to open, view, edit, or execute your file. For example, if your file is corrupted, it can display an error message, show garbled text or images, crash your program, or cause other problems.

To avoid corrupted files, you should back up your data regularly and store it in a safe and secure location. You should also scan your data for viruses and malware and remove any threats that you find. Finally, use a checksum verification tool to check your data for errors. A checksum can only tell you that a file has changed, not repair it, so replace any file that fails verification with a fresh download or a copy from backup.

Conclusion

Checksums are powerful tools that can help you ensure data integrity and security. By generating and verifying checksums for your files, you can check if your data is genuine and error-free. You can also detect and resolve any problems or issues that may occur with checksums, such as collisions, false positives, false negatives, and corrupted files.