File encryption in 2024

Lately, I was wondering how 7zip and other tools secure data when password protection is enabled.
After a few hours, I realized that I kept jumping from one interesting topic to the next and couldn't stop researching encryption approaches and the common pitfalls when using them.

This is a brief summary of things I learned in the process.

What is encryption?

Encryption is a way of scrambling data, so that only authorized parties can understand the information.
It is the process of converting human-readable plaintext to incomprehensible text, also known as ciphertext.

File encryption is a security method that converts your files into ciphertext or unreadable data.
By using this method, you can be sure that, even if unauthorized people access your files, they won't be able to understand the contents without the decryption key.

In cryptography, a key is a string of characters used within an encryption algorithm for altering data.
Like a physical key, it locks (encrypts) data so that only someone with the right key can unlock (decrypt) it.
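
To make the lock-and-key picture a bit more concrete, here is a minimal Python sketch of about the simplest cipher there is: XORing the data with a random key of the same length (a one-time pad). It only illustrates that the same key scrambles and unscrambles the data. The helper name xor_bytes is made up for this example, and real tools use AES instead.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every data byte with the key byte at the same position."""
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"Hello World"
key = secrets.token_bytes(len(plaintext))   # random key, meant to be used only once

ciphertext = xor_bytes(plaintext, key)      # "lock" the data
recovered = xor_bytes(ciphertext, key)      # "unlock" it with the same key

print(ciphertext.hex())   # unreadable without the key, different on every run
print(recovered)          # b'Hello World'
```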

Doesn't sound difficult, right? So let's try it out:

# Encrypt "Hello World" (plaintext)
echo "Hello World" | openssl enc -base64 -e -aes-256-cbc -salt -pass pass:Example123 -pbkdf2
# Result (differs on every run because the salt is random), for example: U2FsdGVkX1+pjGlozc7YPZrmAZjw3GyO4BwnWo1ZbVI=

# Decrypt the ciphertext
echo U2FsdGVkX1+pjGlozc7YPZrmAZjw3GyO4BwnWo1ZbVI= | openssl enc -base64 -d -aes-256-cbc -salt -pass pass:Example123 -pbkdf2
# Result: Hello World

Seems to work! But what are those parameters actually doing?
What is AES-256? CBC? Salt? ...and PBKDF2?

Well... let's dive into all those topics and find out their purpose!

⚠️ Caution:
This example uses a weak password ("Example123") and CBC mode for demonstration purposes only. Passing the password on the command line also leaves it in your shell history.
In practice, always use strong, unique passwords (at least 12-16 characters, mixing uppercase, lowercase, numbers, and special characters) and consider an authenticated mode such as GCM,
as described in the following sections.
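
Before moving on, the Base64 string from the example can be taken apart with a few lines of Python. This is only a sketch for poking at the output: "openssl enc" writes the marker "Salted__", then the 8-byte salt, then the ciphertext. The variable names are my own.

```python
import base64

encoded = "U2FsdGVkX1+pjGlozc7YPZrmAZjw3GyO4BwnWo1ZbVI="
raw = base64.b64decode(encoded)

magic = raw[:8]        # fixed marker written by "openssl enc" when a salt is used
salt = raw[8:16]       # 8 random bytes, stored in the clear
ciphertext = raw[16:]  # the AES-CBC output

print(magic)             # b'Salted__'
print(salt.hex())
print(len(ciphertext))   # 16
```

The 12 bytes of "Hello World" plus the newline from echo are padded to one full 16-byte AES block, which is why the ciphertext part is exactly 16 bytes long.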

Symmetric vs Asymmetric

We differentiate between two encryption methods:

Symmetric encryption:
Uses a single secret key for encryption and decryption. Both parties must therefore share the same key before the encrypted files can be decrypted.
Symmetric encryption is generally faster and more efficient than asymmetric encryption. Organizations usually use this method when they need to encrypt information in bulk, such as an entire database.

However, symmetric encryption comes with a key distribution problem:
both parties need the same secret key, and keeping it secret while sharing it is the hard part.

For example, if encryption and decryption occur in different locations, the secret key must move between the two places, potentially becoming vulnerable to attacks in the process (e.g., Man-in-the-middle attacks).

Asymmetric or public-key cryptography:
Uses a public and a private key. Anyone with the public key can use it to encrypt files.
However, only users with the private key can decrypt the files, so the files remain safe from unauthorized access.
This solves the key distribution problem but is computationally more expensive.
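
The public/private split can be sketched with textbook RSA and tiny numbers. This is purely illustrative: the primes below are toy values chosen for readability, there is no padding, and real keys are thousands of bits long.

```python
# Textbook RSA with tiny primes: for illustration only, trivially breakable.
p, q = 61, 53
n = p * q                    # 3233, part of both keys
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (modular inverse, needs Python 3.8+)

public_key = (n, e)          # can be handed to anyone
private_key = (n, d)         # must stay secret

def encrypt(message: int, key) -> int:
    modulus, exponent = key
    return pow(message, exponent, modulus)

def decrypt(cipher: int, key) -> int:
    modulus, exponent = key
    return pow(cipher, exponent, modulus)

message = 65
cipher = encrypt(message, public_key)
print(cipher)                          # 2790
print(decrypt(cipher, private_key))    # 65
```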

Learning from HTTPS/TLS

While file encryption typically uses symmetric encryption with password-derived keys, it's worth understanding how modern internet security solves the key distribution problem through hybrid encryption, as used in HTTPS/TLS connections.

Hybrid encryption combines the best of both worlds:

  • Asymmetric cryptography (e.g., RSA or ECDH key agreement): Used to securely exchange or agree on the symmetric key.
  • Symmetric encryption (AES): Used for the actual encryption.

When you visit a secure website, your browser and the server use this hybrid approach.
They first establish a shared AES key using public-key cryptography, then use that AES key to encrypt all subsequent communication.
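
How two parties can end up with the same key without ever sending it can be sketched with a toy Diffie-Hellman exchange. Note the assumptions: this uses classic Diffie-Hellman with a small prime instead of the elliptic-curve variant (ECDH) that TLS typically uses, the names are invented, and hashing the secret with SHA-256 is a simplification of real key derivation.

```python
import hashlib
import secrets

# Toy parameters: 2**127 - 1 is prime, but far too small for real use.
P = 2**127 - 1
G = 3

# Each side picks a private value and publishes G**private mod P.
browser_private = secrets.randbelow(P - 2) + 1
server_private = secrets.randbelow(P - 2) + 1
browser_public = pow(G, browser_private, P)
server_public = pow(G, server_private, P)

# Both sides combine their own private value with the other's public value.
browser_secret = pow(server_public, browser_private, P)
server_secret = pow(browser_public, server_private, P)

def to_key(shared_secret: int) -> bytes:
    """Hash the shared secret into 32 bytes, the size of an AES-256 key."""
    return hashlib.sha256(shared_secret.to_bytes(16, "big")).digest()

browser_key = to_key(browser_secret)
server_key = to_key(server_secret)
print(browser_key == server_key)   # True
```

Only the public values travel over the network. The resulting 32 bytes would then serve as the symmetric key for the bulk encryption.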

This same principle applies to file encryption scenarios, where you need to share encrypted files with others without sharing passwords beforehand.
Tools like GPG use hybrid encryption to encrypt files with a recipient's public key, which then protects a symmetric key used for the actual file encryption.

However, for personal file encryption (like password-protecting a 7zip archive), the simpler approach of deriving a symmetric key from a password is, of course, more practical.

Advanced Encryption Standard (AES)

In the example above, we used the Advanced Encryption Standard (AES) to encrypt our plaintext.
Since its development, AES has been used to secure sensitive data around the world.

In 2001, the National Institute of Standards and Technology (NIST) standardized AES as the successor to the now obsolete Data Encryption Standard (DES).
It is the first (and only) publicly accessible cipher approved by the U.S. National Security Agency (NSA) for top secret information (see Wikipedia).

Today, AES is the de facto standard for most of the symmetric encryption you encounter online.

Key Sizes and Security

AES supports three key sizes:

  • AES-128: 128-bit keys (16 bytes)
  • AES-192: 192-bit keys (24 bytes)
  • AES-256: 256-bit keys (32 bytes)

The number in "AES-256" refers to the length of the cryptographic key in bits.

A 56-bit DES key was publicly brute-forced in less than a day as early as 1999. With a 128-bit AES key, testing all possible keys would take about 10.79 quintillion years
(a figure that matches an assumed rate of about one trillion keys per second), which is more than 700 million times the age of the universe.
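
The numbers can be checked with a quick back-of-the-envelope calculation. The rate of 10**12 keys per second and the 13.8 billion years for the age of the universe are assumptions of this sketch, not measured values.

```python
KEYS_PER_SECOND = 10**12            # assumed attacker speed (hypothetical)
SECONDS_PER_YEAR = 365 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9      # rough estimate

def years_to_try_all_keys(key_bits: int) -> float:
    """Time needed to test every possible key of the given length."""
    return 2**key_bits / KEYS_PER_SECOND / SECONDS_PER_YEAR

years_128 = years_to_try_all_keys(128)
hours_56 = years_to_try_all_keys(56) * SECONDS_PER_YEAR / 3600

print(f"{years_128:.3e} years for 128 bits")                  # 1.079e+19
print(f"{years_128 / AGE_OF_UNIVERSE_YEARS:.1e} x universe")  # 7.8e+08
print(f"{hours_56:.1f} hours for 56 bits")                    # 20.0
```

Every additional key bit doubles the work, which is why the jump from 56 to 128 bits turns hours into an astronomically large number of years.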

Since brute force attacks are not considered viable, attackers typically move on to other attack vectors, like obtaining the password or passphrase from other sources.
This is why using a strong password and storing it safely is crucial.

AES in Practice

AES is widely adopted across various applications and tools, according to my brief research (2024):

File Encryption Tools:

  • Adobe PDF: Uses AES 256-bit encryption
  • KeePass: Primarily uses AES-256 along with SHA-256, HMAC-SHA-256 and SHA-512
  • 7zip: Uses AES-256 in CBC mode for 7z archives (AES-128/192/256 are available for ZIP archives)
  • WinRAR: Uses AES-256 in CBC mode for the current RAR5 format (the older RAR4 format used AES-128)
  • BitLocker: Uses AES with configurable 128-bit or 256-bit keys (AES-128 is default)

Full Disk Encryption:

  • VeraCrypt: Offers AES alongside Serpent, Twofish, Camellia, and Kuznyechik
  • LUKS: Supports AES among other algorithms like Serpent, Twofish, CAST-128, and CAST-256

Being the only one of these algorithms with widespread hardware acceleration (for example, the AES-NI instructions on x86 CPUs), AES is by far the most widely used,
especially in enterprise environments and network attached storage (NAS) devices.

What is PBKDF2?

As the name suggests, the Password-Based Key Derivation Function 2 (PBKDF2) is used to derive (cryptographic) keys from a password.
Combined with a salt and a sufficiently high iteration count, it makes brute-force and rainbow table attacks considerably more expensive.

PBKDF2 applies a pseudorandom function (such as HMAC-SHA-256) to the input password/passphrase along with a salt value (see example above) and repeats the process many
times (iterations) to produce a derived key that can then be used as a cryptographic key in subsequent operations. This procedure is called key stretching.

The purpose of key stretching is typically to make passwords or passphrases more secure against brute-force attacks, by increasing
the resources it takes to test each possible key.

Since the key stretching algorithm is deterministic, the same password always produces the same key, which leaves weak passwords vulnerable to
rainbow table (precomputed hashes) attacks. For this reason, key stretching should always be combined with "salting".

By applying a different random "salt" (e.g., 16 random bytes) each time, the resulting key is different for each encryption, even for the same password,
so precomputed rainbow tables no longer help.

In short:
PBKDF2 takes our plaintext password and turns it into a 256-bit cryptographic key ("key stretching") that we can use for the AES-256 encryption.
The key is still only as hard to guess as the password itself, but every single guess becomes expensive.
If we provide a random salt for each encryption, we also take away the benefit of rainbow tables.
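
A minimal sketch with Python's standard library shows both effects. The iteration count of 600,000 and the helper name derive_key are assumptions of this example, so check current recommendations before reusing the numbers.

```python
import hashlib
import os

def derive_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte (256-bit) key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )

salt_a = os.urandom(16)   # random salt, stored next to the ciphertext
salt_b = os.urandom(16)

key_a = derive_key("Example123", salt_a)
key_a_again = derive_key("Example123", salt_a)
key_b = derive_key("Example123", salt_b)

print(len(key_a))             # 32
print(key_a == key_a_again)   # True: same password and same salt
print(key_a == key_b)         # False: same password, different salt
```

The salt is not secret. It has to be stored alongside the encrypted data so that the same key can be derived again for decryption.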

While PBKDF2 is widely used and secure when properly configured, newer key derivation functions offer improved resistance to modern attack methods,
particularly those using specialized hardware like GPUs and ASICs (see also Argon2, scrypt, ...).
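
As a rough sketch of one such alternative, Python's hashlib also exposes scrypt (this assumes a Python build linked against OpenSSL 1.1 or newer). The cost parameters below are illustrative values, not a recommendation.

```python
import hashlib
import os

salt = os.urandom(16)
key = hashlib.scrypt(
    b"Example123",
    salt=salt,
    n=2**14,                   # CPU/memory cost factor
    r=8,                       # block size
    p=1,                       # parallelization factor
    maxmem=64 * 1024 * 1024,   # allow up to 64 MiB for the computation
    dklen=32,                  # 32 bytes = 256-bit key
)
print(len(key))   # 32
```

Unlike PBKDF2, scrypt deliberately needs a lot of memory per guess (about 16 MiB with these values), which makes massively parallel hardware less effective.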

AES modes (CBC, GCM)

In the example above, we used the Cipher Block Chaining Mode (CBC Mode) of AES.

AES can operate in different modes that determine how the encryption algorithm processes data blocks.
Each mode has different characteristics, security properties, and use cases.

CBC Mode

CBC (Cipher Block Chaining) mode encrypts data in fixed-size blocks (16 bytes for AES) where each block is XORed with the previous ciphertext block before encryption.
The first block uses an initialization vector (IV) instead of a previous ciphertext block.

Characteristics:

  • Requires an IV that should be random and unpredictable for each encryption
  • Each block depends on the previous one, so encryption is sequential (decryption can be parallelized)
  • Identical plaintext blocks produce different ciphertext when using different IVs
  • Padding is required when data doesn't align to block boundaries

Security considerations:

  • Vulnerable to padding oracle attacks if the implementation reveals whether the padding of a decrypted message was valid
  • The IV must be unpredictable. Using predictable IVs can leak information about the plaintext
  • Bit-flipping attacks are possible: flipping a bit in one ciphertext block flips the same bit in the next plaintext block (and garbles the current one)
  • An attacker can modify ciphertext without detection because CBC has no built-in integrity/authenticity check, such as a Message Authentication Code (MAC)
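
The chaining and the bit-flipping weakness can be tried out without an AES library. Big assumption: the sketch below swaps AES for a toy 16-byte block cipher (a small Feistel network built from SHA-256) that I made up for this purpose. Only the CBC logic around it is the point, and none of this is secure.

```python
import hashlib
import os

BLOCK = 16  # block size in bytes, same as AES

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _f(key: bytes, rnd: int, half: bytes) -> bytes:
    # Round function of the toy cipher (NOT AES): keyed SHA-256, cut to 8 bytes
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]

def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in range(4):
        left, right = right, xor(left, _f(key, rnd, right))
    return left + right

def toy_decrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _f(key, rnd, left)), left
    return left + right

def pad(data: bytes) -> bytes:
    n = BLOCK - len(data) % BLOCK          # PKCS#7-style padding, 1 to 16 bytes
    return data + bytes([n]) * n

def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    previous, out = iv, b""
    data = pad(plaintext)
    for i in range(0, len(data), BLOCK):
        # XOR with the previous ciphertext block (or the IV), then encrypt
        previous = toy_encrypt_block(key, xor(data[i:i + BLOCK], previous))
        out += previous
    return out

def cbc_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    previous, out = iv, b""
    for i in range(0, len(ciphertext), BLOCK):
        block = ciphertext[i:i + BLOCK]
        out += xor(toy_decrypt_block(key, block), previous)
        previous = block
    return out[:-out[-1]]                  # strip padding (not validated here)

key, iv = os.urandom(32), os.urandom(BLOCK)
message = b"0123456789abcdef" + b"amount=100 EUR.."   # two 16-byte blocks
ciphertext = cbc_encrypt(key, iv, message)
assert cbc_decrypt(key, iv, ciphertext) == message

# Bit-flipping: change byte 7 of ciphertext block 0 so "1" becomes "9" in block 1
tampered = bytearray(ciphertext)
tampered[7] ^= ord("1") ^ ord("9")
result = cbc_decrypt(key, iv, bytes(tampered))
print(result[16:])   # b'amount=900 EUR..'
```

The first plaintext block turns into garbage, but the second one now contains a precisely chosen change, and nothing in CBC itself notices.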

Galois/Counter Mode (GCM)

GCM is an authenticated encryption mode that combines counter (CTR) mode encryption with authentication based on Galois field multiplication (GHASH).
It's widely preferred for modern applications because it provides, in contrast to the CBC mode, both confidentiality and authenticity.

Characteristics:

  • Provides encryption and authentication in a single operation
  • Blocks can be encrypted/decrypted independently (Parallel processing)
  • Produces an authentication tag that verifies data integrity (GMAC / ICV)
  • Can authenticate additional data (AAD) without encrypting it
  • Uses a nonce (number used once, typically 96 bits) as its IV; it does not have to be unpredictable, but it must never repeat for the same key

Advantages:

  • Detects manipulation and prevents many attack vectors (GMAC / ICV)
  • Parallel processing makes it faster on modern hardware
  • Can authenticate additional unencrypted data alongside encrypted content (AAD)
  • Widely supported and recommended by security standards
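
What an authentication tag over ciphertext and AAD buys you can be sketched with the standard library. Important assumption: GCM computes its tag with GHASH, which Python's standard library does not offer, so this example uses HMAC-SHA-256 as a stand-in. The names and the sample data are invented.

```python
import hashlib
import hmac
import os

mac_key = os.urandom(32)

def make_tag(aad: bytes, ciphertext: bytes) -> bytes:
    """Tag over the length-prefixed AAD followed by the ciphertext."""
    message = len(aad).to_bytes(8, "big") + aad + ciphertext
    return hmac.new(mac_key, message, hashlib.sha256).digest()

def verify(aad: bytes, ciphertext: bytes, tag: bytes) -> bool:
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(make_tag(aad, ciphertext), tag)

aad = b"filename=report.pdf"    # sent in the clear, but still authenticated
ciphertext = os.urandom(32)     # placeholder for real encrypted bytes
tag = make_tag(aad, ciphertext)

flipped = bytes([ciphertext[0] ^ 1]) + ciphertext[1:]

print(verify(aad, ciphertext, tag))                     # True
print(verify(b"filename=other.pdf", ciphertext, tag))   # False: AAD changed
print(verify(aad, flipped, tag))                        # False: ciphertext changed
```

With a check like this in front of decryption, a bit-flipped message is rejected before any plaintext is produced.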

There are also other AES modes that I haven't looked into in detail yet (CTR, ECB, CFB, ... - see Wikipedia).

To sum up ...

Now that we understand the theoretical foundations, let's circle back to the original question:

How do tools like 7zip actually secure your data when password protection is used?

This is what happens behind the scenes:

Key Derivation:

  • For its own 7z format, 7zip derives the encryption key from your password with a function based on many rounds of SHA-256 rather than PBKDF2 (PBKDF2 is what the AES-encrypted ZIP format uses)
  • The format provides fields for a salt and a random IV, so that identical passwords and data do not produce identical archives
  • The number of rounds is set high enough to make brute-force attacks computationally expensive

Encryption Process:

  • AES-256 is used as the encryption algorithm for 7z archives (for ZIP archives, you can also choose AES-128 or AES-192)
  • The 7z format uses CBC mode
  • The data is compressed first and then encrypted with the derived key
  • File metadata (names, sizes, timestamps) can also be encrypted for additional privacy (the "Encrypt file names" option)

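Integrity:

  • 7zip stores CRC32 checksums, which reveal whether the decrypted data matches the original files (for example, after a wrong password or a damaged download)
  • A CRC is not a cryptographic authentication tag like the one GCM produces, so it protects against accidental corruption rather than deliberate manipulation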
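
The limits of a CRC32 checksum can be sketched in a few lines. This is a generic illustration of an unkeyed checksum, not a reconstruction of how 7zip stores or checks its CRCs. The function name matches_checksum and the sample data are made up.

```python
import zlib

def matches_checksum(data: bytes, crc: int) -> bool:
    """Return True if the CRC32 of the data equals the stored value."""
    return zlib.crc32(data) == crc

original = b"amount=100 EUR"
stored_crc = zlib.crc32(original)

# Accidental corruption (one damaged byte) is detected
corrupted = b"amount=100 EUQ"
print(matches_checksum(corrupted, stored_crc))   # False

# Deliberate change: no secret is needed to compute a matching checksum
forged = b"amount=900 EUR"
forged_crc = zlib.crc32(forged)
print(matches_checksum(forged, forged_crc))      # True
```

Anyone who is able to replace both the data and the checksum passes the test, which is the gap that a keyed MAC or an authenticated mode like GCM closes.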

If you are interested in this topic, try to experiment with a Cryptography/AES library (e.g., System.Security.Cryptography for .NET) for your programming language of choice.
By using those libraries, you will soon be able to encrypt files yourself and understand the parameters and concepts behind them better.

Thanks for reading!

by Philipp Meier, Aug 2024