LLM Data Privacy - Synthetic Data vs. Tokenization?

Evis Drenova

@evisdrenova

April 23rd, 2024

Introduction

AI data privacy is a hot topic these days and there are two emerging ways of protecting sensitive data when working with LLMs: synthetic data and tokenization.

In this blog, we'll cover both synthetic data and tokenization and their use-cases.

Let's jump in.

Synthetic Data

Synthetic data is getting more and more attention these days with the rise of AI/ML and LLMs. It's being used by most popular foundation models to train them such aws Microsoft's PHI-3 mode. Increasingly, more companies are using synthetic data for security and privacy reasons as well as to train models. We think of this as Synthetic Data Engineering. In the simplest definition, synthetic data is data that a machine has completely made up from scratch. For example, you can program a pseudo-random number generator (PRNG) to randomly select 5 numbers between 0 and 25. You can then use these randomly selected numbers as indexes in the alphabet to randomly select 5 letters. If you put those 5 letters together, you've created a synthetic string!

Obviously, this is a very simple example but the point remains. You can write programs to create data that "looks" just like real data. Additionally, using machine learning models such as generative adversarial networks or GANs, you can create synthetic data that has the same statistical characteristics as your real data. There are a number of synthetic data generators available that address different use cases. The key is to balance generation speed and accuracy. Some use cases such as analytics call for more accuracy (statistically) while other use cases, developer testing, call for more generation speed.

Synthetic Data Use cases

Synthetic data is being used by developers to build applications and machine learning engineers to train models. For developers, synthetic data is helpful:

Seeding Databases - hydrating a database with synthetic data for testing, debugging, feature development or demos
Staging Environments - many companies only create data to test happy paths within their applications and generally don't do a good job of catching edge cases. Creating synthetic data that models after production data is a great to way to catch more bugs before your push to production.
Training ML Models - high-quality data is one of the main bottlenecks for ML engineers in training models. Synthetic data can be used to augment an existing data set so that you have more data or to generate data from scratch to train a model. Generally, this synthetic data is in a tabular format while for developers, it's in a relational format.
Data Privacy Laws - many countries have data privacy laws to protect to PII from inappropriate use or access. Synthetic data can be used in place of production data to still give consumers of that data something to work with without the burden of security and privacy.

We're seeing new use cases come up for synthetic data all of the time as more attention and time is being spent on methods to generate higher-quality synthetic data for developers and ML engineers.

Tokenization

Tokenization is encryption's lesser known cousin! Tokenization is the process of converting a piece of data to another representation by using a look-up table of pre-generated tokens that have no relation to the original data. Another way to think about tokenization is to imagine a casino. When you go into a casino, you exchange cash for chips then use those chips in the casino and then when you're done, you swap the chips for cash again. Tokenization works in a similar way. You have some data that you give to a look up table, the look up table returns back a randomly generated token, then you can use that data until you're ready to swap it back for the original data.

The main difference between tokenization and encryption is that encryption is reversible from ciphertext -> plain-text while tokenization is not reversible. Meaning that if you have a token, there is no possible way to retrieve the original value without the look up table. No public or private key can reverse the data for you. This is why in some cases, tokenization can be more secure than encryption (at least theoretically). Encryption can be reversed with a key, tokenization cannot.

Lastly, similar to encryption, you can create different types of tokens that preserve the length, format and other characteristics of the input data. This is particularly useful for data processing such as lookups across databases.

Tokenization use cases

Tokenization is widely used to protect sensitive data in financial services and card networks and is starting to gain popularity in other use cases as well.

Card Networks - most card networks such as Visa, MasterCard, etc. tokenize your card number so that it's not floating around in clear text. Tokenization can happen at many different layers in the payment processing workflows and sometimes happens in multiple layers. The key is here is that the tokenizer (the party tokenizing the data), has access to the lookup table to convert the tokenized data to a card number that it can process.
Third party data sharing - tokenization can be used to protect data for third party data sharing. For example, you can tokenize certain columns in data set, such as name, age, email and then share that data set with an untrusted third party. They can process that data without being able to see the sensitive data, and then you can de-tokenize the sensitive data for your processing.
Securing PII data - tokenization can be a powerful tool for internal data tokenization to ensure that sensitive data is secure as it travels across your internal systems.

Compare and Contrast

Now that we've understood what encryption, tokenization and synthetic data are, let's look at the differences and similarities and better understand their use cases.

Feature	Synthetic Data	Tokenization
Input Format	Raw data from real datasets	Sensitive data (e.g., PII, payment info)
Output Format	New, non-real dataset mimicking original patterns	Non-sensitive token or identifier
Is Reversible	Not applicable (generates new data)	Conditional (secure lookup required)
Generation Method	Data modeling and algorithms	Mapping to unique tokens
Use-Cases	Seeding databases, bug bashing, machine learning training	Payment processing, data anonymization
Data Utility	High (preserves statistical properties)	Low to medium (depends on tokenization scheme)
Risk of Data Exposure	Low (no direct link to real data)	Low (tokens are not meaningful)
Regulatory Compliance	Can aid in compliance by avoiding use of real data	Often used for PCI-DSS, GDPR compliance

LLM data privacy

One of the main use-cases that we see with regards to data privacy and LLMs is the ability to anonymize or generate synthetic data for training or fine-tuning use-cases. In those situations a machine learning engineer can use either synthetic data or tokenization. What we know today is that data that is used to train LLMs is tokenized and vectorized before it's trained. This results in a high-dimensional graph where distances between tokens indict similarity.

For example, "queen" and "king" will be closer in distance in the graph than "queen" and "269745d2-1e30-4f98-beb3-3cace04769d2" or "queen" and "iosjdf09aufjw2". Why? Because semantically, queen and king have some relation while queen and UUID or queen and random string do not.

So what is important here is that we don't destroy the semantic meaning of the data that we want to anonymize/generate synthetic data for since that's how the model categorizes data!

Let's take a simple scenario. If I have a database with 3 columns and 3 rows of data that I want to protect as I train my model, it might look something like this:

email	first_name	last_name	age	height
john@doe.com	chris	doe	23	68
bill@frank.com	bill	frank	18	63
chris@johnson.com	chris	johnson	39	72

Let's take a look at what this looks like with synthetic data and with tokenization.

Synthetic data

Given that our goal is to protect sensitive data without losing the semantic meaning of that data, we can create synthetic data that looks just like it but is synthetic! Here's what it would look like:

email	first_name	last_name	age	height
evan@guill.com	evan	guill	28	65
joe@storm.com	joe	storm	19	64
larry@resan.com	larry	resan	36	71

We've updated our PII to generate new emails, first names and last names and then for the age and height columns we've generated a random value within a +/- range of 3. This data is brand new and does not leak any of our existing sensitive data. Semantically, it still makes sense. We've introduced a little more noise in the age and height columns but we have total control over that and how we update it.

That's the power of synthetic data. You're able to generate net new data that looks like our sensitive data and maintains generally the same semantic meaning but is privacy-safe.

Tokenization

Looking at tokenization, there are several different types of tokenization. Length-preserving and format-preserving tokenization are powerful features that mimic the format and length of the original data but the data itself is not retained. Let's see:

email	first_name	last_name	age	height
ewaqewe@wfeweas.fewfe	ioqw	adfce	91	19
8i778@qewf.wfe	qwd	eafs	23	45
7kj7k8@wdqoij.jwfeoi	qpzme	ewaw	12	01

Tokenization can still anonymize sensitive data however you will likely lose the semantic meaning of the data since the tokens it generates are typically nonsense. For some use-cases this might be fine, but every bit of noise you add to the model, you make the model less accurate. So there is a trade-off and tokenization tends to add more noise than synthetic data.

Wrapping up

Tokenization and synthetic data have similarities but also have key differences that are important for engineering and security teams to understand. As LLMs become more widely used, it's important that developers and machine learning engineers understand the tools they have available and when/how to use them. Generally, when interacting with machine learning models that care about semantic meaning, synthetic data is a great choice.