The Data Twin Dilemma: You Fed it What?

Nick Pollard : Jun 25, 2025 5:21:50 PM

How can you trust your Digital Twin if the data feeding it is unreliable?

I’ve used discovery technology for years to uncover what organisations didn’t want to see. Legacy PST files, forgotten file shares, redundant records held long past their legal expiry. Data that was duplicated, misplaced, misclassified, or flat-out misleading. It’s never clean, and frankly, it’s rarely trusted.

Recently, I have been involved in preparing data for a Digital Twin. Knowing what sort of data risk (mess) sits in unstructured data made me really nervous that the project was flawed from the start.

Digital Twins look impressive. Animated, responsive, interactive. They simulate storms, market crashes, disaster recovery, supply chain delays, you name it! They can produce dashboards, forecasts, and perhaps even some regulatory comfort. But they only work if the data feeding them is clean, current, and complete. Most of the time, let’s face it, it’s none of those things.

The simulation is built on ROT and shadows

This isn’t theory. I’ve spent most of my career in enterprise data estates, and I know the truth: around 30% of what’s sitting in unstructured storage is ROT: Redundant, Outdated, or Trivial. A further 50% or more is dark, i.e. no one knows what it is, who owns it, or why it’s there.

That means around 80% of your Digital Twin’s inputs could be data junk. Data you don’t need. Data you haven’t reviewed. Data that introduces noise, bias, risk, and cost.

So how exactly do you expect the Twin to give you the truth, when most of what it’s learning from is either irrelevant or unknown?

Six brutal questions your twin can’t answer

Before you put your trust in a Twin, ask:

Where did this data come from? If your answer involves a shared drive and a shrug, you don’t have provenance. You have a risk.
Is this data still alive? Real-time simulation needs real-time inputs. If your feeds are cold, your predictions are frozen in the past.
Who owns this data? If you don’t know who maintains it, you don’t know who’s responsible when it goes wrong.
What’s missing? Data doesn’t warn you when it’s incomplete. But your model will assume it’s the full picture.
Duplicates? Your storage may well do block-level deduplication, but that doesn’t stop logical duplication: the same document saved under different names in different places. How much bloat are you about to ingest?
Why do we trust this? Because the dashboard is pretty? Or because the data has been audited, validated, and classified?
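
Several of these questions can be turned into simple pre-ingestion checks. The following is a minimal Python sketch, not how Lightning IQ or any particular product works. The `FileRecord` fields, the `audit` function, the flag names, and the one-year staleness threshold are all assumptions made for illustration. It flags records with no known source, no known owner, an old modification date, or content identical to a file already seen (logical duplication, detected with a SHA-256 content hash).

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical metadata for one file considered for ingestion."""
    path: str
    content: bytes
    modified: datetime
    source: Optional[str] = None  # where the data came from (provenance)
    owner: Optional[str] = None   # who maintains it


def audit(records, now, max_age=timedelta(days=365)):
    """Return {path: [flags]} for every record that fails a basic check.

    max_age is an illustrative threshold, not a recommendation.
    """
    flagged = {}
    first_seen = {}  # content hash -> path of the first file with that content
    for rec in records:
        issues = []
        if not rec.source:
            issues.append("no-provenance")
        if not rec.owner:
            issues.append("no-owner")
        if now - rec.modified > max_age:
            issues.append("stale")
        # Identical bytes under a different path is a logical duplicate,
        # regardless of what the storage layer deduplicates at block level.
        digest = hashlib.sha256(rec.content).hexdigest()
        if digest in first_seen:
            issues.append("duplicate-of:" + first_seen[digest])
        else:
            first_seen[digest] = rec.path
        if issues:
            flagged[rec.path] = issues
    return flagged


# Made-up sample data, purely to show the shape of the result.
now = datetime(2025, 6, 25, tzinfo=timezone.utc)
sample = [
    FileRecord("finance/q1.csv", b"a,b\n1,2\n", now - timedelta(days=10), "erp", "finance"),
    FileRecord("shared/q1 (copy).csv", b"a,b\n1,2\n", now - timedelta(days=10), "erp", "finance"),
    FileRecord("archive/old.pst", b"legacy mail", now - timedelta(days=2000)),
]
report = audit(sample, now)
for path, issues in report.items():
    print(path, issues)
```

A sketch like this says nothing about what is missing or whether the content is accurate. Those questions need classification and review, not just metadata checks.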

Lightning IQ – Understand and de-risk data at scale

Lightning IQ doesn’t build the Twin. It makes sure the data going into it is actually fit for purpose.

We scan unstructured data at scale, petabytes of it. We classify, deduplicate, trace lineage, detect ROT, flag sensitive content, and expose blind spots. We tell you what should never be allowed into your planning models. And we do it fast enough that your project doesn’t stall while we audit.

In short, we help you earn the right to trust your Twin.

By doing this, you essentially make the Digital Twin the superior data set. Clean, clear, data-aware. So perhaps now is the time to turn your attention back to the enterprise data set. What are you missing? What could you fix now, not just simulate later?

Before you ask what the twin can predict, ask this:

What makes you so sure it deserves your trust? Can you prove that every dataset it consumes is accurate, current, complete, secure, and legally compliant, with traceability back to source?

Because if you can’t answer that, your Twin isn’t modelling the future. It’s recycling the past.

Nick Pollard is Managing Director (EMEA) for One Discovery. He is a seasoned leader with more than 20 years of experience working in real-time investigation, legal and compliance workflows across highly regulated environments like banking, energy and healthcare, as well as national security organisations. You can contact him at nick.pollard AT onediscovery.com