Most data tools infer how to read your data. Here's why that's the root of every pipeline corruption I've seen.

By: Victoire Habamungu

I was building a data conversion pipeline that took exports from different systems and loaded them into a single MySQL database.

Somewhere in the middle of debugging yet another end-of-pipeline failure, I realized something that changed how I think about data systems entirely: if we had a single schema contract at the entry point, around 95% of the issues we were chasing downstream would never exist. Not fixed later. Never created in the first place.

That's when I understood normalization wasn't just a step in the pipeline but a mandatory stage that everything else depended on. And every existing tool I found was getting it wrong in the same way.

Inference is not a feature. It is a liability.

Every time a tool looks at your data and decides what it means without being told, it is making a bet. And in data pipelines, wrong bets don't throw errors. They produce plausible-looking wrong output that travels silently downstream until something expensive breaks.

Here is what that actually looks like

Take a column called amount in a CSV export from a financial system. Here is what it actually contains:

amount
1,200.50
1.200,50
"1,200"
1200
N/A

Five rows. Five different representations of what might be the same value type, or might not be. One uses a comma as a thousands separator. One uses a comma as a decimal separator. One is quoted. One is a plain integer. One is a null disguised as a string.

An inference-based tool scans this column, picks the pattern that fits the majority of values (often the majority of a sample, which may not be the majority of the column, a problem of its own), and processes the file. It does not ask. It does not stop. It produces output.
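To make the sampling problem concrete, here is a minimal sketch of majority-vote inference. The function name, the two regular expressions, and the sample column are all assumptions made for illustration; real tools use more elaborate heuristics, but the failure mode is the same.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only.
PATTERNS = {
    "anglo": re.compile(r"^\d{1,3}(,\d{3})*(\.\d+)?$"),     # 1,200.50
    "european": re.compile(r"^\d{1,3}(\.\d{3})*(,\d+)?$"),  # 1.200,50
}

def infer_convention(values, sample_size):
    """Vote on a separator convention using only the first `sample_size` values."""
    votes = Counter()
    for raw in values[:sample_size]:
        for name, pattern in PATTERNS.items():
            if pattern.match(raw.strip('"')):
                votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None

column = ["1,200.50", "3,400.00", "99.95",
          "1.200,50", "2.300,75", "4.100,10", "7.800,00"]

from_sample = infer_convention(column, 3)            # sees only the first three rows
from_all = infer_convention(column, len(column))     # sees the whole column
```

With this column, the three-row sample votes for one convention while the full column votes for the other, and neither call raises anything.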

What ends up in your database depends entirely on which pattern it chooses. If it picks the Anglo-Saxon convention, 1.200,50 becomes 1.2 (or 1.2005, if the parser strips the comma instead of stopping at it). That is not an error and not a null, just a silently wrong number. If it picks the European convention, 1,200.50 becomes 1.2005, or a string, or a null, or something else, depending on how it resolves separator priority. Also wrong, also silent.
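The arithmetic is easy to reproduce. The sketch below uses a hypothetical `parse_as` helper that assumes one fixed convention and strips the thousands separator; the exact wrong value a given tool produces depends on its parser (one that stops at the first unexpected character would return 1.2 instead of 1.2005).

```python
def parse_as(raw, thousands, decimal):
    """Parse `raw` under one fixed separator convention, without checking it fits."""
    cleaned = raw.strip().strip('"').replace(thousands, "").replace(decimal, ".")
    return float(cleaned)

# European value read with the Anglo-Saxon convention.
wrong_anglo = parse_as("1.200,50", thousands=",", decimal=".")

# Anglo-Saxon value read with the European convention.
wrong_european = parse_as("1,200.50", thousands=".", decimal=",")

# The same European value read with the convention it was written in.
right = parse_as("1.200,50", thousands=".", decimal=",")

print(wrong_anglo, wrong_european, right)  # 1.2005 1.2005 1200.5
```

Both wrong results are valid floats, so a type check downstream has nothing to object to.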

Now that value is in your migration. It passed validation because it's a valid float. It passed your pipeline because nothing threw an exception. It sits in your database looking exactly like correct data.

Three months later, someone runs a reconciliation report. The numbers don't match. Nobody knows why. The data has been there for three months. Every downstream process that touched it produced slightly wrong results. The corruption didn't happen at the query; it happened at the import, silently, because a tool made a guess. Think about what that has cost by then.

Why every existing tool still does this

Because inference works on clean data. And most demos use clean data.

When your columns are consistent, your types are homogeneous, and your source is predictable, inference is fast, and it mostly gets it right. The problem is that most systems actually produce data that is none of those things.

The deeper problem is that you don't control the source. You don't know what schema you'll receive tomorrow. A column that held integers last week holds mixed types this week. The source changed without telling you. And your tool, inferring from what it sees today, has no way to know what changed or why.
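A small sketch of that drift, using a deliberately naive `infer_type` function and two made-up batches (both are assumptions for illustration, not any particular tool's behavior):

```python
def infer_type(values):
    """Naive inference: the narrowest type every value fits, falling back to str."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"

last_week = ["10", "25", "7"]
this_week = ["10", "25.5", "n/a"]

print(infer_type(last_week))  # int
print(infer_type(this_week))  # str
```

The inferred type changes from one batch to the next and nothing fails. The tool only reports what it sees today; it has no record of what the column was supposed to be.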

Covering infinite patterns is the wrong solution. You cannot enumerate every way data can be wrong. You can only eliminate the guessing.

The only way out

Put a human in the middle. Before the pipeline processes a single row, someone has to tell the system what the data means: not what it looks like, but what it means. What type does the amount column actually hold? What decimal convention does this source use? If conventions are mixed, how should conflicts be resolved? What should happen when a value is N/A: store a null, store a zero, or reject the row entirely?

This is not manual cleaning. Manual cleaning is reacting to mess after the fact, row by row, with formulas that break the moment the data changes. This is a contract defined once, applied consistently, that makes the system's behavior explicit and predictable regardless of what arrives.
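As a rough sketch of what such a contract could look like for the amount column: the `AmountContract` class, its field names, and `apply_contract` below are hypothetical and are not Normalize's actual format. The point is only that every field is a decision a person made, and that values which don't fit the declared convention are refused rather than reinterpreted.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

@dataclass(frozen=True)
class AmountContract:
    """Hypothetical contract for one column; every field is a human decision."""
    thousands_sep: str = ","
    decimal_sep: str = "."
    null_tokens: frozenset = frozenset({"", "N/A"})
    on_null: str = "null"  # "null", "zero", or "reject"

def _digits(s: str) -> bool:
    return s.isascii() and s.isdigit()

def apply_contract(raw: str, contract: AmountContract) -> Optional[Decimal]:
    value = raw.strip().strip('"')
    if value in contract.null_tokens:
        if contract.on_null == "reject":
            raise ValueError(f"null token {raw!r} rejected by contract")
        return Decimal(0) if contract.on_null == "zero" else None
    whole, sep, frac = value.partition(contract.decimal_sep)
    groups = whole.split(contract.thousands_sep)
    # Strict shape check: digits only, groups of three after the first group.
    ok = (
        _digits(groups[0])
        and (len(groups) == 1 or len(groups[0]) <= 3)
        and all(len(g) == 3 and _digits(g) for g in groups[1:])
        and (not sep or _digits(frac))
    )
    if not ok:
        raise ValueError(f"{raw!r} violates the declared convention")
    return Decimal("".join(groups) + ("." + frac if sep else ""))

contract = AmountContract()  # this source was declared Anglo-Saxon
parsed = [apply_contract(v, contract) for v in ["1,200.50", '"1,200"', "1200", "N/A"]]
```

Under this declared convention, 1.200,50 raises a `ValueError` instead of becoming a plausible float. The mismatch surfaces at the entry point, where someone can decide what to do with it.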

The cost of that contract is one confirmation step per ingestion. The cost of skipping it is data you cannot trust.

What AI produced and what I changed

When I started building Normalize, I used AI to help think through the schema detection approach. It suggested locale-aware inference: essentially, smarter guessing. Better pattern recognition, more sophisticated type detection, and enhanced heuristics to handle ambiguous values. Everything in that suggestion was designed to make inference more accurate.

It never questioned whether inference was the right approach in the first place.

That's the gap. AI optimized the wrong thing. The problem was never that inference wasn't smart enough; it was that inference is structurally wrong for unpredictable data sources, regardless of how smart it gets. A locale-aware tool would still guess that 1.200,50 is European notation. It would just guess with more confidence.

I rejected the entire direction and replaced it with the confirmation step. Not smarter guessing, just no guessing at all. That's the decision AI couldn't make because it was solving the problem I gave it, not questioning whether it was the right problem to solve.

That's why I built Normalize.

Every tool I evaluated was inferring. Some were smarter about it than others. None of them solved the fundamental problem that inference is a guess, and guesses corrupt data.

Normalize introduces a mandatory hold step. Data comes in, Normalize infers the structure and formats and presents them as suggestions, and the pipeline stops. You review its understanding of your data, correct what it got wrong, add what it could not see, and tell it how the data should be interpreted and how it should come out. Nothing moves until that contract exists. Normalize never acts on a guess.
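The shape of that hold step can be sketched in a few lines. This is an illustration of the idea only: `HoldGate`, `ContractNotConfirmed`, and the dictionary keys are hypothetical names, not Normalize's API.

```python
class ContractNotConfirmed(Exception):
    """Raised when rows are requested before a human has confirmed the contract."""

class HoldGate:
    def __init__(self, suggestion):
        self.suggestion = dict(suggestion)  # what the tool inferred; advisory only
        self.contract = None                # stays None until a human confirms

    def confirm(self, corrections=None):
        """A human accepts the suggestion, optionally overriding parts of it."""
        self.contract = {**self.suggestion, **(corrections or {})}
        return self.contract

    def process(self, rows, transform):
        if self.contract is None:
            raise ContractNotConfirmed("review and confirm the contract first")
        return [transform(row, self.contract) for row in rows]

gate = HoldGate({"decimal_sep": ".", "thousands_sep": ","})

# The reviewer knows this source is European and corrects the suggestion.
gate.confirm({"decimal_sep": ",", "thousands_sep": "."})
```

Inference still runs here, but only to produce a suggestion. The only path to `process` goes through `confirm`, so the interpretation that reaches the data is always one a person signed off on.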

It's open source. If you want to run it locally, integrate it into your own pipeline, or use it as the normalization engine inside your own system, the core is available on GitHub. If you want to use it directly without setup, it's live at normalizeonline.com. CLI support is available for teams who want to plug it into existing workflows without touching the UI, or who want to run it on local files.

If you work with data whose source you don't control, it was built for exactly that.

Victoire Habamungu

Software engineer specialising in data systems, distributed architecture and platform engineering.
