Building an AI-Powered Data Quality Profiler (That Runs Locally)

“Profiling data, explaining problems, and suggesting fixes: all offline, all in code.”

Posted Jul 30, 2025

By Sneha Shrivastav

📌 Why I Built This

Data quality checks are often manual and tedious. While tools like Great Expectations help define rules, I wanted something more exploratory — something that would:

- Profile any CSV quickly
- Explain issues in plain English
- Suggest meaningful fixes
- Run completely offline, using small open LLMs

That’s how this project started.

⚙️ What It Does

This tool takes a CSV file and:

1. Profiles it using ydata-profiling
2. Extracts a data summary (missing values, types, etc.)
3. Uses a local LLM (via Ollama) to:
   - Explain data quality issues
   - Suggest potential cleaning strategies
4. Embeds that explanation into the final profiling report (HTML)

No cloud, no OpenAI key, and no dependency on large models that need a GPU.

📁 Project Structure

  
ai-data-quality-profiler/
├── app/
│   ├── main.py                  # CLI entry point
│   ├── dq_profiler.py           # Handles profiling & report update
│   └── ai_explainer.py          # Sends summary to local LLM
├── data/                        # Input datasets
├── reports/                     # Raw profiling reports
├── docs/                        # GitHub Pages reports (with AI insights)
├── requirements.txt
└── README.md

🛠️ How It Works

1. Data Profiling

  
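import pandas as pd
from ydata_profiling import ProfileReport

# Load the input CSV (placeholder path; point this at a file under data/)
df = pd.read_csv("data/your_dataset.csv")

report = ProfileReport(df, title="Telco Data Quality Profile", explorative=True)
report.to_file("reports/profile.html")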

2. Summarize for the LLM

  
summary = extract_summary_for_llm(df)
# Output like:
# Column: TotalCharges, Type: object, Missing: 11, Unique: 6530
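
The post does not show the body of extract_summary_for_llm, so here is a minimal sketch of what such a helper might look like, assuming it only needs the four fields in the sample output (name, dtype, missing count, unique count). The demo frame is made up, and the real function in the repo may compute more or format things differently.

```python
import pandas as pd


def extract_summary_for_llm(df: pd.DataFrame) -> str:
    """Return one line per column: name, dtype, missing count, unique count."""
    lines = []
    for col in df.columns:
        series = df[col]
        lines.append(
            f"Column: {col}, Type: {series.dtype}, "
            f"Missing: {int(series.isna().sum())}, "
            f"Unique: {int(series.nunique())}"
        )
    return "\n".join(lines)


# Tiny made-up frame, only to show the output shape
demo = pd.DataFrame({
    "TotalCharges": ["29.85", None, "108.15"],
    "PhoneService": ["Yes", "Yes", "Yes"],
})
print(extract_summary_for_llm(demo))
```

Keeping the summary to one short line per column keeps the prompt small, which matters for a model as small as tinyllama.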

3. Ask a Local Model for Insight

  
import ollama

# prompt: instructions plus the column summary from step 2
response = ollama.chat(
    model="tinyllama",
    messages=[{"role": "user", "content": prompt}]
)
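
The snippet above leaves out how prompt is built and how the reply is read back. A rough sketch of both, assuming a simple instruction-plus-summary prompt: build_prompt, explain_with_local_llm, and the prompt wording are illustrative names and text, not taken from the repo. The reply text is read from response["message"]["content"], which is where the ollama Python client puts it.

```python
def build_prompt(summary: str) -> str:
    """Wrap the column summary in instructions for the model (wording is illustrative)."""
    return (
        "You are a data quality assistant.\n"
        "Given this column summary, list the data quality issues in plain "
        "English, then suggest a cleaning step for each.\n\n"
        f"{summary}"
    )


def explain_with_local_llm(summary: str, model: str = "tinyllama") -> str:
    """Send the prompt to a local Ollama server and return the reply text."""
    import ollama  # imported lazily so build_prompt works without Ollama installed

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": build_prompt(summary)}],
    )
    return response["message"]["content"]


prompt = build_prompt("Column: TotalCharges, Type: object, Missing: 11, Unique: 6530")
print(prompt)
```

Calling explain_with_local_llm(summary) needs the Ollama server running and the model already downloaded; build_prompt can be inspected without either.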

4. Embed Results into HTML Report

  
# Uses BeautifulSoup to inject AI output (soup is the parsed report HTML)
ai_div = soup.new_tag("div")
ai_div.string = "AI Results"
soup.body.append(ai_div)
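
For a fuller picture of the injection step, here is a sketch that parses the report, builds a small section from the model's text, and appends it to the body. The function name embed_ai_insights, the ai-insights id, and the heading text are assumptions for illustration, and the sketch assumes the HTML has a body element, which a generated profiling report does.

```python
from bs4 import BeautifulSoup


def embed_ai_insights(html: str, ai_text: str) -> str:
    """Append an 'AI-Generated Analysis' section to the end of <body>."""
    soup = BeautifulSoup(html, "html.parser")

    section = soup.new_tag("div", id="ai-insights")
    heading = soup.new_tag("h2")
    heading.string = "AI-Generated Analysis"
    section.append(heading)

    # One <p> per non-empty line; .string escapes any HTML in the model output
    for line in ai_text.splitlines():
        if line.strip():
            para = soup.new_tag("p")
            para.string = line.strip()
            section.append(para)

    soup.body.append(section)
    return str(soup)


demo_html = "<html><body><h1>Profile</h1></body></html>"
result = embed_ai_insights(
    demo_html, "TotalCharges is stored as <text>.\n\nConvert it to float."
)
print(result)
```

In the tool itself, the input would presumably be the contents of reports/profile.html and the result would be written to docs/index.html. Setting text through .string rather than concatenating HTML means stray angle brackets in the model's reply cannot break the report markup.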

📊 Sample Output

At the bottom of the HTML profiling report, you’ll see:

AI-Generated Analysis: TotalCharges has missing values and is stored as a string. PhoneService has low variance…

Suggested Fixes: Convert TotalCharges to float, fill missing with median, consider dropping low variance columns…
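
The tool only suggests these fixes; applying them is still up to you. As a hedged illustration, this is roughly how the suggestions above translate to pandas. The rows are made up, and "low variance" is read here as "a single distinct value", which is only one possible interpretation.

```python
import pandas as pd

# Made-up rows standing in for the Telco columns mentioned above
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15", "1889.5"],
    "PhoneService": ["Yes", "Yes", "Yes", "Yes"],
    "tenure": [1, 0, 2, 34],
})

# Convert to float; unparseable values such as " " become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Fill the gaps with the column median
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# Drop columns with a single distinct value (one reading of "low variance")
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)

print(df)
```

Whether a median fill or a column drop is appropriate depends on the dataset, so treat the model's suggestions as a starting point to review rather than steps to run blindly.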

💻 How to Run

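ollama run tinyllama         # downloads the model on first use, then opens a chat prompt; type /bye to exit
python app/main.py
xdg-open docs/index.html     # Linux; on macOS use `open` instead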

Everything runs locally. The HTML report opens in your browser, fully self-contained.

🌱 What’s Next

- Auto-apply suggested fixes
- Compare cleaned vs. raw datasets
- Build a Streamlit-based UI for easier use
- Add a Great Expectations validation layer

🔗 Repo

Check it out here: github.com/sneha-dq/ai-data-quality-profiler

This post is licensed under CC BY 4.0 by the author.