Friday, April 17, 2026

AI Training Under India’s DPDP Act: How Companies Can Use Internal and Third‑Party Data Responsibly

The Digital Personal Data Protection Act, 2023 (DPDP Act) has quietly but decisively redrawn the boundaries for how companies in India can train artificial intelligence systems. Although the Act never uses the phrase “AI training,” its consent‑centric, purpose‑limited architecture applies squarely to every stage of model development. For any organisation building or deploying AI—whether a startup, a platform, or a digital public infrastructure (DPI) network—the DPDP Act is now the governing grammar.

This article unpacks what the law permits, what it restricts, and how companies can navigate internal and third‑party data use for AI training without falling into compliance traps.

1. The DPDP Principles That Directly Shape AI Training

Under the Act, AI training is simply another form of “processing.” That means the full suite of obligations applies:

  • Consent or valid legal basis is required for processing personal data
  • Purpose limitation: data may be used only for the purpose stated at collection
  • Data minimisation: only necessary data may be processed
  • Accuracy and storage limitation
  • Accountability of the Data Fiduciary (the company determining purpose and means)

These principles are not abstract—they determine whether a dataset can legally be fed into a model.

2. Using Internal (First‑Party) Data for AI Training

When it is generally permissible

Companies may train models on their own collected data if:

(a) The stated purpose includes AI or service improvement

If the privacy notice explicitly states:
“We use your data to improve our services, including training machine learning models,”
then AI training falls within the consented purpose.
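
One way to operationalise this is to gate the training pipeline on the purposes recorded at the point of collection. Below is a minimal Python sketch; the Record structure and the "model_training" purpose label are illustrative assumptions drawn from a hypothetical privacy notice, not terminology mandated by the Act.

```python
from dataclasses import dataclass, field

# Hypothetical purpose label; the real taxonomy comes from your privacy notice.
MODEL_TRAINING = "model_training"

@dataclass
class Record:
    user_id: str
    payload: dict
    consent_purposes: set[str] = field(default_factory=set)  # purposes disclosed at collection

def eligible_for_training(record: Record) -> bool:
    """A record enters the training set only if the user consented to a
    purpose that covers model training."""
    return MODEL_TRAINING in record.consent_purposes

def build_training_set(records: list[Record]) -> list[dict]:
    # Enforce purpose limitation at the pipeline boundary, not after the fact.
    return [r.payload for r in records if eligible_for_training(r)]
```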

(b) The Act’s “legitimate uses” apply

Certain activities do not require explicit consent, such as:

  • Fraud detection
  • Network and information security
  • Compliance with legal obligations

AI models trained narrowly for these functions may rely on these legitimate uses.

Where companies face compliance risk

1. Purpose creep

Data collected for a ride booking or payment transaction cannot later be used to train a general‑purpose AI model unless the user was informed upfront.

2. Excessive or irrelevant data usage

Training on full chat logs, behavioural histories, or sensitive attributes may violate data minimisation.

3. High‑risk categories

While DPDP does not formally define “sensitive personal data,” misuse of health data, children’s data, or financial information attracts heightened scrutiny.

Best practices for internal data

  • Anonymise or de‑identify data before training
  • Use aggregated datasets wherever possible
  • Maintain data lineage and audit trails (a minimal sketch follows this list)
  • Conduct Data Protection Impact Assessments (DPIAs) for high‑risk models
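
As one way to implement the lineage bullet above, each training run can append an auditable record of exactly which dataset snapshot, legal basis, and transformation it consumed. A minimal sketch with hypothetical field names:

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class LineageEntry:
    dataset_id: str        # internal identifier for the source dataset
    dataset_sha256: str    # fingerprint of the exact snapshot used
    legal_basis: str       # e.g. "consent" or a documented legitimate use
    transformation: str    # e.g. "pseudonymised-v2", "aggregated-daily"
    training_run_id: str
    recorded_at: str

def fingerprint(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()

def log_lineage(entry: LineageEntry, path: str = "lineage.jsonl") -> None:
    # Append-only log, so auditors can trace any model back to its inputs.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(entry)) + "\n")

log_lineage(LineageEntry(
    dataset_id="rides-2026-q1",
    dataset_sha256=fingerprint(b"...dataset snapshot bytes..."),
    legal_basis="consent",
    transformation="pseudonymised-v2",
    training_run_id="run-0042",
    recorded_at=datetime.now(timezone.utc).isoformat(),
))
```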

3. Using Third‑Party Data for AI Training

This is where the compliance terrain becomes significantly more complex.

Scenario A: Purchased datasets

Permissible only if:

  • The vendor collected data lawfully
  • Data‑sharing agreements impose DPDP‑compliant obligations
  • The purpose aligns with the original consent

Risk: If the vendor scraped or collected data unlawfully, the purchasing company, as the Data Fiduciary, remains liable.

Scenario B: Web‑scraped or publicly available data

A common misconception is that “public” equals “free to use.” Under DPDP:

  • Public availability does not override purpose limitation
  • Mass scraping for general‑purpose AI training is legally risky
  • Individuals retain rights over their personal data even when posted online

Scenario C: Platform APIs and social media data

Companies must ensure:

  • Platform terms explicitly permit AI training
  • Users originally consented to downstream processing

Otherwise, the company risks violating both DPDP and platform contracts.

4. Anonymisation as a Strategic Pathway

The DPDP Act applies only to personal data.
If companies can reliably convert data into:

  • anonymised datasets, or
  • synthetic data

then the Act no longer applies.
However:

  • Weak anonymisation can lead to re‑identification
  • If data can be linked back to an individual, it remains regulated

Anonymisation must be robust, irreversible, and documented.
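
A common (though by itself insufficient) robustness test is k‑anonymity: every combination of quasi‑identifiers in the released dataset should be shared by at least k individuals. A minimal sketch, with illustrative column names:

```python
from collections import Counter

def satisfies_k_anonymity(rows: list[dict], quasi_identifiers: list[str], k: int = 5) -> bool:
    """True if every combination of quasi-identifier values is shared by
    at least k rows; failing groups need suppression or generalisation."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"age_band": "30-39", "pin_prefix": "560", "spend_band": "mid"},
    {"age_band": "30-39", "pin_prefix": "560", "spend_band": "high"},
]
print(satisfies_k_anonymity(rows, ["age_band", "pin_prefix"], k=2))  # True
```

In practice this is one layer among several; stronger guarantees such as differential privacy, plus documented re‑identification testing, may be needed for high‑risk datasets.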

5. Data Fiduciary vs Data Processor Responsibilities

  • The company training the model is typically the Data Fiduciary
  • Cloud providers or AI vendors are Data Processors

Fiduciaries must ensure processors follow:

  • Purpose limitation
  • Security safeguards
  • Deletion and retention obligations

Contracts must reflect these duties.

6. Cross‑Border AI Training

The Act allows cross‑border transfers except to restricted jurisdictions (to be notified).
This means:

  • Training on global cloud infrastructure is permissible
  • Companies must ensure no onward processing beyond the stated purpose

Cross‑border AI workflows must be mapped and contractually controlled.
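
Because the Act takes a negative‑list approach, a transfer gate can default to allowing transfers and block only notified jurisdictions, while still enforcing purpose limitation on any onward processing. A minimal sketch; the restricted list is an empty placeholder until jurisdictions are actually notified, and the purpose labels are illustrative:

```python
# Placeholder: to be populated once the government notifies restricted jurisdictions.
RESTRICTED_JURISDICTIONS: set[str] = set()

def transfer_permitted(destination: str, stated_purposes: set[str], transfer_purpose: str) -> bool:
    """Default-allow gate: block only notified jurisdictions, and never permit
    onward processing beyond the purposes stated at collection."""
    if destination in RESTRICTED_JURISDICTIONS:
        return False
    return transfer_purpose in stated_purposes

print(transfer_permitted("DE", {"model_training"}, "model_training"))    # True
print(transfer_permitted("DE", {"service_delivery"}, "model_training"))  # False
```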

7. Children’s Data and High‑Risk AI

DPDP imposes strict obligations:

  • Verifiable parental consent before processing the personal data of anyone under 18
  • Prohibition on tracking, behavioural monitoring, and targeted advertising directed at children

Using children’s data for AI training is therefore high‑risk and often impractical.

8. Enforcement and Penalties

Non‑compliance can trigger:

  • Monetary penalties up to ₹250 crore per breach category
  • Orders to cease processing
  • Reputational damage and loss of user trust

AI training pipelines must be designed with compliance baked in, not bolted on.

9. Strategic Takeaway for Companies

Safe Zone

  • Consent‑backed first‑party data
  • Anonymised or synthetic datasets
  • Narrow, purpose‑specific models

Risk Zone

  • General‑purpose AI trained on user data without disclosure
  • Scraped datasets
  • Purchased data with unclear provenance

A simple rule of thumb:
If a user would not reasonably expect their data to train your AI model, you likely need fresh consent.

10. Implications for Open Networks like ONDC and ION

Open networks introduce a unique governance challenge: data flows across thousands of participants, each acting as a Data Fiduciary for its own users while relying on shared protocols and registries. Under the DPDP Act, this creates three structural implications:

1. Mission‑locked data boundaries become essential

ONDC and ION cannot allow network‑level data to be repurposed for AI training unless:

  • the purpose is explicitly disclosed,
  • each participant has obtained valid consent, and
  • the network’s governance framework authorises such use.

Without this, any attempt to build network‑wide AI models risks violating purpose limitation.

2. Distributed compliance means distributed liability

If one participant misuses data for AI training, the liability does not remain isolated.
Network operators must therefore:

  • enforce strict data‑use covenants,
  • mandate provenance checks for third‑party datasets (see the sketch below), and
  • require auditable anonymisation standards.

This is not optional; it is existential for trust.
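
As one way to implement the provenance‑check covenant above, a network registry can bind every third‑party dataset to a content fingerprint and a declared legal basis, and refuse unregistered or tampered data at training time. A minimal sketch with a hypothetical in‑memory registry:

```python
import hashlib

def fingerprint(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()

# Hypothetical registry, populated when a participant registers a dataset.
REGISTRY: dict[str, dict] = {}

def register_dataset(dataset_id: str, raw: bytes, legal_basis: str) -> None:
    REGISTRY[dataset_id] = {"sha256": fingerprint(raw), "legal_basis": legal_basis}

def admit_for_training(dataset_id: str, raw: bytes) -> bool:
    """Admit a third-party dataset only if it was registered, its bytes match
    the registered fingerprint, and a legal basis was declared."""
    entry = REGISTRY.get(dataset_id)
    return (
        entry is not None
        and fingerprint(raw) == entry["sha256"]
        and bool(entry["legal_basis"])
    )

register_dataset("catalog-feed-v7", b"example snapshot", "consent")
print(admit_for_training("catalog-feed-v7", b"example snapshot"))   # True
print(admit_for_training("catalog-feed-v7", b"tampered snapshot"))  # False
```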

3. Open networks can become global exemplars of privacy‑preserving AI

If ONDC and ION embed:

  • federated learning (sketched below),
  • edge‑based model training,
  • synthetic data generation, and
  • network‑level DPIAs,

they can demonstrate a third path distinct from the US “data maximalist” model and the EU “data fortress” model.
They can show how open ecosystems can innovate without compromising individual rights.
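
To make the federated‑learning option concrete: each participant computes a model update on data that never leaves its own node, and the network aggregates only the resulting parameters. The toy federated‑averaging sketch below uses a two‑parameter linear model; it illustrates the pattern only and is not an ONDC or ION protocol.

```python
def local_update(weights: list[float], local_data: list[tuple[float, float]], lr: float = 0.05) -> list[float]:
    """One gradient step on a participant's own data for a toy linear model
    y = w0 + w1 * x; the raw records never leave the participant."""
    w0, w1 = weights
    g0 = g1 = 0.0
    for x, y in local_data:
        err = (w0 + w1 * x) - y
        g0 += err
        g1 += err * x
    n = len(local_data)
    return [w0 - lr * g0 / n, w1 - lr * g1 / n]

def federated_average(updates: list[list[float]]) -> list[float]:
    # The coordinator aggregates only model parameters, never the data itself.
    return [sum(ws) / len(ws) for ws in zip(*updates)]

participants = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]  # each node's private data
weights = [0.0, 0.0]
for _ in range(1000):
    weights = federated_average([local_update(weights, d) for d in participants])
print(weights)  # approaches w0 ≈ 0, w1 ≈ 2 (the data follow y = 2x)
```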

 

Conclusion

The DPDP Act does not prohibit AI training; it demands discipline, transparency, and purpose integrity. Companies that treat data as a privilege rather than an entitlement will thrive. For open networks like ONDC and ION, the Act is not a constraint but an architectural opportunity: to build AI that is decentralised, privacy‑preserving, and mission‑locked. If executed well, India’s open networks could become the global reference model for responsible AI in federated digital ecosystems.

“The future of AI won’t be decided by algorithms—it will be decided by ethics.”

