AI Training Under India’s DPDP Act: How Companies Can Use Internal and Third‑Party Data Responsibly
The Digital Personal Data Protection Act, 2023 (DPDP Act)
has quietly but decisively redrawn the boundaries for how companies in India
can train artificial intelligence systems. Although the Act never uses the
phrase “AI training,” its consent‑centric, purpose‑limited architecture applies
squarely to every stage of model development. For any organisation building or
deploying AI—whether a startup, a platform, or a digital public infrastructure
(DPI) network—the DPDP Act is now the governing grammar.
This article unpacks what the law permits, what it
restricts, and how companies can navigate internal and third‑party data use for
AI training without falling into compliance traps.
1. The DPDP Principles That Directly Shape AI Training
Under the Act, AI training is simply another form of
“processing.” That means the full suite of obligations applies:
- Consent or valid legal basis is required for processing personal data
- Purpose limitation: data may be used only for the purpose stated at collection
- Data minimisation: only necessary data may be processed
- Accuracy and storage limitation
- Accountability of the Data Fiduciary (the company determining purpose and means)
These principles are not abstract—they determine whether a
dataset can legally be fed into a model.
2. Using Internal (First‑Party) Data for AI Training
When it is generally permissible
Companies may train models on their own collected data if:
(a) The stated purpose includes AI or service improvement
If the privacy notice explicitly states:
“We use your data to improve our services, including training machine
learning models,”
then AI training falls within the consented purpose.
(b) The Act’s “legitimate uses” apply
Certain activities do not require explicit consent, such as:
- Fraud detection
- Network and information security
- Compliance with legal obligations
AI models trained narrowly for these functions may rely on
these legitimate uses.
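To make both paths concrete, a pipeline can gate each record on its legal basis before ingestion. The sketch below is illustrative only: the record schema, purpose labels, and legitimate‑use labels are assumptions rather than anything prescribed by the Act, and whether a given function truly qualifies as a legitimate use is ultimately a legal question.

```python
# Minimal sketch of a per-record legal-basis gate for a training pipeline.
# The record schema, purpose labels, and legitimate-use labels are
# illustrative assumptions, not a schema prescribed by the DPDP Act.

AI_TRAINING_PURPOSES = {"service_improvement", "ml_model_training"}
NARROW_LEGITIMATE_USES = {"fraud_detection", "network_security"}

def eligible_for_training(record: dict, model_function: str) -> bool:
    """A record may be used if AI training was a consented purpose, or if the
    model serves a narrow function plausibly covered by a legitimate use."""
    consented = set(record.get("consented_purposes", []))
    if consented & AI_TRAINING_PURPOSES:
        return True
    return model_function in NARROW_LEGITIMATE_USES

records = [
    {"user_id": "u1", "consented_purposes": ["ml_model_training"]},
    {"user_id": "u2", "consented_purposes": ["payment_processing"]},
]

# A general-purpose model: only u1's record survives the gate.
general = [r for r in records if eligible_for_training(r, "general_assistant")]
# A narrow fraud model: both records pass under the (assumed) legitimate use.
fraud = [r for r in records if eligible_for_training(r, "fraud_detection")]
print(len(general), len(fraud))  # 1 2
```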
Where companies face compliance risk
1. Purpose creep
Data collected for a ride booking or payment transaction
cannot later be used to train a general‑purpose AI model unless the user was
informed upfront.
2. Excessive or irrelevant data usage
Training on full chat logs, behavioural histories, or
sensitive attributes may violate data minimisation.
3. High‑risk categories
While DPDP does not formally define “sensitive personal
data,” misuse of health data, children’s data, or financial information
attracts heightened scrutiny.
3. Best Practices for Internal Data
- Anonymise or de‑identify data before training (a minimal sketch follows this list)
- Use aggregated datasets wherever possible
- Maintain data lineage and audit trails
- Conduct Data Protection Impact Assessments (DPIAs) for high‑risk models
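The sketch below illustrates the first and third practices under an assumed record schema. One caveat worth stating in the code itself: salted hashing is pseudonymisation, not true anonymisation, so the output may still be personal data if the salt or other linkage keys are retained.

```python
import datetime
import hashlib
import json

# Sketch of de-identification plus a data-lineage record before training.
# Field names and the salt-handling scheme are illustrative assumptions.
# Note: salted hashing is pseudonymisation, not full anonymisation -- the
# output may remain personal data if the salt or linkage keys are retained.

SALT = "rotate-me"  # assumption: managed in a secrets store, rotated regularly
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def de_identify(record: dict) -> dict:
    """Drop direct identifiers and replace the user ID with a salted hash."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["user_id"] = hashlib.sha256((SALT + record["user_id"]).encode()).hexdigest()
    return cleaned

def lineage_entry(dataset: str, source: str, transform: str) -> dict:
    """Audit-trail entry recording where data came from and what was done to it."""
    return {
        "dataset": dataset,
        "source": source,
        "transform": transform,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

record = {"user_id": "u1", "name": "Asha", "email": "a@example.com", "txn_amount": 950}
train_row = de_identify(record)
print(json.dumps(lineage_entry("payments_v3", "payments_db", "de_identify_v1")))
```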
4. Using Third‑Party Data for AI Training
This is where the compliance terrain becomes significantly
more complex.
Scenario A: Purchased datasets
Permissible only if:
- The vendor collected data lawfully
- Data‑sharing agreements impose DPDP‑compliant obligations
- The purpose aligns with the original consent
Risk: If the vendor scraped or collected data
unlawfully, the purchasing company is still liable.
Scenario B: Web‑scraped or publicly available data
A common misconception is that “public” equals “free to
use.” Under DPDP:
- Public availability does not override purpose limitation
- Mass scraping for general‑purpose AI training is legally risky
- Individuals retain rights over their personal data even when posted online
Scenario C: Platform APIs and social media data
Companies must ensure:
- Platform terms explicitly permit AI training
- Users originally consented to downstream processing
Otherwise, the company risks violating both DPDP and
platform contracts.
5. Anonymisation as a Strategic Pathway
The DPDP Act applies only to personal data.
If companies can reliably convert data into:
- anonymised datasets, or
- synthetic data
then the Act no longer applies.
However:
- Weak anonymisation can lead to re‑identification
- If data can be linked back to an individual, it remains regulated
Anonymisation must be robust, irreversible, and documented.
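One way to test robustness before releasing a “de‑identified” dataset for training is a k‑anonymity spot check: every combination of quasi‑identifiers should be shared by at least k individuals. The sketch below assumes illustrative quasi‑identifier columns and a threshold of k = 5; a real anonymisation review would combine several such tests.

```python
from collections import Counter

# Sketch of a k-anonymity spot check on a candidate training dataset.
# The quasi-identifier columns and the threshold K are assumptions;
# robust anonymisation reviews layer multiple tests, not just this one.

QUASI_IDENTIFIERS = ("pincode", "age_band", "gender")
K = 5  # each quasi-identifier combination must appear at least K times

def smallest_group(rows: list[dict]) -> int:
    counts = Counter(tuple(r[c] for c in QUASI_IDENTIFIERS) for r in rows)
    return min(counts.values())

def passes_k_anonymity(rows: list[dict]) -> bool:
    return smallest_group(rows) >= K

rows = [{"pincode": "560001", "age_band": "25-34", "gender": "F"}] * 6
print(passes_k_anonymity(rows))  # True: the only group has 6 members
```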
6. Data Fiduciary vs Data Processor Responsibilities
- The company training the model is typically the Data Fiduciary
- Cloud providers or AI vendors are Data Processors
Fiduciaries must ensure processors follow:
- Purpose limitation
- Security safeguards
- Deletion and retention obligations
Contracts must reflect these duties.
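One practical approach is to encode these flow‑down duties as a pre‑engagement checklist that procurement and engineering can both run. The sketch below is a hypothetical structure for internal review, not a substitute for legal drafting.

```python
from dataclasses import dataclass

# Sketch of encoding DPDP-relevant processor obligations as a checklist.
# The clause names mirror the duties listed above; actual contracts
# need legal review, and this structure is an illustrative assumption.

@dataclass
class ProcessorContract:
    purpose_limited: bool       # processing restricted to the fiduciary's stated purpose
    security_safeguards: bool   # reasonable security measures specified
    deletion_flow_down: bool    # deletion and retention obligations passed to the processor

    def dpdp_ready(self) -> bool:
        return all((self.purpose_limited, self.security_safeguards, self.deletion_flow_down))

contract = ProcessorContract(purpose_limited=True, security_safeguards=True, deletion_flow_down=False)
print(contract.dpdp_ready())  # False: the deletion clause has not been flowed down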
7. Cross‑Border AI Training
The Act allows cross‑border transfers except to restricted
jurisdictions (to be notified).
This means:
- Training on global cloud infrastructure is permissible
- Companies must ensure no onward processing beyond the stated purpose
Cross‑border AI workflows must be mapped and contractually
controlled.
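Operationally, that mapping can be enforced as a gate in the training workflow itself. In the sketch below, the restricted‑jurisdiction list is an empty placeholder, since the government’s notified list did not yet exist; the function and field names are assumptions.

```python
# Sketch of gating cross-border training workloads against the
# (to-be-notified) restricted-jurisdiction list. The list is a placeholder
# and the function signature is an illustrative assumption.

RESTRICTED_JURISDICTIONS: set[str] = set()  # populate once notified by the Central Government

def transfer_allowed(destination: str, purpose: str, stated_purposes: set[str]) -> bool:
    """Allow a transfer only to a non-restricted jurisdiction and within the stated purpose."""
    return destination not in RESTRICTED_JURISDICTIONS and purpose in stated_purposes

print(transfer_allowed("SG", "ml_model_training", {"ml_model_training"}))  # True under these assumptions
```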
8. Children’s Data and High‑Risk AI
DPDP imposes strict obligations:
- Verifiable parental consent
- Prohibition on tracking, profiling, or targeted advertising
Using children’s data for AI training is therefore high‑risk
and often impractical.
9. Enforcement and Penalties
Non‑compliance can trigger:
- Monetary penalties up to ₹250 crore per breach category
- Orders to cease processing
- Reputational damage and loss of user trust
AI training pipelines must be designed with compliance baked
in, not bolted on.
10. Strategic Takeaway for Companies
Safe Zone
- Consent‑backed first‑party data
- Anonymised or synthetic datasets
- Narrow, purpose‑specific models
Risk Zone
- General‑purpose AI trained on user data without disclosure
- Scraped datasets
- Purchased data with unclear provenance
A simple rule of thumb:
If a user would not reasonably expect their data to train your AI model, you
likely need fresh consent.
11. Implications for Open Networks like ONDC and ION
Open networks introduce a unique governance challenge: data
flows across thousands of participants, each acting as a Data Fiduciary for its
own users while relying on shared protocols and registries. Under the DPDP Act,
this creates three structural implications:
1. Mission‑locked data boundaries become essential
ONDC and ION cannot allow network‑level data to be
repurposed for AI training unless:
- the purpose is explicitly disclosed,
- each participant has obtained valid consent, and
- the network’s governance framework authorises such use.
Without this, any attempt to build network‑wide AI models
risks violating purpose limitation.
2. Distributed compliance means distributed liability
If one participant misuses data for AI training, the
liability does not remain isolated.
Network operators must therefore:
- enforce strict data‑use covenants,
- mandate provenance checks for third‑party datasets, and
- require auditable anonymisation standards.
This is not optional; it is existential for trust.
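A network registry could, for instance, refuse to onboard any dataset whose provenance manifest is incomplete. The field names below are illustrative assumptions, not part of any published ONDC or ION specification.

```python
# Sketch of a network-level provenance check run before a participant
# registers a third-party dataset for AI use. All field names are
# illustrative assumptions rather than a published network schema.

REQUIRED_FIELDS = ("source", "lawful_basis", "consent_scope", "anonymisation_report")

def missing_provenance(manifest: dict) -> list[str]:
    """Return the provenance fields still missing from a dataset manifest."""
    return [f for f in REQUIRED_FIELDS if not manifest.get(f)]

manifest = {"source": "vendor_x", "lawful_basis": "consent", "consent_scope": "ml_model_training"}
print(missing_provenance(manifest))  # ['anonymisation_report'] -> registration blocked
```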
3. Open networks can become global exemplars of privacy‑preserving AI
If ONDC and ION embed:
- federated learning (see the sketch below),
- edge‑based model training,
- synthetic data generation, and
- network‑level DPIAs,
they can demonstrate a third path distinct from the US “data
maximalist” model and the EU “data fortress” model.
They can show how open ecosystems can innovate without compromising individual
rights.
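Of these, federated learning maps most directly onto DPDP’s minimisation logic: participants share model updates, never raw personal data. The sketch below uses synthetic data and a toy linear model to show the pattern; it is a minimal illustration, not a production recipe.

```python
import numpy as np

# Minimal federated-averaging (FedAvg) sketch: each network participant
# trains on its own users' data locally and shares only model weights,
# never raw personal data. The linear model and synthetic data are
# stand-ins for whatever each participant would actually train.

def local_update(w, X, y, lr=0.1, steps=50):
    """A few steps of gradient descent on a local least-squares objective."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
participants = []
for _ in range(3):  # three network participants, each with private data
    X = rng.normal(size=(40, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=40)
    participants.append((X, y))

w_global = np.zeros(2)
for _ in range(10):  # federation rounds
    local_ws = [local_update(w_global, X, y) for X, y in participants]
    w_global = np.mean(local_ws, axis=0)  # only weights cross the network

print(w_global)  # approaches [2, -1] without ever pooling raw data centrally
```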
Conclusion
The DPDP Act does not prohibit AI training; it demands
discipline, transparency, and purpose integrity. Companies that treat data as a
privilege rather than an entitlement will thrive. For open networks like ONDC
and ION, the Act is not a constraint but an architectural opportunity: to build
AI that is decentralised, privacy‑preserving, and mission‑locked. If executed
well, India’s open networks could become the global reference model for
responsible AI in federated digital ecosystems.
“The future of AI won’t be
decided by algorithms—it will be decided by ethics.”
