Introduction
Text and Data Mining (“TDM”) refers to a set of automated processes that scan, extract, and analyse large volumes of digital content, including text, audio, images, and structured and unstructured data, to uncover patterns, relationships, or other valuable information.[i] Artificial Intelligence systems depend heavily on TDM, especially for the training of Large Language Models (“LLMs”), generative AI tools, sentiment analysis algorithms, and predictive engines; without access to large datasets, AI cannot learn, improve, or perform at scale. TDM therefore underpins much of the modern data economy, particularly in healthcare, finance, law, education, and digital marketing.
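By way of illustration only (this is a toy sketch, not drawn from any statute, standard, or real pipeline), the short Python example below performs a rudimentary TDM step: it scans a small hypothetical corpus, extracts word co-occurrence counts, and surfaces the most frequent pairings as crude “patterns”. Real pipelines operate at vastly larger scale and add parsing, deduplication, and statistical or machine-learning analysis.

```python
# Toy illustration of a TDM step: extract simple co-occurrence "patterns"
# from a small text corpus. Purely illustrative; real systems are far larger
# and more sophisticated.
from collections import Counter
from itertools import combinations
import re

corpus = [  # hypothetical documents standing in for scraped content
    "Patients reported improved outcomes after the new treatment",
    "The new treatment reduced costs for patients and hospitals",
    "Hospitals adopted the treatment after improved patient outcomes",
]

def tokenize(text: str) -> list[str]:
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often pairs of words appear together in the same document.
pair_counts: Counter = Counter()
for doc in corpus:
    tokens = sorted(set(tokenize(doc)))
    pair_counts.update(combinations(tokens, 2))

# The most frequent co-occurring pairs serve as crude "patterns".
for pair, count in pair_counts.most_common(5):
    print(pair, count)
```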
Since AI development relies on bulk data scraping from websites, social media, databases, and digital platforms, the legality and ethics of such scraping practices are more relevant than ever. When personal data is mined, whether deliberately or inadvertently, without the knowledge or consent of the data subject, serious privacy concerns arise. It is precisely in such situations that robust data protection frameworks are needed.
The Digital Personal Data Protection Act, 2023 (“DPDP Act”), is India’s first comprehensive framework addressing the collection, storage, and processing of personal data. The Act aims to empower individuals while ensuring the responsible use of data, and to that end elaborates on notice, consent, purpose limitation, data administration, and the obligations of data fiduciaries. Curiously, however, the Act is silent on TDM: it neither indicates whether TDM activities constitute lawful processing nor delineates when and how they engage data protection rights. The treatment of specific categories of data in TDM contexts, namely sensitive personal data, personal data, and, to some extent, non-personal data, likewise remains obscure. This Article explores this legislative gap. It examines whether India’s existing privacy law is sufficient to govern the rapid evolution of AI and TDM, or whether a more tailored legal framework is required.
TDM under the EU Copyright Directive
Recognising the critical role that TDM plays in modern technological innovation, the European Union (“EU”) codified its regulation in the Directive on Copyright in the Digital Single Market (the “DSM Directive”). Recital 8 of the DSM Directive defines TDM as “any automated analytical technique aimed at analysing text and data in digital form to generate information such as patterns, trends and correlations”.
Articles 3 and 4 of the DSM Directive introduce important exceptions to traditional copyright concepts: Article 3 permits TDM for scientific research purposes, while Article 4 permits TDM for general purposes. Under Article 4, however, rights holders may opt out of the use of their works for TDM by expressly reserving their rights, including in machine-readable form for content made publicly available online.[ii]
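The Directive does not prescribe a single format for such machine-readable reservations. As a purely illustrative sketch, the Python snippet below checks one familiar machine-readable signal, a site’s robots.txt file, before fetching a page; this is an assumption for illustration, not a statement of how an Article 4 reservation must be expressed.

```python
# Minimal sketch: respecting a machine-readable opt-out signal before crawling.
# robots.txt is used here purely as an illustration of a machine-readable
# reservation; it is not the only (or a legally definitive) way to express
# an Article 4 rights reservation.
from urllib import robotparser

def may_mine(url: str, user_agent: str = "example-tdm-bot") -> bool:
    """Return True only if the site's robots.txt does not disallow crawling."""
    parser = robotparser.RobotFileParser()
    origin = "/".join(url.split("/")[:3])  # e.g. "https://example.com"
    parser.set_url(origin + "/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    target = "https://example.com/articles/some-page"  # hypothetical URL
    if may_mine(target):
        print("No opt-out signal detected; mining may proceed, "
              "subject to other legal grounds.")
    else:
        print("Opt-out signal detected; skip this content.")
```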
While the DSM Directive permits TDM for scientific research and general purposes, it says nothing about copyright ownership in outputs generated by AI systems once training has been completed. The protection of these outputs continues to be governed by traditional concepts of authorship and originality under existing copyright law.
GDPR and Data Scraping
In the EU, the processing of personal data, including internet-based processing, is governed by the GDPR.[iii] Any processing requires a lawful ground, such as consent, legitimate interest, or legal necessity. Where personal data is involved in TDM (for example, mining social media posts, product reviews, or digital health records), the GDPR applies: the data subject must be informed, and the data may not be processed without a lawful ground.
Hence, the Copyright Directive governs the legality of using protected works for TDM, while the GDPR adds a further compliance layer by ensuring that any personal data processed in the course of TDM is handled lawfully.
TDM and the DPDPA, 2023: The Legal Gap in India
India’s Digital Personal Data Protection Act, 2023 (DPDPA)[iv] adopts a consent-based approach to the processing of personal data: notice must be given and explicit consent obtained from the data principal before processing begins. It also embodies other key principles, such as purpose limitation, under which data may be used only for the specified purpose for which it was collected, and lawful processing, under which data may be used only where permitted by constitutional or legal provisions, such as for employment, legal necessity, or functions of the State. While the Act addresses privacy with a formal and structured approach, it makes no provision for TDM, web scraping, or automated data collection for machine learning, leaving these activities in a legal grey zone. This raises questions about whether it is lawful to scrape publicly available personal data, such as LinkedIn bios or tweets, without consent, and whether machine-readable mass opt-outs of the kind recognised in the EU would be legitimate in India. Unlike the EU Copyright Directive, the DPDPA carves out no exceptions for AI research or development, creating legal uncertainty for developers and data scientists, who may inadvertently risk violating privacy norms.
Privacy Risks and Legal Uncertainty in Unregulated TDM
Unregulated use of TDM techniques poses serious risks of privacy violations and surveillance.[v] TDM engines enable the covert extraction of sensitive personal data, on the basis of which individuals can be profiled and algorithmically discriminated against according to inferred attributes such as religious beliefs, political views, or sexual orientation, thereby infringing informational privacy. The absence of any transparency mechanism means that data subjects have no way to learn how their data is being used, withdraw their consent, or exercise their rights as data principals, leaving them vulnerable while AI developers operate unchecked. In India, AI companies face uncertainty about compliance: no clear legal framework stipulates the conditions under which Indian data used for model training attracts prescribed obligations, because it is unclear whether the unauthorised scraping or mining of public data constitutes “processing” under the DPDPA, whether consent is required for public data, and whether anonymised data is excluded from its ambit.[vi] This uncertainty hinders transparent AI development, stifles ethical growth, and heightens the risk of legal challenges.
The gravest concerns regarding generative AI relate to systems that apply TDM techniques at scale, since privacy rights may be violated and intellectual property rights infringed whenever such methods are applied to personal data without consent or attribution.[vii] Therein lies a dichotomy between India’s ambition of becoming an AI hub and the need to protect digital rights. Unregulated TDM, even if it accelerates innovation, breeds public distrust and limits legal recourse. Important legal and policy questions arise, such as: would TDM of personal data violate the DPDPA’s principles of consent and purpose limitation? Certainly yes, if it is carried out without informing or obtaining consent from the data subject, since their data would be used for purposes other than those for which it was collected.[viii] Currently, there is no express statutory or judicial regime governing TDM in India; developers therefore operate either under assumed permissions or in a legal grey area where public data is treated as fair game. While the broad definition of data processing under the DPDPA may theoretically include TDM, it does not address its unique characteristics, such as bulk scraping or automated harvesting, so separate regulatory recognition or guidance is necessary for clarity and compliance.[ix]
India’s Silence on TDM in Comparison with the EU
Through its copyright and data protection regimes, the European Union has explicitly recognised and regulated TDM, with the EU Copyright Directive placing significant emphasis on the machine-readable opt-out available to rights holders, thereby offering legal certainty and a well-defined framework for legitimate data mining.[x] In contrast, India has yet to specify comparable terms, safeguards, or mechanisms through which compliance with TDM activities can be established, leaving stakeholders with little guidance on what constitutes permissible practice. This regulatory vacuum has significant implications for AI governance and data protection: in the absence of explicit safeguards, TDM operations in India could devolve into mass surveillance, unauthorised data exploitation, and violations of intellectual property rights. The vacuum also creates uncertainty for foreign companies handling Indian data, undermining trust and accountability in AI development and potentially discouraging responsible innovation.[xi]
Recommendations and Conclusion
The rapid integration of Artificial Intelligence into various sectors in India necessitates a legal framework that promotes innovation while safeguarding individual rights. Although the DPDPA provides the basic structure of data protection, it says little about TDM, a glaring omission given TDM’s growing prominence. Explicit recognition and regulation of TDM, especially where the processes involve personal or publicly available data, are absent in India.
There is an urgent need for explicit recognition of TDM within India’s legal framework, spanning civil, regulatory, and criminal law. This can be effected in two ways: first, by prescribing detailed provisions on automated data scraping and mining under the yet-to-be-notified DPDPA Rules; and second, by enacting a specific AI law to address the emerging challenges in machine learning, model training, and data ethics.[xii]
A responsible data mining framework should include the following:
- Mandatory disclosures on TDM operations by AI developers.
- Consent or opt-out options for data subjects, particularly when personal data is mined from public platforms.
- Specific safeguards for sensitive data coming from different sectors (e.g., health, financial, biometric).
- Clear definitions of what constitutes anonymised, pseudonymised, and identifiable data in TDM contexts (see the sketch after this list).
- Alignment with international norms, such as the TDM opt-out schemes and exceptions provided in the EU.
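To illustrate the distinction drawn in the fourth point, the following minimal Python sketch shows one common, simplified technique: pseudonymising an identifier with a salted hash versus dropping the identifier altogether. This is an assumption-laden illustration, not a statement of what the DPDPA or its Rules require; truly anonymised data should not be re-linkable to an individual at all.

```python
# Minimal sketch (illustrative only): pseudonymisation vs. removal of identifiers.
# A salted hash replaces a direct identifier with a token that can still be
# linked back by whoever holds the salt, so the result is pseudonymised,
# not anonymised, data.
import hashlib
import secrets

SALT = secrets.token_bytes(16)  # kept secret by the data fiduciary (assumption)

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier (e.g. an email) with a salted hash token."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

record = {"email": "user@example.com", "review": "Great product, fast delivery."}

pseudonymised_record = {
    "user_token": pseudonymise(record["email"]),  # re-linkable with the salt
    "review": record["review"],
}

anonymised_record = {
    "review": record["review"],  # direct identifier dropped entirely
}

print(pseudonymised_record)
print(anonymised_record)
```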
Clearly, India must bridge the existing gap on the TDM front if it aspires to be an ethical and trustworthy AI hub. In the absence of any statutory instrument, the existing framework offers no protection for digital rights and no practical clarity on compliance for developers. Explicit regulation of TDM is not merely desirable; it is essential for balanced AI governance in India.
References:
[i] Martin Husovec, ‘The Magic of Technology-Specific Legislation: Copyright Exceptions for Text and Data Mining (TDM)’ (2018).
[ii] European Data Protection Board, ‘Guidelines 2/2019 on the Processing of Personal Data under Article 6(1)(b) GDPR’ (2019).
[iii] Lilian Edwards and Michael Veale, ‘Slave to the Algorithm? Why a “Right to an Explanation” Is Probably Not the Remedy You Are Looking for’ (2017).
[iv] Digital Personal Data Protection Act, 2023, Act No. 22 of 2023.
[v] Malavika Raghavan, ‘Consent in the Age of AI: India’s Missed Opportunities in the DPDPA’ (2024) 21(1) Indian Journal of Law & Tech 33.
[vi] Sandra Wachter, Brent Mittelstadt and Luciano Floridi, ‘Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation’ (2017) 7 International Data Privacy Law 76.
[vii] Graham Greenleaf and Bertil Cottier, ‘Artificial Intelligence and Data Protection: Challenges and Regulatory Perspectives’ (2020) 6(2) Computer Law Review International 45.
[viii] Anirudh Rastogi, ‘Is AI Compliant with India’s DPDPA?’ (2023) The Leaflet https://theleaflet.in accessed 10 July 2025.
[ix] Samir Mathur, ‘Training AI Models: Ethical and Legal Limits in the Indian Context’ (2024) 12(2) NLU Delhi Law Review 99.
[x] Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market, arts 3 and 4.
[xi] Arghya Sengupta, ‘Regulating AI in India: The Case for Specific Legislation on TDM’ (2023) 18(2) NUJS L Rev 45.
[xii] NITI Aayog, Responsible AI for All: Strategy for India (2021).