IIITH explores AI-driven Education Data Models for Schools

An IIITH Raj Reddy Center for Technology and Society (RCTS) curated roundtable on ‘Common Education Data Models for Schools’ brought together key stakeholders in the largely unwieldy education sector. They were tasked with quantitatively categorizing problems, with the end goal of building solutions to specific issues based on AI and other emerging technologies.

‘If you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.’ Building on this premise, the roundtable reviewed processes for creating common data models from existing demographic and education assessment data, to aid different levels of school management. The discussion surfaced the types of data currently being gathered and the prevailing challenges in implementing school-developed solutions, and explored data-driven insights and technology solutions to aid administrators and policy-makers.

Moderated by Dr. Rohit Kandakatla of KGRCE, the closed-door deliberations on 6th July brought together key stakeholders comprising the grassroots NGO Care India, pedagogy experts, technologists, researchers and innovators from Talent Sprint, Key Stone Foundation, Creya Learning, Infosys Springboard, ConveGenius, TCS-ION and AIC-IIITH.

The existing mode of school data collection
The two largest datasets currently available to researchers at the cluster/district/national level are the Annual Status of Education Report (ASER) and the National Achievement Survey (NAS). Smaller-scale data collection initiatives at the individual level include the State Achievement Survey, School Application Data and School Scheme data. Data is collected at three levels: students, demographics of parents, and student-teacher attendance. At the administration level, information is collated on support programs like the National Free Food program for students (NP-NSPE), school infrastructural facilities, and continuous periodic assessments.

Challenges in the current data collection systems
Challenges in data collection efforts were bucketed into three distinct phases. The conceptualization phase looked at the challenges of communicating the need for data collection to all entities down the pipeline; data systems have to be integrated into the daily workflow, along with systemized mandates and vision communication. The generation and data-entry phase spotlighted the challenges of standardizing data collection and storage procedures. The utilization phase looked at the analytics and indicators that should be made available to entities at all levels.

Addressing Missing Links in Data Collection – In the highly complex education domain, issues and their interdependencies are not easily identifiable. Current top-down solutions collect data only for known problems. A more holistic approach would require comprehensive datasets monitored longitudinally, to study the efficacy of different intervention efforts and allocate resources better, while keeping an eye out for unknown issues or the repercussions of intervention programs.

Data recording should ideally cover both government and private school systems, for a broader perspective on how different levels of resources contribute to student education and well-being, and to tweak policy-level efforts to address administrative discrepancies. However, private schools operate in a separate bubble, and the spectrum of data they consider important is quite different. For instance, health data collected under “Ayushman Bharat’s School Health Programs” applies only to government schools.

Mental health and physical impairments not factored in – Parameters such as mental health or visual, auditory, speech or other physical impairments are not being factored in while designing digital content programs.

Lacuna of longitudinal studies of student development – Grades are currently the chief annual benchmark of a student’s performance, but they neither capture the potential of good students nor address the needs of poor performers. A longitudinal approach to student development would require standardized recording, maintenance and accessibility of student performance data and corresponding analytics, over multiple years and across schools.

Importance of a Micro & Macro Approach – To capture actual learning data that the current NEP policy could utilize, data collection processes must be standardized at the micro level (individual student/teacher), and data aggregation and analytics dissemination must be streamlined at the macro level (state/central). Third-party validation and auditing of data at multiple stages of the collection process would verify the veracity of the extracted information and ensure it reflects the true picture.

Lack of standardized data linking protocols between agencies – The education space faces problems of interoperability and compatibility, with multiple agencies in the community capturing different datasets. Since these datasets are not designed to be linked in a standardized fashion, their usability is diminished.
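
To illustrate one way such linking could work, the sketch below carries a shared, anonymized student identifier in every record so that datasets from different agencies can be joined downstream. All field and function names here are invented for illustration, not a proposed standard.

```python
# Illustrative sketch: a common, anonymized student_id shared by every agency
# lets otherwise-unrelated datasets be linked in a standardized fashion.
from dataclasses import dataclass

@dataclass
class AssessmentRecord:
    student_id: str      # shared, anonymized identifier
    academic_year: str   # e.g. "2021-22"
    school_code: str     # school identifier
    subject: str
    score: float

@dataclass
class AttendanceRecord:
    student_id: str
    academic_year: str
    school_code: str
    days_present: int
    days_in_session: int

def link_by_student(records):
    """Group heterogeneous records under the shared student_id."""
    linked = {}
    for rec in records:
        linked.setdefault(rec.student_id, []).append(rec)
    return linked
```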

Roadblocks to successful deployment of curated solutions
Multiple data collection systems address different problems, each with its own goals and constraints, and their data is collected by personnel with varying levels of training, tools, access and incentives, making it difficult to compare even common features across the different datasets.

Reliability of the dataset – One main hurdle envisaged was the authenticity of the dataset. The lack of simple validity and logic checks to prevent deliberate data manipulation at the time of collection, along with the workloads of personnel, especially teachers, were seen as red flags leading to inadvertent mistakes in recorded data. Panelists recommended standardizing data collection and storage, strictly enforced through automated validity checks and periodic third-party audits, along with employing competent personnel for data collection.
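
A minimal sketch of the kind of automated validity check the panelists recommended might look like the following; the field names and thresholds are illustrative assumptions, not part of any standard raised at the roundtable.

```python
# Illustrative validity and logic checks applied at the point of data entry,
# before a record is stored. Field names and limits are invented.
def validate_attendance_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means the entry passes."""
    errors = []
    present = entry.get("days_present")
    in_session = entry.get("days_in_session")
    if present is None or in_session is None:
        return ["missing attendance fields"]
    if not 0 < in_session <= 260:
        errors.append(f"implausible days_in_session: {in_session}")
    elif not 0 <= present <= in_session:
        errors.append(f"days_present ({present}) outside 0..{in_session}")
    return errors

# Usage: flag the record for review instead of silently accepting it.
problems = validate_attendance_entry({"days_present": 210, "days_in_session": 200})
if problems:
    print("entry flagged:", problems)
```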

Unqualified human resources – Panelists identified lack of training in the use of data collection tools, poor communication of the importance of the task at hand, lack of incentives, and deliberate malpractice as factors affecting data quality and reliability. The absence of clearly defined objectives (who collects the data, why, and how it is handled) was seen as the main reason systems are designed with little thought given to interoperability, and why unified systems have failed over the years.

Data access and privacy – Data access and privacy become significant issues when individual performance data is shared across organizations. Strict access control and anonymization tools should be built into system design at the conceptualization phase, with the protocol for usage standardized at a policy level to ensure widespread compliance.
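
One common anonymization technique that fits this recommendation is keyed hashing, which replaces raw student IDs with stable pseudonyms before records leave the data custodian. The sketch below is an assumed illustration of the idea, not a protocol proposed at the roundtable.

```python
# Illustrative pseudonymization before cross-organization sharing. Recipients
# can still link records for the same student, but cannot recover the raw ID.
import hashlib
import hmac

SECRET_KEY = b"held-by-the-data-custodian-only"  # never shared with recipients

def pseudonymize(student_id: str) -> str:
    """Replace a raw student ID with a keyed (HMAC-SHA256) hash."""
    return hmac.new(SECRET_KEY, student_id.encode(), hashlib.sha256).hexdigest()

# The shared record carries the pseudonym instead of the identifying ID.
record = {"student_id": pseudonymize("AP-2021-004317"), "math_score": 62}
```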

Lost in translation – To accommodate India’s diversity, data collection systems are generally designed to work in multiple languages, which leaves scope for misinterpretation due to language discrepancies. Therefore, any multi-language data collection system must clearly demarcate possible areas of confusion and supply supportive clarification questions to data collectors and analysts.

Leveraging data-driven insights and tech solutions
Data entry should be a daily practice, with anonymity and gamification built into the telemetry and learning process to improve student participation and lower its friction. Certain processes need to be built in, like NDEA (National Digital Education Architecture) certification for operating systems, HIPAA-style safeguards for data portability, and regionalization of content and systems.

Digitization and automation of data collection processes – Before building new tech solutions, the first task would be to digitize the current data collection and storage processes with minimal implementation friction. The developed systems must satisfy three major criteria: consistency, frequency and dynamicity. Internet connectivity, alternative avenues to handle the lack of resources, and organic onboarding of alternate data streams should be in place for widespread adoption of digitization solutions.

Seamless, user-friendly and politically agnostic solutions – The lowest skill level of the end user should be factored in when designing technology solutions. They should be user-friendly to the least tech-savvy, with minimal training required for onboarding data collectors, analysts and management. Politically agnostic solutions serve a dual purpose: they harness political will as a tailwind to expand adoption of the system, while avoiding the pitfall of the system being discarded with a change in administration.

Cross-pollination of data collection efforts – Panelists also considered adopting best practices from other successful large-scale data collection efforts, onboarding those systems to allow quick access to larger populations, and looking beyond education for cross-pollination.

Personalized Learning Solutions – A key issue endemic to the current education system is its “one size fits all” nature. Student assessment information should be expanded to allow personal learning to be tracked beyond simple scores. Automated learning recommendation engines could be driven by learning levels rather than age criteria. Solution design must capture data on how each student engages in class, to allow for better content design and suggestions. These metrics would help in personalized learning recommendations and serve as teaching aids to monitor learning efforts.
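
As a toy illustration of a recommendation engine keyed to learning level rather than age, the sketch below maps assessment scores to level bands and recommends content accordingly; the bands and content lists are invented assumptions.

```python
# Illustrative level-driven recommendation: content follows demonstrated
# learning level, not the student's age or grade. All values are invented.
LEVEL_BANDS = [(0, 40, "foundational"), (40, 70, "intermediate"), (70, 101, "advanced")]

CONTENT_BY_LEVEL = {
    "foundational": ["number-sense drills", "guided reading"],
    "intermediate": ["word problems", "independent reading"],
    "advanced": ["open-ended projects", "peer tutoring"],
}

def learning_level(score: float) -> str:
    """Map an assessment score (0-100) to a learning level band."""
    for low, high, level in LEVEL_BANDS:
        if low <= score < high:
            return level
    raise ValueError(f"score out of range: {score}")

def recommend(score: float) -> list:
    """Recommend content by demonstrated level, independent of age."""
    return CONTENT_BY_LEVEL[learning_level(score)]

print(recommend(55))  # ['word problems', 'independent reading']
```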

The road ahead
The Center has collaborated with a school-complex cluster to pilot technology interventions and explore possible tech solutions for teaching tools, holistic progress report design and formative assessments. Going forward, the Center will examine the current data sources and build models that process data from different sources and end-users during development.

The Center proposes to start the model-building process based on the blueprint of requirements and challenges that the roundtable highlighted, and aims to develop a system with enough backend model flexibility to allow new features to be added. Algorithm development for projection and prediction in the proposed information management system will feature real-time dashboards catering to the different parties in the data pipeline, to encourage individual usage of the deployed systems.
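
As a rough indication of the projection component such dashboards might use, the sketch below fits a least-squares trend line to yearly scores and extrapolates one year ahead; the data is invented, and this is one simple technique rather than the Center’s planned algorithm.

```python
# Illustrative projection: fit a straight line to past yearly scores and
# extrapolate the next year's value for a dashboard indicator.
def project_next(scores: list) -> float:
    """Least-squares linear trend over yearly scores, extrapolated one year ahead."""
    n = len(scores)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(scores) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * n  # predicted score for the next year

print(round(project_next([48.0, 52.5, 55.0, 59.5]), 1))  # 63.0
```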