Reading the Whole Story that Data Wants to Tell

By: Yihan Zhao

Thursday, Aug 25, 2016

Recent education leadership research has shown that the emerging domain of data driven decision making (3DM) helps practitioners, researchers and policymakers use data already collected in schools to make more effective education and resource allocation decisions. The success of a 3DM strategy depends not only on the quality of data collection but also on how the results are interpreted. In schools, there are two main streams of academic student data: teacher-assigned grades and standardized test scores reported to administrators. With this information on hand, most researchers interpret student data using summary statistics, such as means and standard deviations. These descriptive methods are good for creating an overview of the overall trend by aggregating the information. However, by describing the basic features of the data with simplified summaries, they lose the richness of the data at the same time. Neither these descriptive methods nor conventional statistical methods, such as t-tests, correlations, or regressions, can describe the individual data for hundreds or thousands of students, or visualize individual trends and patterns in the data.
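To make the point about lost richness concrete, here is a minimal sketch using invented numbers. The three students and their yearly grade averages are hypothetical and do not come from any study; they are chosen only to show how a flat yearly mean can sit on top of very different individual trajectories.

```python
from statistics import mean

# Hypothetical toy data: yearly grade averages (0-4 scale) for three invented students.
grades = {
    "student_a": [4.0, 3.5, 3.0, 2.5, 2.0],  # steadily declining
    "student_b": [2.0, 2.5, 3.0, 3.5, 4.0],  # steadily improving
    "student_c": [3.0, 3.0, 3.0, 3.0, 3.0],  # flat
}

# Aggregate view: one mean per year across all students.
yearly_mean = [mean(year) for year in zip(*grades.values())]

# Individual view: change from the first to the last year for each student.
change = {name: g[-1] - g[0] for name, g in grades.items()}

print(yearly_mean)  # the same value every year
print(change)       # yet the individual trajectories differ sharply
```

The aggregate series suggests that nothing changes over time, while the per-student view shows one decline, one improvement, and one stable record.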

A cluster analysis heatmap converts data points from numbers into color blocks, making it possible to visualize the pattern of numbers, where hotter colors represent higher scores and cooler colors represent lower scores. Cluster analysis heatmaps have been used in multiple research domains for decades, and became popular over the last 15 years in bioinformatics and molecular biology. This big data visual analytic technique patterns and visualizes high-dimensional data, displaying individual trends and correlations, with similar individual data patterns grouped together.

In a 2010 research article in the journal Practical Assessment, Research & Evaluation, Alex J. Bowers adapted the cluster analysis heatmap technique from bioinformatics, combining hierarchical cluster analysis (HCA) with a heatmap to pattern and interpret the trends in longitudinal student data. A cluster analysis heatmap consists of three main sections. First, the center displays the standardized scores for each student's teacher-assigned grades in each subject at every grade level, with intense blue representing lower grades, gray representing average grades, intense red representing higher grades, and white representing missing data. The X-axis of the clustered heatmap runs from lower to higher grade levels, for example from elementary school to high school. Second, the left side of the heatmap presents the clustering tree, in which shorter branch lengths indicate greater similarity between clustered students, and vice versa. Third, the right side lists categorical variables for each student, such as demographic variables and dichotomized outcomes, with a black bar indicating the presence of the variable. Teacher-assigned grades are used because grades are well known to predict overall student outcomes.

To illustrate the method, Bowers analyzed 188 students' grades recorded from kindergarten through grade 12 in June of 2006 to predict the likelihood of students dropping out of high school. He kept missing data in the analysis, because student school transfers, course-taking patterns that differ across grade levels and subjects, and dropping out of high school can create a somewhat sparse data matrix. To address this issue, Bowers used the average linkage clustering algorithm, which calculates the mean pairwise distance to specify the similarity or dissimilarity between observations, and then hierarchically clusters the most similar data points together in an iterative fashion.
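The standardization and average linkage steps described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used in the study: the distance measure (root-mean-square difference over the columns two students both have), the function names, and the four-student toy matrix are all hypothetical choices made for this example. Missing grades are represented as None.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each column using only its observed (non-None) values."""
    out_cols = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        mu, sd = mean(seen), pstdev(seen)
        out_cols.append(
            [None if v is None else ((v - mu) / sd if sd else 0.0) for v in col]
        )
    return [list(r) for r in zip(*out_cols)]

def distance(a, b):
    """Assumed metric: RMS difference over columns observed for both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared columns, so treat as maximally dissimilar
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def average_linkage(rows):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[i] for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean of all pairwise member distances.
                dists = [distance(rows[p], rows[q])
                         for p in clusters[i] for q in clusters[j]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Hypothetical toy matrix: 4 students x 3 grade levels, None = missing grade.
toy = [
    [3.8, 3.9, 4.0],
    [3.7, None, 3.9],
    [1.5, 1.2, 1.0],
    [1.8, 1.4, None],
]
merges = average_linkage(zscore_columns(toy))
for left, right, d in merges:
    print(left, right, round(d, 3))
```

In this toy example the two high-grade students merge first, the two low-grade students merge next, and the final merge joins the two groups. That is the same top-level split a heatmap's clustering tree would show. For real cohorts, an established statistics library would be a better choice than this quadratic-per-merge loop.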

Cluster analysis heatmap of student course grades from: Bowers, A.J. (2010) Analyzing the Longitudinal K-12 Grading Histories of Entire Cohorts of Students: Grades, Data Driven Decision Making, Dropping Out and Hierarchical Cluster Analysis. Practical Assessment, Research & Evaluation (PARE), 15(7), 1-18. http://pareonline.net/pdf/v15n7.pdf

As shown in the figure, a cluster with an orange vertical bar on the left represents a pattern of consistently high longitudinal student grades from elementary through high school for all subjects, while a cluster with a purple bar on the left represents consistently low grades. The green vertical bar represents a pattern in which students start elementary school with relatively high grades but then progressively receive lower grades from about grade 4 onward. In contrast, the yellow bar represents students who start with low grades, but whose grades rise around grade 4 and stay high throughout. For the analysis of dropping out, the 188 students can be grouped into two main clusters separated by the horizontal black dashed line: the clusters with the orange and yellow vertical bars belong to one, and the clusters with the green and purple bars belong to the other. The dropout variable is listed on the right side of the heatmap, where 88.6% of the students who dropped out of school belong to the cluster below the black dashed line, with the majority of their grades in blue, representing low grades overall.
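A figure such as 88.6% is the share of all dropouts who fall in one cluster, which is a different quantity from the dropout rate inside that cluster. The sketch below computes both for a two-cluster split. The function name, the cluster labels "upper" and "lower", and the ten invented students are hypothetical and are not the study's data.

```python
from collections import Counter

def dropout_summary(cluster_labels, dropped_out):
    """Return (share of all dropouts, dropout rate within cluster) per cluster."""
    sizes = Counter(cluster_labels)
    dropouts = Counter(c for c, d in zip(cluster_labels, dropped_out) if d)
    total = sum(dropouts.values())
    # Share: of everyone who dropped out, what fraction sits in this cluster?
    share = {c: (dropouts[c] / total if total else 0.0) for c in sizes}
    # Rate: of everyone in this cluster, what fraction dropped out?
    rate = {c: dropouts[c] / sizes[c] for c in sizes}
    return share, rate

# Hypothetical toy cohort: 6 students above the split, 4 below it.
labels = ["upper"] * 6 + ["lower"] * 4
dropped = [False, True, False, False, False, False, True, True, False, True]

share, rate = dropout_summary(labels, dropped)
print(share)
print(rate)
```

Reporting the two numbers side by side makes it clear whether a cluster is notable because it contains most of the dropouts, because most of its members drop out, or both.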

The central purpose of the study is to provide a useful means to disaggregate educational data and visualize the grade patterns across a large number of students simultaneously, without relying on overall averages that group all students together and hide the individual variability. The above example, using hierarchical cluster analysis and the heatmap technique, provides a way to display an entire cohort of students in detail and to identify groupings of data patterns that can be applied in predicting future outcomes. Furthermore, this data mining technique can help pinpoint when student grades may stabilize for certain clusters of students. As recently noted in a related blog post by @science_goddess on the use of hierarchical cluster analysis heatmaps with education data: “Educational research doesn’t represent individuals as individuals, but as populations. We seek to generalize, because we feel we have to, given the amount of data we collect. But I can’t help but wonder what we’d see if we looked at all of the educational data we collect at both the micro and macro levels.”

The full study is in Bowers, A. J. (2010). Analyzing the longitudinal K-12 grading histories of entire cohorts of students: Grades, data driven decision making, dropping out and hierarchical cluster analysis. Practical Assessment, Research & Evaluation, 15(7), 1-18. Open Access Preprint: http://dx.doi.org/10.7916/D8QC02TX
