How to avoid creating a data minefield

State education departments collect mountains of data from their school districts to prepare federal reports, a tedious and cumbersome task regardless of the number of districts. To facilitate the process, the Georgia Department of Education recently hired Steve Gabrielson as its data-mining and warehousing specialist.

Gabrielson sees Georgia as leading the pack nationally, because the state is “very cutting edge” in its data-collection methods. Districts submit data to the state over the internet, with certain elements in specified fields. This submission format may differ from the format a district originally used to store the data.

The department’s “data error-checking system” goes through the data and generates a list of errors that must be corrected before the information is accepted. Once the information reaches the warehouse, the querying begins.
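As a rough illustration of this kind of gatekeeping, the Python sketch below checks each submitted record and accepts a submission only when the error list is empty. The field names, the grade rule, and the function names are hypothetical; the article does not describe the rules Georgia's system actually applies.

```python
# Hypothetical rules for illustration; not Georgia's actual field list.
REQUIRED_FIELDS = ("district_id", "student_id", "grade", "gender")
VALID_GRADES = {"K"} | {str(n) for n in range(1, 13)}


def check_record(record, row_number):
    """Return a list of error messages for one submitted record."""
    errors = []
    for field in REQUIRED_FIELDS:
        # Treat a missing key and a blank value the same way.
        if not str(record.get(field, "")).strip():
            errors.append(f"row {row_number}: missing {field}")
    grade = str(record.get("grade", "")).strip()
    if grade and grade not in VALID_GRADES:
        errors.append(f"row {row_number}: invalid grade {grade!r}")
    return errors


def accept_submission(records):
    """Return (accepted, errors); accepted is True only with no errors."""
    errors = []
    for row_number, record in enumerate(records, start=1):
        errors.extend(check_record(record, row_number))
    return (not errors, errors)
```

A district would receive the returned list, fix the flagged rows, and resubmit until the list comes back empty.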

One way to do this, Gabrielson explains, is with Online Analytical Processing (OLAP), which “allows you to look at different dimensions of data you may not have thought of before.” Georgia plans to start with student enrollment, where “you can look at it by race, by gender, and look for patterns—then throw in test scores,” Gabrielson said. “The bottom line is student achievement and lower dropout rates.”
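The idea of looking at enrollment along different dimensions can be sketched in a few lines of Python. This is not an OLAP server, only a minimal roll-up over a handful of invented enrollment rows; the `ENROLLMENT` data and the `roll_up` function are assumptions made for illustration.

```python
from collections import defaultdict

# Invented enrollment counts; each row is one combination of dimensions.
ENROLLMENT = [
    {"district": "A", "race": "Black", "gender": "F", "students": 120},
    {"district": "A", "race": "Black", "gender": "M", "students": 110},
    {"district": "A", "race": "White", "gender": "F", "students": 90},
    {"district": "A", "race": "White", "gender": "M", "students": 100},
    {"district": "B", "race": "Black", "gender": "F", "students": 60},
    {"district": "B", "race": "White", "gender": "M", "students": 80},
]


def roll_up(facts, dimensions, measure="students"):
    """Total the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


by_gender = roll_up(ENROLLMENT, ("gender",))
by_race_and_gender = roll_up(ENROLLMENT, ("race", "gender"))
```

Changing the tuple of dimensions changes the view of the same data, which is the kind of pattern hunting Gabrielson describes; a measure such as test scores could be added as another field on each row.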

The chief component of an OLAP system is the server, which sits between a client computer and a database management system. The OLAP server understands how data are organized in the database and has special functions for analyzing them. OLAP servers are available for nearly all major database systems.

The department hopes to include “external data”—census statistics and U.S. Department of Education data—with the “internal data” regularly collected from school districts, such as student grade-point averages, certified personnel, and student retention. Another area the department wants to analyze is the amount of money it spends—is it too much? Too little? Is the department getting enough for its dollar? The state ultimately wants to derive the cost of its educational programs at the classroom level.

Many of the department’s stakeholders—including the governor’s office, school districts, and various internal groups—have access to data through the state’s intranet and use it to enhance their education decision making. Eventually, Gabrielson would like to move all the information onto a web site, “so users can go on and slice and dice the data any way they want.”

The hardest part about designing a data warehouse, Gabrielson says, is determining what information you want to gather. Choosing fine “granularity,” or the ability to input data at its smallest level, will provide more information, thereby increasing the performance of the warehouse, but it also makes the process of gathering data labor intensive.
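The granularity trade-off can be seen in a small Python sketch. It assumes a few invented student-level rows and a hypothetical `summarize` helper: rows stored at the student level can be summarized by school or by school and grade, whereas a warehouse that stored only school-level totals could not be broken back down by grade.

```python
from statistics import mean

# Invented student-level rows (the finest grain in this sketch).
STUDENT_ROWS = [
    {"school": "S1", "grade": 9, "gpa": 3.0},
    {"school": "S1", "grade": 9, "gpa": 2.0},
    {"school": "S1", "grade": 10, "gpa": 4.0},
    {"school": "S2", "grade": 9, "gpa": 3.5},
]


def summarize(rows, keys):
    """Return {group: (student_count, average_gpa)} for the chosen grain."""
    groups = {}
    for row in rows:
        group = tuple(row[k] for k in keys)
        groups.setdefault(group, []).append(row["gpa"])
    return {group: (len(gpas), mean(gpas)) for group, gpas in groups.items()}


by_school = summarize(STUDENT_ROWS, ("school",))
by_school_and_grade = summarize(STUDENT_ROWS, ("school", "grade"))
```

The cost is on the collection side: every student row has to be gathered and checked, rather than one total per school.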

Before making the decision about what to include, he advises interviewing the end user. “If you’re not asking the end users what they want to see, you may waste time by not coming up with information that’s really useful,” he said.

In general, Gabrielson advises others who are considering data warehousing and mining to start small and think ahead. Focus on one area at a time, as Georgia has by starting with student enrollment. Determine what you are going to use the data for and design a process for “cleaning” it. “This is a formal process and an important step, so take your time with this,” he said.
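What a “cleaning” step might involve is sketched below in Python, assuming three common chores: trimming stray whitespace, mapping inconsistent codes to one standard, and dropping duplicate records. The field names, the `GENDER_CODES` table, and both functions are hypothetical; the article does not say how Georgia cleans its data.

```python
# Hypothetical mapping from values districts might send to one standard code.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def clean_record(record):
    """Collapse extra whitespace in every field and standardize the gender code."""
    cleaned = {key: " ".join(str(value).split()) for key, value in record.items()}
    gender = cleaned.get("gender", "").lower()
    # Unrecognized values become blank so they can be flagged for review.
    cleaned["gender"] = GENDER_CODES.get(gender, "")
    return cleaned


def clean_all(records):
    """Clean every record and keep the first copy of each student."""
    seen = set()
    result = []
    for record in records:
        cleaned = clean_record(record)
        key = (cleaned.get("district_id"), cleaned.get("student_id"))
        if key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result
```

Writing the rules down as code is one way to make the process formal and repeatable, so the same steps apply to every submission.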

It’s also critical to “expect a lot more data, maybe even two to three times more” than you thought you would need, he said. For instance, using external data can be very helpful, but if you do, it “will exponentially increase the amount of your data.” Gabrielson said he is taking the time to study independent reviews of various data-warehousing technologies before making final decisions for Georgia.— CG
