A new lightweight version of NYC Human Resources Administration data warehouse gives users fast access to information they need.
In New York City, the Human Resources Administration employs about 15,000 people who deliver Medicaid and other services to more than 3 million residents — and its enterprise data warehouse stores so much information that employees can potentially sift through hundreds of millions of rows inside just one table in a spreadsheet.
“Our largest table had more than three-and-a-half billion paid Medicaid claims,” said Anna Stern, assistant deputy commissioner of new initiatives in management information systems. “And the public assistance and food stamp data comes out of the state system, which is a mainframe system, so we’re dealing with old, complex systems [and] very big data.”
But that’s not the case any longer, thanks to Stern and her colleagues. Because the enterprise warehouse’s data groups often are much larger than employees need, they created dataSmart, a compact version of the warehouse that only contains the most needed data for staff members, while presenting users with a simpler interface.
The massive enterprise data warehouse had been around since 2001, so developers learned over time what the most popular data sets were to include in dataSmart. “We took everything we’ve learned over the years and built a streamlined model,” said Data Warehouse Director Jane Neimand. “We’re very pleased with it.”
DataSmart, which went live with Phase I on Feb. 14, 2012, is housed on the same Oracle database as the enterprise data warehouse, and only department employees can access the systems. Both versions are side-by-side, and users must determine which to query for their needs.
Most people use the Oracle Business Intelligence Discoverer tool to access dataSmart, but other tools like SQL are available. In Discoverer, users query the system and receive the data in spreadsheet tables. DataSmart tables contain fewer rows than those in the data warehouse. “The step-by-step is going to be sort of the same, but each step is just simpler,” Neimand said. “[DataSmart] only has what they absolutely need. It’s not all this other data that they probably don’t need.”
For instance, the enterprise data warehouse holds more than 20 years of data history, which amounts to 1,500-plus data elements with tables containing hundreds of millions, and in one case billions, of rows. On the other hand, dataSmart holds three years of data history and only the most requested data elements.
Stern, Neimand and their co-workers work continually to keep dataSmart fresh. They rebuild the database from scratch each year to house only the past three years’ worth of information. “One advantage of doing that is that it keeps it small,” Neimand said, “but another advantage is that, let’s say some data element becomes important that was not important the last time we did a build, that gives us the opportunity to build in that new element.”
Education and experience have taught them which elements are the most important to include. Roughly 10 years ago, Stern learned in Data Warehousing Institute courses that only about 15 percent of a warehouse’s data is actually used. The administration therefore started developing dataSmart as a means to offer the most desired data more conveniently.
“We said, ‘What if we could come up with something simpler?’ And that’s what dataSmart is,” Stern said, noting that she believed many users lacked the analytical skills needed to access the original system effectively. “It is a distillation of everything we have learned from 10 [or] 11 years of running this data warehouse in this agency. Our premise is, small is beautiful.”
Phase I of dataSmart includes data from the welfare administration system; later phases will incorporate additional information, including GIS, supplementary security income and Medicaid data. Stern estimated that Phase II will go live on Labor Day, followed by Phase III at an as-yet-undetermined later date. The money for the phases came from the department’s budget, and Stern estimated that it will cost an additional $200,000 to complete the second phase.
Today, Stern oversees training programs that employees must complete before they can query either database, and the learning curve for dataSmart certification isn’t steep. Users complete four three-hour training sessions to learn dataSmart, compared to the nine two-hour sessions for the enterprise data warehouse. A division liaison must nominate someone for training before that person can be eligible for courses.
Business analysts are currently trained on Oracle’s Discoverer tool, but the company eventually will migrate the department to the Oracle Business Intelligence Enterprise Edition (OBIEE) platform. Stern and Neimand are unsure when exactly that will take place, but they wouldn’t be surprised if it took about a year to convert Oracle’s database modules to the new system. For example, Discoverer has files called “workbooks” that contain worksheets displaying data retrieved from the database, and migrating users’ personal workbooks to OBIEE will take a lot of work.
“Oracle cannot predict how well the Discoverer workbook will convert to OBIEE, so we’re going to try to get people to get rid of the workbooks they don’t need,” Stern said. “But that’s something that could take a lot of time.”
But no matter what happens, she and her team will work to ensure that dataSmart continues to deliver information conveniently. They’re even using Google’s search capabilities for inspiration.
“My goal would be to get as close to Google as we could get in terms of querying,” Stern said. “Google’s figured out how to give you all the information in the world with one little box on the screen. [It] makes all sorts of information that is complex acceptable to normal people. That’s the goal of dataSmart relative to the agency’s data.”