Splunk, a San Francisco-based big data company, specializes in capturing and analyzing machine data generated by websites, applications, servers, networks, mobile devices and sensors. Public CIO talked to CIO Doug Harr about how customers use the technology, how it will evolve, and the tension between big data and privacy.
The name Splunk came from spelunking, or cave exploration. We go into the data and find that nugget of gold. The company was founded to help IT operations people running data centers at big companies like Apple and Disney. When something went wrong, they typically looked for errors in the machine data created by servers, storage devices and other data center systems. When they found an error in the data logs, they searched the Internet to see how other people had handled the same issues. The founders of Splunk realized that if you collected all that machine data into one big storage mechanism and put analytics on top of it, you could do a much better job of understanding what is happening with the machines in your data center.
Then the technology moved into security because people realized that putting a security device at the edge of your network wasn’t going to protect you from all threats. Hackers were getting into the network and they were using advanced persistent threats. So, in some ways, we were the happy recipient of the idea that this same machine data that lets you improve data center performance can help you monitor the security profile of your systems. You need to be looking for abnormal behavior.
If you are looking to adopt big data technologies, the only barrier is approaching it the wrong way. We have best practices showing that successful customers start with a defined project and something specific they want to accomplish. There is a lot of talk in the market about just grabbing every possible bit of data you can and putting it somewhere in case you ever need it. I don’t subscribe to that. I say look at your data sources and figure out what data is likely to have real meaning.
If you are running a public website, a very easy example is to pull all of the Web logs from machines that are part of the site to make sure that people logging in are who they say they are. That data also will show you the most popular parts of your site and let you monitor the customer experience. A lot of agencies have started with us just to make sure they’re not losing data. Then you move up and move out — you do more proactive monitoring or more business analytics.
You are going to have some surprising moments where you realize, for example, that because you can see when someone logs in or out, you can tell how long it takes to process a traffic ticket. You just keep building on those discoveries.
Our customers are going beyond IT security and data center excellence. They’re looking at other sources of machine data. And new types of institutions are looking at data produced by other kinds of machines beyond IT. Utilities, for instance, are looking at water delivery and monitoring utility infrastructure. This is really the industrial Internet. As all of the things in our daily lives become IP enabled, they give off the same kind of data that our traditional IT systems always have. Nest Learning Thermostats, which evaluate whether you’re home, learn your preferences and manage energy costs, use Splunk. They are taking the output of that thermostat and doing the same kind of analytics we do in a data center, but using that information to understand the human experience and correlate between different sensors and systems to give a more complete view of a customer.
We pride ourselves on having a security model in our product suite. Establishing roles and access levels, segregating the kind of data you collect, and managing that data in a secure environment are very important parts of big data. I’ve watched with interest the recent debate over the federal government’s data collection efforts and the discussions about metadata versus actual data. Those questions are very germane.
We have an email management app that doesn’t look at the content of your email messages, but it does know what the traffic patterns are. You could make an argument that even doing clever correlation of where email is coming from and going to is very private information. So that’s all part of the mix and it’ll continue to be a focal point as everyone develops the mores and best practices for dealing with this kind of data. You have to have a security model or you don’t have a place to start.
I think the advent of this controversy is actually good for the market. It’s educating the market about key considerations like who gets what kind of data. It’s very important not to rush into these projects without considering the privacy and security implications. I think it helps us and makes some of those conversations easier in some ways when people are aware of the debate.