The Role of the Data Scientist
What Skills Are Required, and How Can Organizations Benefit?
With the increasing amount of data being collected by organizations, the role of the data scientist has emerged to aid in analysis. What's unique about the role and what job functions does it entail?
Phil Neray, head of security intelligence strategy and marketing for Q1 Labs, an IBM company, says the role of the data scientist emerged a few years ago as a way to analyze structured (databases) and unstructured (Facebook posts, tweets) data.
"The idea is you get all this data together and, using statistical methods, using various other mathematical approaches, find ways to analyze the data and extract intelligence from it," he says in an interview with Information Security Media Group's Eric Chabrow [transcript below].
The role evolved from mathematically-oriented jobs, and at first served as a business function. But recently, security intelligence has become a key role for the data scientist, Neray says, as a result of the increasing threats and vulnerabilities to organizations, such as breaches.
Data scientists need to be able to understand algorithms, data structures, optimization and parallel distributed computing, Neray says, in order to conduct large-scale analysis within an organization.
"The issue is that many U.S. organizations have some type of data - log data - in their environments," such as evidence of a breach, Neray says, "that's buried, but they have no efficient way of analyzing all of this information that they already have and finding what I would call the needle in the haystack."
That's where the role of the data scientist comes in, Neray explains. By analyzing data across the entire environment, data scientists can find correlations through analytics to discover what security event is going on and how to stop it.
In the interview, Neray:
- Delineates the skills of data scientists;
- Defines their responsibilities;
- Explains the evolution of the occupation.
Neray was a vice president at Guardium when IBM bought the database security provider in 2009. Earlier in his career, he served as a senior director at security provider Symantec. He started his career as a field operations engineer with Schlumberger working on remote oil rigs in South America. Neray has worked in a variety of fields, including database security; security patch and configuration management; 3D animation and special effects; and parallel supercomputing.
Data Scientist Defined
ERIC CHABROW: The term is new to me. What's a data scientist?
PHIL NERAY: A data scientist is a new role that's emerged over the last couple years to describe someone who specializes in analyzing large amounts of data. It can be structured data, the kind of data you would find in a database, such as a database of all your customers. It can be unstructured data, such as analyzing all the tweets that have been posted about your company and figuring out what the trends are with respect to what people are saying about your company. It can be mathematical data. The idea is you get all this data together and, using statistical methods, using various other mathematical approaches, find ways to analyze the data and extract intelligence from it. The first example was business intelligence, and then more recently we've seen security intelligence.
CHABROW: Did the data scientist position evolve from other types of IT or business jobs?
NERAY: It evolved mainly from mathematically-oriented jobs. It wasn't IT-specific. It was typically someone who had some advanced education in the areas of statistics, optimization, mathematical modeling. I'd say companies like Google were probably among the pioneers of taking massive amounts of data - in the case of Google, all of the click information that we generate across the Internet - and using that to extract information; for example, about which ad you'd be most likely to click and see. That would probably be an early example of how, from an IT point-of-view, data scientists began using all this data to solve an IT type problem.
CHABROW: What are some of the key skills a data scientist has?
NERAY: I think the key skills should be what you would expect. It would be mathematically-oriented folks who understand algorithms, who understand data structures, who understand optimization and then from sort of an architecture point-of-view understand parallel distributed computing, because in order to be able to perform this large-scale analysis, you can't do it using traditional architectures. You have to have parallel computing; you have to have typically distributed data repositories so it's not all stored in one place and you need ways to sort of build those types of environments from tools that are already out in the world.
Evolution Towards Security
CHABROW: You were saying that initially it dealt with business intelligence, but there's been an advancement or an evolution towards security. Can you explain that, please?
NERAY: The problem that most organizations have is that they don't even know they've been breached. Recently, there was a statistic from the Data Breach Investigations Report that showed that 85 percent of breaches are undetected by the breached organization. In other words, they've been breached. They don't know they've been breached. They only find out through some third party, such as a consumer whose credit card has been used in a fraudulent way, or the FBI or another law enforcement agency that finds out that they've been breached. For example, that was the case in the U.S. Chamber of Commerce, [which] found that its environment had been breached for over a year by hackers in China who were stealing sensitive information about the strategic plan. And they didn't even know they'd been breached.
The issue is that many U.S. organizations have some type of data - log data - in their environments and that the information that they've been breached is buried in this data, but they have no efficient way of analyzing all of this information that they already have and finding what I would call the needle in the haystack, which is a combination of events or log data or something in the network traffic information that they have that would let them know that something's going on that shouldn't be going on.
Security intelligence is taking all of that information, the log data, information about - for example - transactions that are going on in their environment, even in the case of a bank, ACH data that they're using, and combining it with other information such as failed log-ins on certain servers or firewalls that are blocking certain traffic, taking all that information and using analytics to correlate it and find the patterns that would tell you that something's going on that you need to investigate or stop.
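As a rough illustration of the kind of correlation described here, the minimal Python sketch below flags source addresses that rack up several failed log-ins close in time to a firewall block from the same address. Everything in it is an assumption made for the example: the sample events, the field layout, the `correlate` function, and the thresholds are hypothetical and do not represent Q1 Labs' product or any particular log format.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample events; real log sources and field names will differ.
# Failed log-ins: (timestamp, source address, target server)
FAILED_LOGINS = [
    ("2012-01-10 02:14:05", "203.0.113.7", "db-server-1"),
    ("2012-01-10 02:14:30", "203.0.113.7", "db-server-1"),
    ("2012-01-10 02:15:10", "203.0.113.7", "db-server-2"),
    ("2012-01-10 09:00:00", "198.51.100.23", "mail-server"),
    ("2012-01-10 15:30:00", "192.0.2.55", "db-server-1"),
]
# Firewall blocks: (timestamp, source address)
FIREWALL_BLOCKS = [
    ("2012-01-10 02:13:40", "203.0.113.7"),
    ("2012-01-10 11:00:00", "192.0.2.55"),
]


def parse(timestamp):
    """Turn a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")


def correlate(failed_logins, firewall_blocks,
              min_failures=3, window=timedelta(minutes=10)):
    """Return source addresses with at least `min_failures` failed log-ins,
    each falling within `window` of a firewall block from the same address."""
    blocks_by_source = defaultdict(list)
    for timestamp, source in firewall_blocks:
        blocks_by_source[source].append(parse(timestamp))

    failures_near_block = defaultdict(int)
    for timestamp, source, _server in failed_logins:
        when = parse(timestamp)
        blocks = blocks_by_source.get(source, [])
        if any(abs(when - blocked) <= window for blocked in blocks):
            failures_near_block[source] += 1

    return sorted(source for source, count in failures_near_block.items()
                  if count >= min_failures)


suspicious = correlate(FAILED_LOGINS, FIREWALL_BLOCKS)
print(suspicious)
```

With the sample data above, only 203.0.113.7 is flagged: the other two addresses either have no firewall block or have one that is hours away from the failed log-in. A rule like this is deliberately simple; in practice the thresholds and the data sources would be tuned to the specific environment.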
CHABROW: Would the data scientist be the one on staff to sort of determine what kind of data would be used to determine this?
NERAY: Typically it would be a multidisciplinary team. The data scientist's job would be to apply mathematical tools to analyzing this data and to build some type of architecture to efficiently analyze it from an IT point-of-view. But you're going to need other folks on the team as well that are more subject matter experts in the area of security, people who can tell you which data you want to get, how you're going to get it from various devices, servers or firewalls or IDS/IPS systems or switches in your environment, and people who are going to understand your business, too. You typically need business operations people that can also help you figure out how you're going to get data from the business side that you can then combine with data from the IT side to find examples of breaches or insider fraud, for example, from all this data you've collected.
Evolution of the Position
CHABROW: The people who become a data scientist - as you say - a lot of them have math backgrounds. Is this the first time they've been in IT organizations or is this something they've had other responsibilities using their math background and they've evolved into this position?
NERAY: I would say it could come from either one. For example, on Wall Street there are lots of mathematically-gifted folks working on optimizing trading algorithms, for example. Those same types of skills could be used here, but it does certainly help and is probably required to have some knowledge of computer architecture as well because you're going to need to create an efficient parallel approach doing this.
Now, some people are on the leading edge. They're using open-source tools to perform this analysis. But then there are security vendors like IBM with Q1 Labs that have built a parallel distributed architecture for you, and so there what's required more is an understanding of the different sources of data in your environment and the context for analyzing the information. It's less about building the architecture because the architecture is off-the-shelf and more about understanding the data and being able to build the correlation rules that are specific to your environment.
Academic Training, Certifications?
CHABROW: Are you aware of any kind of academic training or any other kind of training for people who want to become data scientists?
NERAY: I'm not aware of any specific courses of study, but again, I think a data scientist's training would involve understanding mathematical methods like linear algebra, numerical analysis, statistical analysis, optimization, machine learning, data mining - these are all examples of various fields that you need to understand to be able to apply it to a real-world situation.
CHABROW: People who have this title, I'm assuming they probably have graduate degrees in math and/or computer science?
NERAY: It's certainly helpful to have both the computer science and the math background, and that could be an advanced degree. Or it could simply be lots of experience actually doing it in real-world situations.
CHABROW: Are you aware of any kind of certification for this?
NERAY: It's a very new field. I'm not aware of any certifications yet. But I assume it will come out soon.
CHABROW: Do you think most employers know they need a data scientist?
NERAY: I think it depends on the industry. We're starting to see it in a couple of areas. Security is definitely one of them, and we're starting to see it perhaps more in the financial services area as leading edge. But it will soon spread to other industries as well. And we're also seeing it in the consumer marketing field, where people want to analyze lots of unstructured data, for example, as far as tweets and Facebook postings and things like that, to look at trends in terms of what people are saying about your organization.