Data science is a relatively recent term and there is an ongoing discussion about its definition and how it relates to various other fields. At its core, it is an interdisciplinary field that combines elements of data analysis, statistics, and machine learning to turn structured and unstructured data into useful knowledge and insights.
In the day-to-day practice of organizations, a data scientist is someone who works to extract meaning from and interpret data, using tools from statistics and machine learning, as well as common sense and judgement. A data scientist is typically involved in a variety of activities. Some are strictly technical, like data collection, preprocessing, and analysis; others require human intuition. By understanding the domain of the problem, identifying the information that is really relevant to solving it, and giving recommendations based on the resulting analysis, data scientists help bridge the gap between engineering and business.
That process is often messy because reality is messy! Representing it meaningfully with numbers is a big challenge and often requires a fluid, iterative approach. That might look like this:
There’s a common starting point (a question or a problem) and a desired outcome (insights), but while there are some common approaches to the steps in between, in reality the process isn’t linear. Most projects will go through a few iterations of one, some, or all of the steps before conclusions can be drawn. Feature engineering (which means something very different to data scientists than it does to programmers!) is a central part of the process, and one you will almost always iterate on many times before arriving at an approach that works for your problem.
What are features and why are they important?
In machine learning, a feature is an individual measurable property or characteristic of the object of analysis. This is equivalent to the statistical concept of a variable. Because data in the wild is chaotic and noisy, producing a set of well-crafted features can make a huge difference when it comes to model performance and predictive power. A perfect model will generate useless noise if the data that goes into it is not thoughtfully processed.
Feature engineering is the core of the data science life cycle. Manipulating raw data to produce features that represent the problem at hand is what lets us feed our analytical tools meaningful input. This is a fundamental step in the data science process that allows us to extract information, highlight patterns, and add domain knowledge. It's the best way to improve a machine learning model, but it’s rarely formalized. This is also where most of the magic happens.
What does this mean in practice?
Let’s say we have a bunch of photos of objects and we want to train a model to recognize when the object in the photo is a motorbike. In ML this is a classification problem, where, for each object/photo, we want the algorithm to assign one of two labels: motorbike or non-motorbike. The problem of feature engineering, in this case, comes down to finding a set of features that describe the photo in a way that makes it easier for the model to assign the label correctly.
Let’s assume that we want to limit the complexity of our model and only use two features. To a computer, each of the photos is an array of numbers representing the color of each pixel, so one naive approach might be to pick two pixels, say the first two in the array, and use them as features for our model. This is equivalent to asking, for example, given that these two pixels are a light shade of gray, is there a motorbike in this photo? That’s a pretty hard question for a person to answer, let alone a computer! Using this set of features, any learning algorithm will struggle to accurately classify our images.
An alternative approach might be to create (engineer!) two binary features: wheels is true when the object in the photo has wheels and false otherwise, and handlebar, similarly, is true when a handlebar appears in the photo. Now the question we’re asking is given that there are wheels and a handlebar in this photo, is the object a motorbike? This is a much easier question to answer, and while the model might not always assign the correct label, training it on this set of features will yield much more accurate results.
Both examples use only two features, but the second set of features performs better because it embeds information that is more relevant to discriminating between the two classes: intuitively, knowing that some pixels are gray is way less meaningful than knowing that there are wheels or a handlebar.
This example shows how we can obtain wildly different results with the same number of features. By using our domain knowledge (motorbikes have wheels! And handlebars!), we were able to create new features that efficiently represent the problem space, and increase the predictive power of our model.
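To make the contrast concrete, here is a minimal sketch using scikit-learn on synthetic data. Everything here is made up for illustration: the “pixel” features are random numbers unrelated to the label, while the engineered wheels/handlebar features are noisy indicators of it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
is_motorbike = rng.integers(0, 2, n)  # ground-truth labels: 1 = motorbike

# Naive features: two arbitrary pixel intensities, unrelated to the label.
pixel_features = rng.uniform(0, 255, size=(n, 2))

# Engineered features: noisy indicators of wheels and a handlebar.
# Motorbikes always have both here; other objects occasionally do too.
wheels = (is_motorbike | (rng.random(n) < 0.2)).astype(int)
handlebar = (is_motorbike | (rng.random(n) < 0.1)).astype(int)
engineered_features = np.column_stack([wheels, handlebar])

model = LogisticRegression()
print("pixels:    ", cross_val_score(model, pixel_features, is_motorbike).mean())
print("engineered:", cross_val_score(model, engineered_features, is_motorbike).mean())
```

On this toy data, the pixel-based model hovers around chance (roughly 50% accuracy), while the same model trained on the engineered features scores far higher.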
Some examples
Because there is no formal definition of feature engineering, different data scientists will have slightly different opinions of what counts as part of the process and what doesn't. For example, some will include imputation of missing data as one of the steps, while others will put it under data cleaning.
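Wherever you draw that line, imputation itself can be quite simple. A minimal sketch with pandas, using hypothetical age and city columns:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31],
                   "city": ["Rome", "Oslo", None, "Rome"]})

# Fill numeric gaps with the column median, categorical gaps with the mode.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```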
A simple example of feature engineering is the decomposition of a categorical attribute into as many binary features as categories, e.g. a color attribute having values red and blue can be decomposed into two binary features, is_red and is_blue.
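A minimal sketch of that decomposition with pandas, using the color example above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One binary column per category of the original attribute.
dummies = pd.get_dummies(df["color"], prefix="is").astype(int)
print(dummies)
#    is_blue  is_red
# 0        0       1
# 1        1       0
# 2        0       1
```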
Similarly, datetime values can be decomposed into their constituent parts, like hour_of_the_day or day_of_the_week. Doing so may help discover temporal patterns and relationships with other features.
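Again with pandas, a minimal sketch (the timestamps are made up):

```python
import pandas as pd

timestamps = pd.to_datetime(pd.Series(["2021-03-01 08:15", "2021-03-06 22:40"]))

# Decompose each datetime into its constituent parts.
features = pd.DataFrame({
    "hour_of_the_day": timestamps.dt.hour,       # 8, 22
    "day_of_the_week": timestamps.dt.dayofweek,  # Monday=0 ... Sunday=6
})
```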
For text data, a common solution is to represent each document as a matrix of word counts, but one could also choose to turn it into a presence/absence matrix (1 if word w appears in a document, 0 if not).
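A minimal sketch of both representations with scikit-learn's CountVectorizer, on a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the cat"]

# Word-count matrix: each cell holds how often word w appears in the document.
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())    # [[1 0 1 1] [2 1 1 2]]  (columns: cat, on, sat, the)

# Presence/absence matrix: 1 if word w appears in the document, 0 if not.
presence = CountVectorizer(binary=True).fit_transform(docs)
print(presence.toarray())  # [[1 0 1 1] [1 1 1 1]]
```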
So how do you come up with features?
Coming up with features is difficult, time-consuming, and requires expert knowledge. "Applied machine learning" is basically feature engineering. — Andrew Ng
This list of examples is, of course, not exhaustive. There are many ways to tackle any problem and no single, unified approach that will guarantee the optimal set of features for a given use case. Which features to extract and how to combine them depends on the domain of the problem, the kind of insights you want to produce, and what data is available. Talking to experts, practice, and experimentation are all excellent ways to understand the problem and approach feature engineering.
Like many other things in data science, this is a fluid, iterative process involving a lot of trial and error. Often, you will return to revise your features as more data becomes available, priorities shift, or the insights originally produced by your models seem less accurate in practice. It may look like magic, but it really is about having curiosity for the elements that best describe and summarize a problem, and developing intuition for how to represent them as numbers.