Data scientists are sometimes handed a dataset and told to start developing machine learning. All too often this is an ill-conceived directive stemming from a lack of understanding of the rigor involved in modeling data. This type of direction, while well-intentioned, is commonly driven by the excitement and promise of a successful machine learning effort. Data science and data modeling are best served by business partnership and representation throughout the data science life cycle. Breakthroughs occur when the core competencies of business acumen, deft analysis, and technical skill are brought to bear on the most formidable challenges.
Professionals and students alike are starting to learn more about the various ways that machine learning may be applied and the numerous use cases that are already leveraging the technology. With the amount of data at our disposal and the ability to harness massive amounts of computing power, new applications are being discovered at a rapid pace, and discussion of the subject is erupting. Although the topic is discussed frequently, few people have insight into the complete workflow of a machine learning project.
Five steps are essential before beginning to build a machine learning model. These steps are not just for the technical audience, but for anybody involved with work that the data science team is trying to tackle.
Develop business understanding
The first step in the workflow is arguably the most important one, and without it you will find yourself aimlessly searching for trends within the data. Instead, before diving too far into the data, it is important to propose possible problem statements along with a high-level approach. The data science team should work with leadership, clients, and any other stakeholders to recognize and understand the business challenges. Begin to formulate questions that define the business goals you are attempting to achieve with data. Once the objectives have been established, the team should translate them into technical terms and determine how success will be measured.
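As one illustration of stating "how success will be measured" in technical terms, the sketch below assumes a binary classification objective for which stakeholders have agreed on minimum precision and recall. The function names, the thresholds, and the sample labels are all hypothetical; other objectives would call for different metrics.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def meets_success_criteria(y_true, y_pred, min_precision, min_recall):
    """Check predictions against thresholds agreed with stakeholders (assumed values)."""
    precision, recall = precision_recall(y_true, y_pred)
    return precision >= min_precision and recall >= min_recall


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
print(meets_success_criteria(y_true, y_pred, min_precision=0.7, min_recall=0.7))  # True
```

Writing the criterion down as a check like this before any modeling begins gives the business side and the technical side the same definition of success.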
Gather data
The next step is to acquire the data needed to tackle the problem at hand. This step might seem straightforward, but it often becomes the most challenging part of a machine learning project. If data is internal to the organization, gathering data could be as simple as querying or extracting from a system database. In many instances, external data must be sought out and even blended with data you already have. Luckily, our world is moving toward an open data practice, which means extensive amounts of data are available to be freely used by anyone. Open data can typically be accessed via APIs or by downloading files from the internet. Other methods of data collection include scraping webpages using web crawler tools, extracting information from unstructured files, or collecting survey data.
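A minimal sketch of querying internal data and blending it with external data might look like the following. It uses only the Python standard library: an in-memory SQLite database stands in for an internal system database, and a CSV string stands in for a downloaded open-data file. The table, column names, and values are invented for illustration.

```python
import csv
import io
import sqlite3


def load_internal_sales(conn):
    """Query the (hypothetical) internal sales table into a list of dicts."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT store_id, revenue FROM sales ORDER BY store_id"
    ).fetchall()
    return [dict(row) for row in rows]


def load_external_regions(csv_text):
    """Parse externally sourced CSV text into a store_id -> region lookup."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["store_id"]): row["region"] for row in reader}


def blend(internal_rows, region_by_store):
    """Left join: keep every internal row and add the region when it is known."""
    return [
        {**row, "region": region_by_store.get(row["store_id"])}
        for row in internal_rows
    ]


# Stand-in for an internal system database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(1, 120.0), (2, 80.5), (3, 42.0)]
)

# Stand-in for a downloaded open-data file.
external_csv = "store_id,region\n1,North\n2,South\n"

blended = blend(load_internal_sales(conn), load_external_regions(external_csv))
for row in blended:
    print(row)
```

Store 3 has no match in the external file, so its region comes back as `None`. Gaps like this are common when blending sources and feed directly into the cleansing step.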
Cleanse and assess the reliability of the data
It is rare to find datasets that are perfect, where there are no missing values and each data point can be assumed to be correct. Once data is acquired, it must be cleansed. This can mean a number of things, including changing the data type of columns, manually manipulating records, and determining how to handle null or missing values. During this step, it is important to evaluate the reliability of the data. Depending on the type of data or its source, some features may follow an inconsistent pattern or contain faulty values or records. This will likely lead to additional cleansing, and a tailored approach to manipulating the data must be devised. There is no fix-all solution that can be applied.
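The sketch below shows one possible cleansing pass over a small set of invented records: converting strings to numbers, treating implausible values as missing, and filling gaps with the median. The missing-value markers, the plausible age range, and the choice of median imputation are assumptions made for this example, not a general recommendation; as noted above, the right treatment depends on the data.

```python
from statistics import median

# Hypothetical raw records, as they might arrive from a CSV export.
raw_records = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "61000"},
    {"age": "29", "income": "N/A"},
    {"age": "-5", "income": "48000"},
]

MISSING_MARKERS = {"", "N/A", "null"}  # assumed conventions for missing data


def to_number(text):
    """Convert a raw string to float, or None when it marks a missing value."""
    text = text.strip()
    if text in MISSING_MARKERS:
        return None
    return float(text)


def cleanse(records):
    # 1. Fix data types: every field becomes a float or None.
    typed = [{key: to_number(value) for key, value in r.items()} for r in records]

    # 2. Reliability check: an age outside 0-120 is treated as missing.
    for r in typed:
        if r["age"] is not None and not (0 <= r["age"] <= 120):
            r["age"] = None

    # 3. Handle missing values: fill each column with the median of observed values.
    for column in ("age", "income"):
        observed = [r[column] for r in typed if r[column] is not None]
        fill = median(observed)
        for r in typed:
            if r[column] is None:
                r[column] = fill
    return typed


cleaned = cleanse(raw_records)
for row in cleaned:
    print(row)
```

Here the blank age and the impossible age of -5 are both replaced with 31.5, the median of the two plausible ages, and the "N/A" income becomes 52000.0.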
Explore the data
Analysis can begin only after you have acquired and cleansed your data. Applying exploratory data analysis will move you closer to understanding your data. Rich exploration will find trends, detect anomalies, discover relationships between variables, recognize the distribution of the data, and uncover any other information that might be helpful. Recall the main questions that were outlined in the first step. These will provide proper direction and aid in extracting the right insights. Exploring the data should include calculating descriptive statistical measurements for each variable and plotting the data to discover findings that might not otherwise have been noticed. As you continue to analyze the data and begin to understand it at a detailed level, new questions will come about.
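As a rough sketch of what calculating descriptive measurements and looking for anomalies and relationships can involve, the example below uses only the Python standard library. The two variables and their values are invented, and the 1.5 x IQR rule for flagging outliers is one common convention chosen for illustration. In practice this step is usually paired with plots.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Descriptive statistics for one numeric variable."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical observations for two variables.
ad_spend = [10, 12, 11, 13, 12, 14, 13, 60]
revenue = [105, 118, 110, 131, 121, 139, 128, 610]

print(describe(ad_spend))
print(iqr_outliers(ad_spend))  # [60]
print(pearson(ad_spend, revenue))
```

The flagged value of 60 is the kind of finding that raises a new question: is it a data entry error to send back to the cleansing step, or a real event worth understanding?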
Reevaluate the problem statement and develop an approach
The final step before beginning the model-building phase is to reassess the business goals that you targeted in the first step to determine if they are feasible. It is possible that you’ll have to frame the problem differently after further analysis or that you’ve exposed new potential solutions. Summarize your findings from the previous step and break down what it all means and how this information can be useful when thinking about building a model. Often, data science teams rely on their own business domain knowledge to determine which features or attributes to include in their data modeling. While they may get it right, the desired end state should be a close partnership with business stakeholders that drives mutual understanding. With a solid understanding of the problem that you are aiming to solve and its contributing factors, develop an approach to the solution. Answer critical questions, such as whether it is a classification or regression problem, which types of algorithms are appropriate to try, and which features should be considered in the model or newly engineered. Paint a clear picture to make the model-building process as well-defined as possible.
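The classification-or-regression question can often be informed by looking at the target variable itself. The helper below is a deliberately crude heuristic, written only to illustrate the idea: the function name and the `max_classes` cutoff are hypothetical, and the final framing should come from the business question rather than from a rule like this.

```python
def suggest_problem_type(target_values, max_classes=10):
    """Rough first guess at problem framing based on the target variable.

    Non-numeric targets, or numeric targets with only a few distinct values,
    suggest classification; otherwise regression is suggested.
    """
    distinct = set(target_values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if not all_numeric or len(distinct) <= max_classes:
        return "classification"
    return "regression"


# Hypothetical targets for illustration.
print(suggest_problem_type(["churn", "stay", "stay", "churn"]))  # classification
print(suggest_problem_type([0, 1, 1, 0, 1]))                     # classification
print(suggest_problem_type([i * 1.5 for i in range(50)]))        # regression
```

A numeric target with few distinct values, such as a 1-to-5 rating, could reasonably be framed either way, which is exactly the kind of decision to settle with stakeholders before model building starts.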
As outlined here, there is an extensive amount of preliminary work to do before beginning to build models. Devoting plenty of time, thought, and planning to these steps is key to building a well-performing model. Each step is part of a whole, and if one is skipped the rest will not be as effective. It is therefore essential not to overlook any of them. Proper preparation will lead to seamless execution.