I had a chance to work on a predictive analytics project for a US car manufacturer, XXX (I will keep the name of the company confidential). The goal of the project was to evaluate the feasibility of using Big Data analytics solutions to solve different operational needs in manufacturing. The objective was to determine a business case and identify a technical solution (vendor).
Our task was to analyze production history data and predict car inspection failures on the production line. We obtained historical data on defects on each car, how the car moved along the assembly line, and car-specific information like engine type, model, color, transmission type, and so on. The data covered the whole manufacturing history for one year. We worked on this project as a team of 3: Matous Havlena, Tim Ojo, and Akin Alao. Following is our final report and presentation.
This project was chartered with two objectives. The first objective was to determine the feasibility of using big data analytics to solve the problem of predictive vehicle inspection, which involves predicting whether a vehicle will pass or fail the vehicle quality inspection based on its production history. The second objective was to evaluate IBM BigInsights and Datameer Analytics Solution (DAS) and determine which tool we would recommend for this type of analytics project.
Early in our research, we discovered that both BigInsights and DAS contained tools for descriptive analytics but required third-party tools for predictive analytics; this led us to introduce IBM SPSS Modeler for predictive analytics. The two products took different approaches to integrating with a third-party tool, with IBM providing tight integration among their products. This became an important factor in our recommendation of a solution suite for XXX.
We used CRISP-DM, an industry-standard approach to data mining, in solving the problem. This included understanding the business processes and requirements, understanding and preparing the data, and the actual creation of a predictive model for use in predicting vehicle inspection passes or failures. We were able to build a model in SPSS Modeler that achieved an accuracy of 85.4% over the initial set of data we received from XXX.
SPSS Modeler is able to integrate directly with BigInsights such that the Modeler interface is responsible for the design of the model while all the data storage and processing takes place in BigInsights. DAS takes a different integration approach to allow the use of predictive models in its solution. DAS uses the Zementis Universal PMML Plug-In to allow models created in any third-party analytics tool, such as R, SAS, SPSS and KNIME, to be imported into DAS as a function and executed. This integration approach has a disadvantage in that, unlike BigInsights, DAS does not provide any support for the model creation process. DAS also provides no support for managing the big data infrastructure. The model we built in SPSS Modeler was exported to PMML format and accompanies this document together with the original SPSS stream files (.str format).
As a result of our research we demonstrated and concluded that big data analytics tools can be used to solve the problem of predictive vehicle inspection. We also recommend IBM InfoSphere BigInsights over Datameer Analytics Solution because of its Hadoop management capabilities and seamless integration with an analytics solution (namely SPSS Modeler).
1. Project Charter
The Big Data Analytics project was chartered with the goal of evaluating the feasibility of using Big Data analytics solutions to solve different operational needs at XXX. A second goal was to evaluate two Big Data analytics products and recommend one for use at XXX.
The specific project that this report covers is the problem of Predictive Vehicle Inspection. The idea behind this project is to evaluate the ability of Big Data analytics tools to predict car inspection failures from the production line based on the vehicle production history. Possible historical data points that could be used in the prediction are characteristics such as what employees worked on the vehicle, time of day, supplier data, what defects occurred on the vehicle during the production process, etc.
This project was researched by Matous Havlena, Tim Ojo and Akin Alao. The project spanned an 8-week period, from October 7 to November 30, 2013. The two big data analytics tools to be evaluated were IBM InfoSphere BigInsights and Datameer Analytics Solution.
1.1 Project Approach and Proposed Solution
The type of data mining and analytics that deals with extracting information from data and using it to predict trends and behavior patterns is known as Predictive Analytics. In order to predict the probability of an outcome, a predictive model must be created or chosen. Technically, a model is a mathematical equation that determines the probabilistic relationship between data elements. Predictive models typically use classification algorithms to predict an outcome, which requires having a training set of data whose outcomes and variables are already known. For this project, we had production history data for vehicles produced from January 2013 to October 2013 that had either passed the vehicle quality inspection on the 1st try (a direct run) or required rework (failed on the first try). Based on the body status history, defect history, and vehicle information for the cars whose quality inspection outcome we already knew, our goal was to create a predictive model that can then be used to predict whether a new car introduced into the system will pass or fail on the first try.
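To make the "train on known outcomes, then predict" idea concrete, here is a minimal sketch in plain Python. It is not the SPSS implementation or the actual XXX data; the toy data and the single-threshold rule are invented purely to illustrate how a model is fit to a labeled training set and then used on new cars.

```python
# Minimal sketch of "training" a predictive model: learn a defect-count
# threshold that best separates direct-run cars (pass) from rework cars
# (fail) on a labeled training set. All data here is hypothetical.

def train_threshold(history):
    """history: list of (defect_count, passed_first_try) with known outcomes."""
    best_t, best_acc = 0, 0.0
    for t in range(0, max(d for d, _ in history) + 2):
        # Rule: predict "pass" when the car accumulated fewer than t defects.
        correct = sum((d < t) == passed for d, passed in history)
        acc = correct / len(history)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy training set: (number of defects, passed inspection on first try?)
training = [(0, True), (1, True), (2, True), (5, False),
            (7, False), (3, False), (1, True)]
threshold, accuracy = train_threshold(training)

def predict(defect_count):
    """Apply the learned rule to a new, unseen car."""
    return defect_count < threshold

print(threshold, accuracy)      # learned cutoff and training accuracy
print(predict(2), predict(6))   # predictions for two new cars
```

A real model such as a neural network or C5.0 decision tree works on hundreds of predictors rather than one, but the fitting loop above captures the same principle: choose the model parameters that best reproduce the known outcomes.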
There were two tasks that needed to be completed in the setup and initialization phase in order to lay the groundwork for the project: setting up the Big Data analytics environment and obtaining the data needed for analysis from XXX. The body status, defect, and vehicle order data was given to us as 3 separate text files. We worked with XXX to mask sensitive information such as VIN numbers and employee IDs using a hashing function. This gave us the ability to relate and join data from the 3 different files without having access to the true data values for those sensitive columns. The second pre-project task involved setting up the big data analytics environment. With Datameer Analytics Solution (DAS), we first needed to establish a Hadoop cluster locally or in a public cloud. We chose to build the Hadoop cluster locally, which took some time and effort. With IBM, on the other hand, we had BigInsights and Hadoop available in their Academic Skills Cloud, which meant setup and provisioning only took a few hours. For this reason we started with BigInsights and completed the research there before moving on to Datameer Analytics Solution (DAS).
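The masking approach can be sketched roughly as follows. The salt value and column layout below are invented for illustration and are not XXX's actual scheme; the point is that a keyed hash is deterministic, so equal values in different files hash to the same token and joins still work without exposing the real VIN or employee ID.

```python
import hashlib

# Sketch of masking sensitive columns with a salted hash: equal inputs map
# to equal digests, so rows from different files can still be joined on the
# hashed value. Salt and values are hypothetical.

SALT = b"project-specific-secret"

def mask(value: str) -> str:
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

vin = "1HGCM82633A004352"            # example VIN format, not a real vehicle
print(mask(vin) == mask(vin))        # deterministic: joins across files work
print(mask(vin) == mask("other"))    # different values stay distinct
```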
In the process of setting up and researching both IBM InfoSphere BigInsights and Datameer Analytics Solution, we discovered that neither tool provided much built-in predictive analytics functionality. However, we discovered that BigInsights provides tight integration with SPSS Modeler, an analytics workbench that allows users to build predictive models in a drag-and-drop interface by leveraging built-in algorithms and functions. Predictive models can be explored, designed, tested, and executed with SPSS Modeler as the front-end user interface, while all the heavy lifting of storing and processing the big data with the instructions and classification algorithms provided by SPSS Modeler occurs in BigInsights on the back end. We will see more of this design in the next section of the paper. Datameer Analytics Solution takes a different approach to allowing its users to extend its functionality to include predictive analytics. DAS allows users to import predictive models produced by predictive analytics tools such as SPSS, SAS and KNIME into the DAS function library. Users can then use these models in their DAS workbooks like any other DAS function, and the processing of the data occurs on the Hadoop cluster.
In the next two sections we will discuss our implementation and analysis of the IBM and Datameer offerings.
2. Implemented Solution – IBM SPSS and BigInsights
The following diagram shows the tools we used in our analysis.
IBM SPSS Modeler Client
is an extensive predictive analytics workbench that is designed to bring predictive intelligence to decisions. It provides a range of advanced algorithms and techniques and allows users to build predictive models in a drag and drop interface by leveraging built in algorithms and functions. Predictive models can be explored, designed, tested and executed with SPSS Modeler as the front end user interface.
IBM SPSS Modeler Server
is a necessary component that connects SPSS Modeler Client with SPSS Analytic Server.
IBM SPSS Analytic Server
allows analysts to perform predictive analytics over big data. It enables the predictive analytics platform to use data from Hadoop distributions to improve decisions and outcomes with the use of IBM SPSS Analytic Catalyst and IBM SPSS Modeler. You get big data analytics capabilities, including integrated support for unstructured or semi-structured predictive analytics in the Hadoop environment, that eliminate the need to move data, enabling optimal performance on large volumes of varied data without writing complex code and scripts. This scalable solution also supports popular Hadoop distributions such as IBM InfoSphere BigInsights, Cloudera, Hortonworks and Apache Hadoop.
IBM SPSS Analytic Catalyst
uses the power of SPSS Analytic Server to help accelerate analytics by identifying key drivers in big data. It automates portions of data preparation, automatically interprets results, and presents analyses in interactive visuals with plain-language summaries. It automatically discovers statistically interesting relationships in the data and helps analysts understand the data in the early discovery stage. It is designed for big data and massive-scale environments.
IBM InfoSphere BigInsights
is an analytics platform, based on open source Apache Hadoop, for analyzing massive volumes of unconventional data in its native format. The software enables advanced analysis and modeling of diverse data, and supports structured, semi-structured and unstructured content to provide maximum flexibility.
2.2 Solution in CRISP-DM framework
While working on our solution, we followed the CRISP-DM data mining methodology. CRISP-DM stands for Cross Industry Standard Process for Data Mining; it is a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.
CRISP-DM involves 6 stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
Following is our solution described in CRISP-DM stages.
2.2.1 Business Understanding
In the early stage of our project, we were invited to the XXX plant to set the objectives of the project and understand the business background and business success criteria. XXX showed us the manufacturing process, how quality is measured, what internal systems they use to track information about defects and car movements along the assembly line, and how their processes work from the perspective of quality and checkpoints. We were also introduced to their internal systems and manufacturing data. The overall quality strategy was communicated to us, so we were able to understand the manufacturing and quality process from a high-level perspective and therefore focus on the important factors.
Business Goals
Reduce costs associated with poor car quality and increase manufacturing process efficiency.
Data Mining Goals
Build a predictive model that predicts whether a particular car is going to pass the vehicle quality inspection on the 1st try, thereby allowing more efficient resource management and providing a basis for root-cause analysis that would reveal the most significant factors leading to poor car quality.
2.2.2 Data Understanding
This stage was the most challenging part of the project. Nobody on our team had previous experience in the automotive industry, and we had to deal with new terms and processes typical of an assembly line environment. A detailed description of the data wasn’t available, but we had the chance to work closely with XXX, and they helped us understand the data correctly.
Collect Initial Data
We received the initial data in space-delimited format. The data included information about defects (3 million records), car movements along the assembly line (5 million records), and essential information about the car orders (100 thousand records). We had to restructure the data into comma-delimited format before we could start to explore it in IBM BigSheets. The collected data was uploaded to IBM BigInsights and its Hadoop Distributed File System.
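The restructuring step itself is simple; a sketch of the space-delimited-to-CSV conversion is shown below. The sample records and field layout are invented for illustration and do not reflect the real file format.

```python
import csv

# Sketch of the restructuring step: read whitespace-delimited records and
# rewrite them as comma-delimited rows. The sample layout is illustrative.

def space_to_csv(src_lines, dst_path):
    with open(dst_path, "w", newline="") as out:
        writer = csv.writer(out)
        for line in src_lines:
            fields = line.split()    # collapse runs of whitespace
            if fields:               # skip blank lines
                writer.writerow(fields)

sample = ["CAR001  2013-01-15  DEFECT42", "CAR002  2013-01-16  DEFECT07"]
space_to_csv(sample, "defects.csv")
print(open("defects.csv").read())
```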
Defect data (original field: description):
field1: car ID
field2: description of the defect (defect + object)
field3: high-level group where the responsibility lies
field4: where the defect happened
field5: position
field6: object
field7: type of the defect (problem) that occurred
field8: location of the defect
field9: timestamp when the defect was reported
field10: timestamp when the defect was reported as fixed
field11: ID of the employee who opened the defect
field12: ID of the employee who closed the defect
field13: station where the defect was opened
field14: station where the defect was closed
field15: who is responsible for the defect (originator)
Body status data (original field: description):
1: car ID
2: timestamp of the order
3: status code
4: checkpoint ID
5: status description
6: employee ID
7: car ID
8: VIN number
Vehicle order data (original field: description):
1: car ID
2: VIN number
3: exterior color
4: interior color
5: car model
6: transmission type
7: type of the engine
8: model year
9: transmission
10: equipment pieces (radio + type, …)
We did a preliminary exploration of the data in IBM BigSheets, where we performed a number of join, filter, and order operations to get familiar with the data and see the relationships among foreign keys. After the preliminary exploration, we used IBM SPSS Analytic Catalyst, which automatically explored the data and discovered statistically significant relationships.
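The kind of join we did in BigSheets can be sketched with plain Python dictionaries: attach vehicle-order attributes to each defect record via the (hashed) car ID. The records and field names below are simplified and hypothetical.

```python
# Sketch of joining two of the files on the shared car ID, BigSheets-style.
# An inner join keeps only defect rows whose car ID appears in the orders.
# All records here are invented for illustration.

orders = {
    "CAR001": {"model": "sedan", "engine": "V6"},
    "CAR002": {"model": "coupe", "engine": "I4"},
}
defects = [
    {"car_id": "CAR001", "defect": "paint scratch"},
    {"car_id": "CAR001", "defect": "loose trim"},
    {"car_id": "CAR002", "defect": "door misalignment"},
]

joined = [{**d, **orders[d["car_id"]]} for d in defects if d["car_id"] in orders]

print(len(joined))
print(joined[0])
```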
Verify Data Quality
To verify data quality (missing values, extreme values, etc.), we used SPSS Modeler Client and its Data Audit function.
2.2.3 Data Preparation
The data preparation stage was another challenge in terms of transforming the data into a structure appropriate for building a predictive model. Our source data had each car's information spread across several rows, with multiple rows representing one car. Those rows had to be converted into a single row representing one unique car, with fields (columns) carrying information about the car (the predictors).
The data preparation stage was done in SPSS Modeler. Following is a screenshot of the data stream that reads the data, transforms it into the appropriate format, and builds the predictive model.
The screenshot shows a data stream (flow) in SPSS Modeler. We used field and record operators such as aggregation, filtering, sorting, restructuring, deriving new fields, and merging to achieve the structure that is later used to build the predictive model. The 15 fields you can see in the following picture are just a simplified version of the original 425 fields that were used as predictors.
Every row represents one unique car. The first field is the car ID. The NumberOfDefects field is the total number of defects the car had before arriving at the final quality checkpoint. StartDayInWeek is the day of the week when assembly of the car started (1=Sunday, 7=Saturday). StartHour is the hour when the assembly started (on a 24-hour clock). TimeToAssemble indicates how long the assembly took; notice that some numbers are negative, which is a result of the Automatic Data Preparation node in SPSS Modeler. The last column, IsGoodCar, is our target value and indicates whether the car passed the final quality checkpoint on the first try (T) or not (F).
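The restructuring that the SPSS stream performs can be sketched in a few lines of plain Python: collapse the multiple body-status and defect rows for each car into one row of per-car predictors. The timestamps, field names, and derivation rules below are invented for illustration and simplify what the real stream does.

```python
from collections import defaultdict
from datetime import datetime

# Sketch of "many rows per car -> one row per car". Hypothetical records:
status_rows = [  # (car_id, timestamp) from the body status file
    ("CAR001", "2013-03-04 06:30"), ("CAR001", "2013-03-04 14:10"),
    ("CAR002", "2013-03-05 07:05"), ("CAR002", "2013-03-05 16:45"),
]
defect_rows = ["CAR001", "CAR001", "CAR001"]  # car_id per defect record

cars = defaultdict(lambda: {"NumberOfDefects": 0, "times": []})
for car_id in defect_rows:
    cars[car_id]["NumberOfDefects"] += 1
for car_id, ts in status_rows:
    cars[car_id]["times"].append(datetime.strptime(ts, "%Y-%m-%d %H:%M"))

rows = []
for car_id, agg in sorted(cars.items()):
    start, end = min(agg["times"]), max(agg["times"])
    rows.append({
        "car_id": car_id,
        "NumberOfDefects": agg["NumberOfDefects"],
        "StartDayInWeek": start.isoweekday() % 7 + 1,  # 1=Sunday .. 7=Saturday
        "StartHour": start.hour,
        "TimeToAssembleMin": int((end - start).total_seconds() // 60),
    })

print(rows[0])
```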
We approached the Data Understanding, Data Preparation, and Modeling parts of the project iteratively. As we added more predictors along the way, we moved from the Modeling stage back to the Data Understanding and Data Preparation stages until we arrived at the final model described in this document.
Select Modeling Techniques
SPSS Modeler offers a number of models, which can be classified into three main approaches: Classification, Segmentation, and Association. We used the Classification approach, which can predict a field using one or more predictors.
SPSS Modeler contains around 18 different classification models. At the beginning of our project we used the Auto Classifier, which builds several models and automatically chooses the 3 that achieve the highest accuracy.
After running the Auto Classifier, we focused on the models that achieved the highest prediction accuracy and tried to improve them further by changing model and boosting settings.
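What the Auto Classifier automates can be sketched as a simple loop: evaluate several candidate models on the same training data and keep the top performers by accuracy. The "models" below are trivial hand-written rules on invented data, purely to illustrate the ranking step, not SPSS's actual algorithms.

```python
# Sketch of Auto-Classifier-style model selection: score every candidate
# on the training set and keep the best 3 by accuracy. Data and rules are
# hypothetical.

training = [  # (number_of_defects, passed_first_try)
    (0, True), (1, True), (2, True), (5, False), (7, False), (3, False),
]

candidates = {
    "always_pass": lambda d: True,
    "threshold_2": lambda d: d < 2,
    "threshold_3": lambda d: d < 3,
    "threshold_5": lambda d: d < 5,
}

def accuracy(model):
    return sum(model(d) == y for d, y in training) / len(training)

ranked = sorted(candidates, key=lambda name: accuracy(candidates[name]),
                reverse=True)
best_three = ranked[:3]
print(best_three)
```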
We used the Analysis feature to assess the quality of our model, focusing primarily on accuracy. Initially we used a Neural Network model that achieved an accuracy of 70%. By adding more predictors and trying different models and boosting techniques, we ended up with a C5.0 model and an accuracy of 85.4%.
Our C5.0 predictive model achieved an accuracy of 85.4% (on the training data set) with 425 predictors.
Following is the model output showing just a small fraction of the predictors (out of 425 total). The first column is not considered a predictor. The third field from the end (isGoodCar) is the original target value indicating whether the car passed the last quality check on the first try (T) or not (F). The second field from the end ($C-isGoodCar) is the value predicted by the model. The last field ($CC-isGoodCar) is the confidence of the predicted value. Rows with green cells were predicted correctly.
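A decision-tree model like C5.0 typically derives a prediction and its confidence from the training cars that ended up in the same leaf: the predicted class (the $C-isGoodCar field) is the leaf's majority class, and the confidence (the $CC-isGoodCar field) is that class's proportion in the leaf. The sketch below illustrates this with invented leaf counts; it is a simplification of what C5.0 actually reports.

```python
# Sketch of how a tree leaf yields a predicted value plus a confidence:
# majority class and its proportion. The counts below are hypothetical.

def leaf_prediction(counts):
    """counts: dict mapping class label -> number of training cars in leaf."""
    label = max(counts, key=counts.get)                 # like $C-isGoodCar
    confidence = counts[label] / sum(counts.values())   # like $CC-isGoodCar
    return label, confidence

label, conf = leaf_prediction({"T": 90, "F": 10})
print(label, conf)
```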
Determine next steps
Building a predictive model is an iterative process. The next steps should focus on incorporating more predictors into the calculations so that the accuracy of the model improves and better root-cause analysis can be performed. A diverse dataset is key, so the next steps should incorporate the following data:
- Plant environment data (temperature, humidity, pressure)
- HR data – information about employees working on the cars (experience, age, …)
- Supplier & parts data
- Warranty data
- Social media data
Another important step is to incorporate the predictions into actual plant processes to achieve the desired efficiency improvements and better resource management.
3. Analysis of Datameer Solution
Datameer Analytics Solution (DAS) is an easy-to-use data analytics tool that builds on the power and scalability of Apache Hadoop. DAS primarily consists of 4 major components: data source integration tools, storage, an analytics engine, and visualization tools. In addition, it also offers an App Market for sharing pre-built data analytics applications.
DAS’s ease of use is predicated on its workbook-style interface. Data files can be uploaded onto its file system or pulled from several data sources using its various connectors. There are connectors to various web services such as Amazon EC2, Facebook, and Google Analytics, as well as to databases (via JDBC), HBase, Hive, and remote file systems (via SSH/SFTP). If the data is unstructured, such as log files, XML files or text, it must be parsed first. Once the data is parsed and imported, DAS offers a spreadsheet view of a subset of the data. The spreadsheets have a familiar interface for anyone who has worked with Microsoft Excel and make visualizing the data simple. Data analysis is done in these workbooks. There are over 175 analytical functions that can be applied in the workbooks. These functions work much like those in Microsoft Excel, except that calculations are done on a column basis, not a cell basis. The available analytical functions range from comparison and math functions to text and grouping functions; for example, the GROUPSTDEVP function computes the standard deviation over the entire population within a group. Different workbooks can also be joined together using an inner, outer, or self join to enable analysis over varying datasets. Simple and complex formula-based filters can be applied to the data, as well as sort functions. In addition, because Datameer is extensible, you can create custom analytical functions or use ones created with third-party tools.
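To illustrate what a column-wise grouping function such as GROUPSTDEVP computes, here is a small stdlib sketch: the population standard deviation of a numeric column within each group. The sample rows are invented; this only mirrors the computation, not DAS's workbook mechanics.

```python
from collections import defaultdict
from statistics import pstdev

# Sketch of a GROUPSTDEVP-style computation: population standard deviation
# per group, applied column-wise. The rows below are hypothetical.

rows = [("paint", 2.0), ("paint", 4.0),
        ("trim", 1.0), ("trim", 1.0), ("trim", 4.0)]

groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

result = {key: pstdev(values) for key, values in groups.items()}
print(result)
```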
After the analysis is done on the workbooks, which contain a subset of the data, it can be run on the full data set and then visualized using an extensive set of visualization tools provided in DAS. DAS places emphasis on visualization as the primary way of conveying the information discovered as a result of the data analysis effort. These visual reports are called infographics.
Datameer offers extensive documentation and several examples/demos showing how to use DAS for common real-world analytics efforts. Other means of obtaining support are context-sensitive help, online forums, and standard and premium support options.
As mentioned previously in the project approach section, we decided to build the Hadoop cluster required for Datameer Analytics Solution (DAS) locally as opposed to in the cloud. That decision ended up cutting into our 8-week time frame for the project, as building the Hadoop cluster took some time and effort. However, the installation of DAS itself was quick and easy and can be done in under an hour. A production deployment may take longer, because a MySQL database must be installed and configured if one is not already available.
DAS’s primary analytics functions are descriptive in nature, i.e. they describe and reveal features of the data. DAS offers one predictive analytics tool: the Recommendation Engine. The Recommendation Engine automatically predicts the interests of a person based on historical observations of other people’s interests. It takes in 3 pieces of information: the user, the product the user purchased, and the rating given to the product by the user. It can then produce a list of other items the user may be interested in, with a corresponding rating score. Since no other predictive analytics tools are available in DAS, in order to accomplish the type of analysis this project requires we needed to take the DAS approach of extending its functionality with custom functions. With this approach, predictive models created in third-party analytics tools such as SAS, SPSS, R, and KNIME can be executed in DAS as part of a workbook. This works by exporting the model created in the external tool to a PMML file. PMML (Predictive Model Markup Language) is an XML-based file format developed by the Data Mining Group to provide a way for applications to describe and exchange models produced by data mining and machine learning algorithms. It supports common models such as logistic regression, decision trees, neural networks, clustering models, etc. The predictive model in the form of the PMML file can then be imported into DAS using the Zementis Universal PMML Plug-in, which makes it available as a custom function. This model could then be used in a workbook that combines the body status, defects, and order status tables, depending on the inputs required by the model.
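Because PMML is plain XML, any tool can inspect an exchanged model. Below is a heavily simplified, hand-written fragment (not our actual exported C5.0 model, whose real structure is far larger) together with a sketch of listing the fields its data dictionary declares, using only the Python standard library.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written PMML-style fragment for illustration only; a real
# exported model would declare hundreds of fields plus the model element.
PMML = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="2">
    <DataField name="NumberOfDefects" optype="continuous" dataType="integer"/>
    <DataField name="IsGoodCar" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""

# Namespace-aware lookup of the declared fields.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}
root = ET.fromstring(PMML)
fields = [f.get("name") for f in root.findall(".//pmml:DataField", NS)]
print(fields)
```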
The majority of the CRISP-DM solution described in the previous section, which was applied when working on the IBM implementation, applied to the Datameer solution as well. The business understanding and data understanding processes were shared between both products. We were also able to export the predictive model created in SPSS Modeler and import it into DAS using the Zementis plugin, which made it available as a function. However, due to the short time frame for the project and the extensive work done on the IBM implementation, we did not completely implement all the data preparation and model execution steps needed to predict vehicle inspection successes or failures in DAS. We did, however, do enough research and analysis to make an informed recommendation, which is presented in the next section.
4. Recommendation
The second goal of this project was to provide a recommendation of one software package over the other: IBM InfoSphere BigInsights versus Datameer Analytics Solution. The recommendation we provide is based solely on this particular use case of performing predictive analytics in the manufacturing process of the automotive industry. We recognize that other fields and industries have other needs, and for those situations we would recommend the tool(s) that best meet their needs.
Before we offer a recommendation, we need to answer a question: does this problem require a Big Data analytics solution? The answer depends on the size and variety of the data. The predictive model we created used 12 months of production data and a limited number of predictors, and attained 85.4% accuracy on its predictions over the training set. One way to improve the accuracy of the model is to include additional predictors such as plant environment data (temperature, humidity, pressure), specific employee data, supplier and parts data, etc. This will necessitate pulling additional data sources into the model, which increases the amount of data that must be processed. Increasing the amount of production history data used in the model will also improve its reliability and accuracy. The amount of production history data used for this project was only 757MB in its raw form (the initial size was 3.71GB, but it was cut down drastically after unnecessary whitespace characters were removed). At the current data size, the use of powerful multi-core servers is still an option that cannot be ruled out. However, the amount of data will likely need to grow in order to achieve higher levels of accuracy.
Both the IBM and Datameer offerings are very capable tools with areas in which they excel. However, based on our research for this project, the solution we recommend is the combination of IBM SPSS Modeler and InfoSphere BigInsights.
As mentioned earlier, Datameer Analytics Solution is highly focused on descriptive analytics over Big Data. The purpose of descriptive analytics is to discover relationships in data and describe the main features of a data set. The majority of the analytical tools in DAS are descriptive rather than predictive, and of the 4 tools in the Smart Analytics suite, 3 are descriptive, with a limited Recommendation Engine being the sole predictive analytics tool. Because DAS specializes in descriptive analytics and visualization, it is not a suitable solution for the task of predictive vehicle inspection on its own. As noted in the previous section, DAS can be extended using the Zementis plug-in to allow algorithms developed outside of its suite to be run in the DAS environment. This gives users the freedom to develop a predictive algorithm in any tool they prefer, such as SAS Enterprise Miner, SPSS, KNIME or R, and then import the algorithm as a function that can be applied to records in DAS. The problem with this approach, however, is that it does not allow the development of the predictive model to be properly integrated into the overall Big Data analytics environment (Hadoop). For this project we needed to be able not only to execute the model using Hadoop but also to develop, test, and tune the model using Hadoop. With the DAS approach, the analytics tool used to produce the predictive model would be expected to use a smaller subset of the data instead of the full data set backed by the processing and storage power of Hadoop. Theoretically, it is possible to link some analytics tools to a Hadoop infrastructure to allow them to develop, test, tune, and execute over the full data set, but that would render DAS redundant and unnecessary for the purposes of this project.
The solution provided by IBM is more comprehensive. IBM InfoSphere BigInsights goes beyond the BigSheets functionality (which is analogous to Datameer Analytics Solution): it also offers installation and management support for the Hadoop ecosystem, text analytics, monitoring, workload optimization, and a Big Data SQL engine, in addition to other capabilities. BigInsights allows users to easily install, manage, and operate their entire Big Data analytics infrastructure from one web-based tool, whereas DAS has no management capabilities, so management must be done with a separate tool. DAS also does not offer text analytics or Big SQL. Although text analytics is not needed in the context of this project, Big SQL offers analysts a familiar method of writing queries over their Big Data dataset. In addition to these extra capabilities, the integration of BigInsights with IBM SPSS Modeler is an advantage over the combination of DAS and another tool. SPSS Modeler can communicate directly with BigInsights during the development and execution of a model. The user only has to interface with SPSS Modeler, which takes care of storing and processing the data optimally using BigInsights. And while the user is not locked into using SPSS Modeler to create the predictive model when using BigInsights as the back end, its integration, ease of use (due to the drag-and-drop interface), and comprehensive collection of pre-built algorithms for just about any problem make a compelling argument in its favor over KNIME or SAS Enterprise Miner. SPSS Modeler is also considered by many to be a better product than SAS Enterprise Miner in terms of licensing, ease of use, and flexibility.
Those are the reasons why we would recommend the combination of SPSS Modeler and BigInsights for the task of predictive analytics for auto-manufacturing.
In conclusion, our team’s finding is that Big Data analytics tools are relevant in solving problems in the automotive manufacturing industry. They present an opportunity to introduce additional insight and intelligence into many areas of the process, specifically in quality assurance. This project, as well as efforts by other car makers such as BMW and Volvo, demonstrated that big data analytics can provide value by helping pinpoint the root causes of failures and accurately predict when failures will occur, thereby providing an early warning system.
We also found that the better solution for predictive analytics in the automotive manufacturing industry is the combination of IBM SPSS Modeler and BigInsights, as it provides better management, easier and more seamless integration, and better capabilities than the combination of Datameer and an external analytics package integrated using the Zementis Universal PMML Plug-In.