At the core of today's state-of-the-art artificial intelligence (AI) algorithms is the ability to learn complex patterns from a sample of data. In the manufacturing context, an example of a pattern might be the ways in which a set of process-related parameters contained in that data vary together. When considering AI, it's important to understand what the data requirements are at the outset.
The algorithm learns the patterns by being shown many examples of the parameter values in question — typically between a few thousand and several million. This data sample is a representation of the history of the factory process. Now, if a trend exists in the sample to the effect that, for example, every increase in the process temperature by 1°C tends to be accompanied by a decrease in the process's time by 10 seconds, the AI will learn this apparent relationship between the temperature and time parameters. In this way, the AI effectively learns a model of the process. It does so automatically, assuming that it is properly designed and fed enough examples of the right data.
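The idea of learning a relationship from many examples can be sketched with synthetic data. The column names, values, and the use of a simple linear fit below are illustrative assumptions, not a description of any particular AI model:

```python
import numpy as np

# Synthetic history of a hypothetical factory process: 5,000 production
# examples in which each +1 °C of process temperature tends to come with
# roughly -10 s of process time, plus measurement noise.
rng = np.random.default_rng(0)
temperature = rng.uniform(180.0, 220.0, size=5000)                  # °C
process_time = 2500.0 - 10.0 * temperature + rng.normal(0.0, 5.0, 5000)  # s

# Fit a simple linear model: time ≈ intercept + slope * temperature.
slope, intercept = np.polyfit(temperature, process_time, deg=1)
print(f"learned slope: {slope:.1f} s per °C")  # close to -10.0
```

A real AI model would capture far more complex, nonlinear interactions across many parameters at once, but the principle is the same: shown enough examples, the model recovers the trend that is present in the data.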
What is the right data for AI?
What constitutes the "right" data for AI-enabled process optimization? The general answer is the set of data that is sufficient to describe how changes to a process's parameters affect quality. The bulk of process data can generally be represented as a table, or a collection of tables, comprising columns (parameters) and rows (production examples, representing, say, one production batch per row). In order to be meaningful as a representation of a process, or more specifically of the history of a process, these tables need to be accompanied by some explanatory information. Let us start by looking at the kinds of explanatory information that are necessary, before discussing the data requirements in terms of those tabular columns and rows.
The key pieces of explanatory information, required by the data science team, are:
• A high-level description of the physical process;
• A description of the flow of production through the process (normally in the form of a process flow diagram), including in some contexts the time offsets between process steps; and
• A description of how the data table(s) relate to the process.
Some of these descriptions can be obtained from the available technical documentation. In most cases, however, the necessary insights can be learned by walking through the data tables with specialists from the factory.
Due to the nature of AI-enabled parameter optimization, there are some clear fundamentals that the bulk of the data — the data tables — needs to satisfy. This paper outlines these fundamentals in terms of data columns, before turning to the row-wise requirements.
Data columns: a representation of quality
The data columns need, first, to include a representation of the quality result. It's important to note that the data might not contain a full representation of how quality is measured in the factory. Such gaps are common (batch sampling, for example), yet even incomplete quality data can in some cases be sufficient to achieve dramatic results.
The second set of required data columns concerns process parameters. These fall into two types: controllable and non-controllable parameters.
• Controllable parameters are the ‘levers’ available to the factory operator to alter the process and thus to improve quality. In general terms, these could include controllable aspects of the process chemistry, temperature, and time.
• Non-controllable parameters represent inputs to the process that cannot be controlled by the plant operator from day to day, such as the ambient temperature, the identity of the machine (in the case of a parallel process), or characteristics of the input material.
These parameter columns should together represent the factors that have the greatest influence on quality.
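The tabular layout described above can be sketched as follows. All column names and values here are hypothetical, chosen only to illustrate one row per production batch with quality, controllable, and non-controllable columns side by side:

```python
import pandas as pd

# Hypothetical process history: one row per production batch.
batches = pd.DataFrame({
    "batch_id":       [101, 102, 103],
    # Controllable parameters: the operator's 'levers'
    "temperature_c":  [195.0, 198.5, 192.0],
    "process_time_s": [540.0, 505.0, 570.0],
    "reagent_conc":   [0.82, 0.85, 0.80],
    # Non-controllable parameters: inputs outside day-to-day control
    "ambient_temp_c": [21.3, 24.1, 19.8],
    "machine_id":     ["A", "B", "A"],
    "input_moisture": [0.031, 0.028, 0.035],
    # Quality result: the target the model learns to predict
    "quality_score":  [0.94, 0.91, 0.96],
})
print(batches.shape)  # (3, 8)
```

In practice such a table would run to thousands of rows, and the columns would often be spread across several tables linked by batch or timestamp identifiers.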
However, due to the ability of AI models to learn complex interactions in a large number of variables, a manufacturer is best advised to offer all available data points around the process for inclusion in the AI model. The cost of including additional variables is low. A good AI specialist will employ the necessary statistical techniques to determine whether each variable should be included in the final model. Variables that might be considered marginal at first may contribute to an AI model that leverages effects and interactions in the process of which the specialists had previously been unaware, potentially resulting in an improved optimization outcome.
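One of the simplest screening techniques a specialist might apply is ranking candidate variables by their correlation with the quality result. The variable names and synthetic data below are assumptions for illustration; real projects would combine several, richer statistical tests:

```python
import numpy as np

# Synthetic quality result and three candidate variables: one with a
# strong signal, one marginal, and one that is pure noise.
rng = np.random.default_rng(1)
n = 2000
quality = rng.normal(size=n)
candidates = {
    "temperature":  quality * 0.8 + rng.normal(size=n) * 0.6,  # strong
    "humidity":     quality * 0.1 + rng.normal(size=n),        # marginal
    "shift_number": rng.normal(size=n),                        # noise
}

# Rank candidates by absolute correlation with the quality result.
ranked = sorted(
    ((name, abs(np.corrcoef(values, quality)[0, 1]))
     for name, values in candidates.items()),
    key=lambda kv: kv[1],
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: |r| = {r:.2f}")
```

Note that a marginal variable like the hypothetical "humidity" here would not necessarily be discarded: it may still contribute through interactions that a simple correlation cannot detect, which is exactly why the final decision belongs to the modeling process rather than to a single screening pass.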
Row-wise data requirements
Let's turn now to the row-wise data requirements. The general rule here is that the data needs to be representative of the process, and in particular of the interactions that are likely to affect quality in the future. A basic aspect of this is to ask: How many rows, i.e. production examples, make a sufficient training set? The answer depends on the complexity of the process. The sample needs to be a sufficient representation of this complexity. In the manufacturing context, the lower bound typically ranges from a few hundred to several thousand historical examples. Training a model on more data than is strictly sufficient, however, tends to increase the model's confidence and level of detail, which in turn is likely to further improve the optimization outcome.
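The effect of sample size on the model's confidence can be illustrated by fitting the same simple model on increasing numbers of historical rows and watching the learned parameter stabilize. The data and the linear model below are purely illustrative assumptions:

```python
import numpy as np

# Synthetic process history with a true slope of -10 s per °C and
# substantial measurement noise.
rng = np.random.default_rng(2)
temperature = rng.uniform(180.0, 220.0, size=20000)
process_time = 2500.0 - 10.0 * temperature + rng.normal(0.0, 25.0, 20000)

# Fit on progressively larger samples; the estimate tightens around -10.
for n in (100, 1000, 10000):
    slope = np.polyfit(temperature[:n], process_time[:n], deg=1)[0]
    print(f"n = {n:>6}: learned slope = {slope:.2f}")
```

The scatter of the estimate shrinks as the sample grows, which is the simple-model analogue of the increased confidence and level of detail described above.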
A sufficient number of historical examples does not in itself guarantee a representative sample. The historical examples should also be representative with respect to time. The data set should be sufficiently recent to represent the likely operating conditions — like machine wear — at the time of optimization. In many cases, the data should also represent one or more sufficient periods of continuous operation, as this allows the AI to learn which operating regions can be sustained as well as how effects from one part of the process propagate to others over time.
Consistency and continued data availability
This brings us to the last key data requirement, namely consistency and continued availability. In order to keep the AI model current with operating conditions on the production line, fresh data needs to be available for regular retrains of the model. This in turn requires some level of integration with the data source. In a worst-case scenario, this might mean a continuous digitization effort if the record-keeping system is offline, or manual exports of tabular data performed by factory technicians. These approaches are relatively labor-intensive and may be subject to inconsistencies. An ideal setup would consist of a live data stream from the manufacturer's data bus into a persistent store dedicated to supplying the AI training pipeline. For some manufacturers, a mixture of approaches is appropriate to cater for multiple plants.
Continued data availability goes hand in hand with the requirement for data consistency. This can best be illustrated with a negative example, in which a factory intermittently changes the representation of variables in data exports, such as whether a three-state indicator is represented as a number from the set {1, 2, 3} or as a string of text from the set {'red', 'orange', 'green'}. If uncaught, these types of changes could quietly corrupt the optimization model and potentially result in a negative impact on process quality.
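One defensive measure against this kind of inconsistency is to normalize every known encoding of a variable to a single canonical form at ingest time, and to fail loudly on anything unrecognized. The mapping below is a minimal sketch for the hypothetical three-state indicator described above:

```python
# Both known encodings of the indicator map to one canonical form.
CANONICAL = {
    1: "red", 2: "orange", 3: "green",
    "red": "red", "orange": "orange", "green": "green",
}

def normalize_indicator(value):
    """Map any known encoding to its canonical form, or fail loudly."""
    try:
        return CANONICAL[value]
    except KeyError:
        raise ValueError(f"unrecognized indicator value: {value!r}")

print(normalize_indicator(2))        # "orange"
print(normalize_indicator("green"))  # "green"
```

The important design choice is the loud failure: silently passing an unknown encoding through is precisely how a representation change corrupts the model without anyone noticing.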
The digitization and automation of process data infrastructure and data exports go a long way toward addressing these issues. Whatever the factory's data infrastructure, however, a good AI ingest pipeline should feature a robust data validation layer to ensure inconsistencies are flagged and fixed.
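A validation layer of this kind can be sketched as a schema check applied to every incoming row, collecting human-readable issues instead of silently passing bad rows to the training pipeline. The column names, types, and plausible ranges below are illustrative assumptions:

```python
# Hypothetical expected schema: column -> (type, plausible min, plausible max).
EXPECTED = {
    "temperature_c":  (float, 100.0, 300.0),
    "process_time_s": (float, 0.0, 3600.0),
    "quality_score":  (float, 0.0, 1.0),
}

def validate_row(row):
    """Return a list of issues for one row; an empty list means it is clean."""
    issues = []
    for column, (ctype, lo, hi) in EXPECTED.items():
        if column not in row:
            issues.append(f"missing column: {column}")
            continue
        value = row[column]
        if not isinstance(value, ctype):
            issues.append(f"{column}: expected {ctype.__name__}, "
                          f"got {type(value).__name__}")
        elif not lo <= value <= hi:
            issues.append(f"{column}: value {value} outside [{lo}, {hi}]")
    return issues

print(validate_row({"temperature_c": 195.0,
                    "process_time_s": 540.0,
                    "quality_score": 0.94}))  # []
print(validate_row({"temperature_c": "195",
                    "process_time_s": 540.0}))
```

In production this logic would typically live in a dedicated validation framework with alerting attached, but the principle is the same: every inconsistency surfaces as an explicit, reviewable flag before the data reaches a retrain.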