Data has long since become one of the most valuable resources of our time. However, this value only arises when the often hidden correlations are recognized and the right conclusions are drawn. Because this is a highly complex undertaking, an interdisciplinary field of study in computer science and statistics has emerged: Data mining uses a variety of computer-aided methods to unlock the secrets of the data treasure trove. Association analysis, which makes use of some astonishingly simple principles, has been particularly successful.
You are reading an auto-translated version of the original German post.
What is an association analysis?
Association analysis is a data mining method for identifying correlations between objects within a database. Based on observed frequencies, it determines whether certain combinations of objects occur together with a certain probability. The ultimate aim is to establish association rules, which can typically be expressed as simple if-then statements (X → Y). Other methods such as variance analysis, on the other hand, are more concerned with numerical properties and target values.
Association analysis can be largely automated using various algorithms. As it involves computer-controlled data analysis with decisions and forecasts derived from it, it is machine learning in the classic sense. The technique has its origins in shopping basket analysis, which examines the relationships between purchasing decisions. This is still the most common use case today and will repeatedly serve as an example here. However, the principle has since been extended to other data structures.
Important basic terms
The first step is to define some basic terms and key figures. Their mathematical-logical linking ultimately forms the actual process of association analysis.
Items: These are the objects in the population between which the association analysis examines relationships. When it comes to items in a supermarket, this often involves over 10,000 units. For e-commerce providers such as Amazon, however, shopping basket analysis quickly becomes a big data project with several hundred million products.
Item Set: This refers to a combination of items - usually to express that they have appeared or been purchased together with a certain frequency (e.g. {milk, bread, butter}). The frequency of the set also determines whether an association rule can be derived from it.
Support: More meaningful than the absolute frequency is its ratio to the total number of transactions, i.e. the relative frequency. For the binary decision as to whether an item set is considered frequent, a corresponding threshold value, the minimum support, must be set.
Confidence: This key figure expresses the relative frequency with which different item sets occur together in a transaction. To calculate the confidence, the frequency of the set (or its support) is divided by the frequency of the individual item on the left-hand side of the rule. This results in values up to a maximum of 1; a confidence of 1 would mean that the item only ever appears in common transactions.
Lift: Not every association rule has a high information content. To map this, the confidence of a rule is divided by an expected confidence value. This results in a metric that indicates whether a data-based finding is of particular relevance. A lift of 1 means that the rule corresponds to the statistical expectation. The higher the lift, the more significant the established correlation.
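The three key figures can be sketched in a few lines of Python. This is an illustrative snippet only; the mini basket data at the end is made up, and the expected confidence is passed in as a parameter because different conventions exist for it:

```python
from typing import FrozenSet, List

def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Relative frequency of transactions containing the whole item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x: FrozenSet[str], y: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Support of the combined set divided by the support of the premise X."""
    return support(x | y, transactions) / support(x, transactions)

def lift(conf: float, expected_conf: float) -> float:
    """Observed confidence relative to the expected confidence."""
    return conf / expected_conf

# Made-up mini baskets for illustration
baskets = [frozenset(t) for t in (["a", "b"], ["a"], ["a", "b"], ["b"])]
print(support(frozenset(["a", "b"]), baskets))                  # 0.5
print(confidence(frozenset(["a"]), frozenset(["b"]), baskets))  # ≈ 0.667
```

The `itemset <= t` comparison checks whether the item set is a subset of a transaction, which is exactly the "occurs together" condition from the definitions above.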
Derive association rules
All the key figures shown are ultimately used to derive meaningful association rules. These have the typical form X → Y (support, confidence) and indicate the probability that item Y is purchased together with item X. Linguistically, this can be expressed in hypotheses such as "If (premise)... then (consequence)". Several steps are necessary to achieve this goal of association analysis, all based on a structured data set of item sets.
- Identify frequent item sets
Each rule begins with a suspicion or hypothesis. A simple frequency count of various item sets can be used for this, which can also be formed from parts of existing transactions/sets. If an item set is not only frequent in absolute terms but also has a high support, this strengthens the suspicion that X → Y. In this way, an algorithm filters out as many conspicuous item sets as possible.
- Examine rule
The confidence now forms the next examination instance and reveals more about the accuracy of a rule. Here, rules can be filtered again so that only correlations whose confidence is sufficiently close to 1 remain. Just because a rule is established does not mean that it is relevant for the planned investigation.
- Evaluate and apply rule
All that remains are association rules that meet the previously defined thresholds. Often, however, some correlations are known from the outset, so that a gain in knowledge only arises from an unexpectedly high effect strength. This assessment is the task of the lift, which simply compares the confidence with the expected value. If the expectation is clearly exceeded, business processes, inventories or supply chains can be adjusted accordingly.
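The three steps can be illustrated with a brute-force sketch in Python. This is for demonstration only (real applications use dedicated algorithms such as those described below), and the demo data and thresholds are made up:

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence):
    """Brute-force rule mining following the three steps above."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def sup(s):
        return sum(s <= t for t in transactions) / n

    # Step 1: identify frequent item sets by exhaustive enumeration
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if sup(frozenset(c)) >= min_support]

    # Step 2: split each frequent set into premise X and consequence Y
    rules = []
    for s in (f for f in frequent if len(f) >= 2):
        for k in range(1, len(s)):
            for x in map(frozenset, combinations(s, k)):
                conf = sup(s) / sup(x)
                if conf >= min_confidence:
                    # Step 3: evaluate with the lift; here the common
                    # convention expected confidence = support(Y) is used
                    y = s - x
                    rules.append((x, y, sup(s), conf, conf / sup(y)))
    return rules

# Made-up demo data and thresholds
found = mine_rules([["a", "b"], ["a", "b", "c"], ["b", "c"]], 0.6, 0.8)
for x, y, s, c, l in found:
    print(set(x), "->", set(y), round(s, 2), round(c, 2), round(l, 2))
```

The exhaustive enumeration in step 1 grows exponentially with the number of items, which is exactly why the specialized algorithms below exist.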
Example
The following list of shopping baskets has resulted from transactions in a supermarket:
| Transaction ID | Items |
| --- | --- |
| 1 | Bread, milk |
| 2 | Bread, eggs, beer |
| 3 | Milk, eggs, cola |
| 4 | Bread, milk, eggs, beer |
| 5 | Bread, milk, eggs, cola |
Even a simple frequency count shows that, for example, the item set {bread, milk} appears quite frequently, namely in three out of five transactions. Let the minimum support be Smin = 50%.
The support of {bread, milk} is S = 3/5 = 60% > 50%, so there is indeed a frequency of interest for the investigation. We therefore assume the rule bread → milk (60%, C%); the confidence C is still unknown.
To determine it, we divide the support of {bread, milk} by the support of {bread}: C = 60% / 80% = 75%. The complete association rule is thus bread → milk (60%, 75%). This seems convincing, as three quarters of bread purchases are also accompanied by milk.
In order to take measures to increase sales, the supermarket operators only want to consider the most meaningful rules. Because the items bread and milk had often been seen together on the checkout conveyor belt, an expected confidence of 60% was assumed. This results in a lift L = 75% / 60% = 1.25 > 1. Time to move the bread shelf closer to the milk.
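The calculation can be verified with a short script, here using Python's exact fractions to avoid any rounding issues:

```python
from fractions import Fraction

support_set = Fraction(3, 5)     # {bread, milk} in 3 of 5 transactions -> 60%
support_bread = Fraction(4, 5)   # bread appears in 4 of 5 baskets -> 80%
confidence = support_set / support_bread    # 3/4 -> 75%
expected_confidence = Fraction(3, 5)        # assumed expectation of 60%
lift = confidence / expected_confidence     # 5/4 -> 1.25 > 1
print(confidence, lift)  # prints: 3/4 5/4
```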
Three common algorithms
Of course, real-life applications are much more complex and can hardly be solved by such manual calculations. This is why association analysis is usually performed by appropriate algorithms. This allows significantly larger volumes of data to be analyzed, not to mention correlations between extensive item sets.
Apriori
The Apriori algorithm is one of the first of its kind and is still frequently used today. This is due in particular to its ease of use and implementation. In addition to the database, the only necessary inputs are the minimum support and the minimum confidence. In accordance with the procedure described, the program identifies all frequent item sets in the data and filters out rules that match the input. A special feature here is the consideration of the so-called a priori principle. This states that frequent item sets can only consist of subsets that are themselves frequent. Candidates for which this is not the case are cleaned up through so-called pruning, which optimizes the selection quality.
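A minimal sketch of the level-wise procedure, including the pruning step, might look like this in Python (illustrative only; production code would use an optimized library):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search for frequent item sets (minimal sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(s):
        return sum(s <= t for t in transactions) / n

    # level 1: frequent individual items
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]
    frequent = {s: sup(s) for s in level}

    k = 2
    while level:
        # join step: combine frequent (k-1)-sets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step (a priori principle): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        level = [c for c in candidates if sup(c) >= min_support]
        frequent.update({s: sup(s) for s in level})
        k += 1
    return frequent

# The example baskets from above, with a minimum support of 50%
baskets = [{"bread", "milk"}, {"bread", "eggs", "beer"}, {"milk", "eggs", "cola"},
           {"bread", "milk", "eggs", "beer"}, {"bread", "milk", "eggs", "cola"}]
frequent = apriori(baskets, 0.5)
print(sorted(tuple(sorted(s)) for s in frequent))
```

On the example data this yields exactly six frequent item sets: the three frequent single items and the pairs {bread, milk}, {bread, eggs} and {milk, eggs}; beer and cola fall below the threshold.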
FP-Growth
As a further development of Apriori, FP-Growth copes with rapidly growing data volumes and increases scalability and speed. A side effect, however, is a more complicated application. The original item sets are structured in a so-called Frequent Pattern Tree, whose nodes each represent an item. This compression offers advantages in storage and access, whereas Apriori would require countless scan runs over the data. Transactions with matching item sets share a common prefix, branching from the root of the tree. This helps the algorithm to ultimately filter out all frequent patterns.
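The construction of such a tree can be sketched as follows. This is a simplified illustration of the compression step only; the actual mining phase of FP-Growth is omitted:

```python
from collections import Counter

class Node:
    """One node of the Frequent Pattern Tree."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Build only the tree; shared prefixes are stored once."""
    freq = Counter(i for t in transactions for i in set(t))
    keep = {i for i, c in freq.items() if c >= min_count}
    root = Node(None, None)
    for t in transactions:
        node = root
        # order items by descending global frequency so that
        # frequent transactions share a common prefix near the root
        for item in sorted((i for i in set(t) if i in keep),
                           key=lambda i: (-freq[i], i)):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

# The example baskets from above: 16 item occurrences fit into only 10 nodes
baskets = [{"bread", "milk"}, {"bread", "eggs", "beer"}, {"milk", "eggs", "cola"},
           {"bread", "milk", "eggs", "beer"}, {"bread", "milk", "eggs", "cola"}]
tree = build_fp_tree(baskets, 2)
```

Four of the five baskets start with bread, so they all share the same branch from the root; this is exactly the prefix compression described above.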
ECLAT
Equivalence Class Clustering and Bottom-Up Lattice Traversal is another modern algorithm that is in no way inferior to FP-Growth. It also analyzes an independently generated data set instead of repeatedly scanning the original data. This takes a vertical format, i.e. the items are listed in a table on the left and the IDs of the transactions containing them are assigned on the right. These so-called tidsets are intersected by ECLAT to form item pairs with new, matching tidsets. The more common transactions these comprise, the more likely it is that there is a connection between the items.
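A compact sketch of the tidset intersection in Python (illustrative only; real implementations add further optimizations such as diffsets):

```python
def eclat(transactions, min_count):
    """Frequent item sets via intersection of vertical tidsets (sketch)."""
    # vertical format: item -> IDs of the transactions containing it ("tidset")
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in set(t):
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        # depth-first: extend the prefix with each remaining item and
        # intersect the tidsets; too few common transactions end a branch
        for i, (item, tids) in enumerate(candidates):
            common = prefix_tids & tids
            if len(common) >= min_count:
                new_prefix = prefix | {item}
                frequent[new_prefix] = len(common)
                extend(new_prefix, common, candidates[i + 1:])

    extend(frozenset(), set(range(len(transactions))), sorted(tidsets.items()))
    return frequent

# The example baskets from above with a minimum count of 3 (60% support)
baskets = [{"bread", "milk"}, {"bread", "eggs", "beer"}, {"milk", "eggs", "cola"},
           {"bread", "milk", "eggs", "beer"}, {"bread", "milk", "eggs", "cola"}]
frequent = eclat(baskets, 3)
```

On the example data this finds the same six frequent item sets as Apriori, but each intersection touches only transaction IDs rather than the full baskets.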
Areas of application
The most common use case of the method is hardly surprising at this point: Association rules are used throughout the retail industry to analyze shopping baskets in order to facilitate purchasing decisions and increase sales through cross-selling. However, this is by no means the only benefit of rule-based data mining. The following areas of application particularly benefit from association analysis:
Medicine
It is often not initially known exactly which characteristics may be risk factors or indicators of a particular disease. With the help of association rules based on health data, diagnostics and prevention can be facilitated. Using the algorithms described above in conjunction with natural language processing, Indian researchers were able to show how such correlations can be extracted.
UX design
User experience is a key aspect of websites and other digital products. The aim is to make the application and navigation as pleasant and simple as possible so that users find exactly what they are looking for. This can be facilitated by association analysis based on historical usage data, for example by adapting buttons and links. After all, what good is a shopping basket analysis if the path to the online store is too complicated?
Warehouse management
In large warehouses, the position of the items plays a decisive role in the efficiency of the company. Large orders should ideally be processed as quickly as possible, and the same applies to frequently requested goods and frequent sequences. Apriori and similar algorithms can make a significant contribution to minimizing the distances required for this.
Intelligent support for data mining with Konfuzio
The versatility of the possible applications makes it clear that almost every company of a certain size can benefit from data mining. However, methods such as association analysis first require a highly structured and high-quality database. The initial situation is often different: Image and text files, emails, PDFs etc. characterize many processes. That is why the AI platform Konfuzio supports companies in all necessary steps until effective knowledge is gained.
- Extract and structure data
Konfuzio knows how to deal with all the formats mentioned using different technical approaches. These include text recognition, image processing, a low-code integration for e-mail extraction, and more. The contained data is accurately extracted, cleansed and prepared in structured files. This creates a valuable basis for data mining.
- Data analysis and processing
Subsequently, Konfuzio allows a highly automated analysis of the data obtained using concepts from artificial intelligence and data mining. Various integrated models and algorithms are available. The extracted information can also be migrated to external tools in order to carry out highly individualized (association) analyses.
Conclusion
With the help of an association analysis, relationships between objects, items or articles can be determined in a simple way. The most important factor here is the frequency with which different combinations occur. The data mining method is therefore particularly well suited to examining purchases as part of a shopping basket analysis. Frequently used algorithms are Apriori, FP-Growth and ECLAT, which are also used for various other applications based on structured data. This creates insights that promote the sustainable conservation of resources or an increase in sales.
Would you like to find out more about data mining, the benefits of these processes for companies and how Konfuzio can accompany you on this journey? Write us a message.