Data mining refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases. The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive tasks, which describe the general properties of the existing data, and predictive tasks, which attempt to make predictions by inference from the available data.
Data Mining is the discovery of hidden information found in databases. Data mining functions include clustering, classification, prediction, and link analysis (associations).
Data mining is often defined as finding hidden information in a database. It has been called exploratory data analysis, data driven discovery, and deductive learning. Data mining access of a database differs from traditional access in several ways. The query might not be well formed. The data accessed is usually a different version from that of the original database. The output of the data mining query is probably not a subset of the database. Data mining algorithms can be characterized according to model, preference, and search. Model means that the algorithm fits a model to the data. Preference means that some criterion must be used to favor one model over another.
Search means that all algorithms require some technique to search the data. The model can be either predictive or descriptive in nature. A predictive model makes a prediction about values of data using known results found from different data. Predictive modeling may be based on the use of other historical data. For example, a credit card use might be refused not because of the user’s own credit history, but because of patterns found in other historical data. Predictive model data mining tasks include classification, regression, time series analysis, and prediction. A descriptive model identifies patterns or relationships in data. It serves as a way to explore the properties of the data examined. Clustering, summarization, association rules, and sequence discovery are usually descriptive in nature. In short, data mining is the practice of analyzing large databases and picking out relevant information.
Data mining should more appropriately have been called “knowledge mining from data”, which is somewhat long. However, the shorter term “knowledge mining” may not reflect the emphasis on mining from large amounts of data.
Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data (KDD), while others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process is shown in Figure 1.0 as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
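The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch, not the pipeline used in this thesis: the two record sources, the item names, the `relevant` item set, and the `min_count` threshold are all hypothetical, and the mining step is reduced to simple item counting.

```python
from collections import Counter

# Hypothetical raw records from two sources; None and "" stand in for noisy entries.
store_a = [["milk", "bread", None, "bag"], ["bread", "butter"], []]
store_b = [["Milk", "bread"], ["milk", "bread", "butter"], ["", "butter"]]

def clean(records):
    # Step 1: drop missing or empty items, then drop records left empty.
    cleaned = [[item for item in record if item] for record in records]
    return [record for record in cleaned if record]

def integrate(*sources):
    # Step 2: combine several data sources into one collection.
    return [record for source in sources for record in source]

def select(records, relevant):
    # Step 3: keep only the items relevant to the analysis task.
    selected = [[i for i in record if i.lower() in relevant] for record in records]
    return [record for record in selected if record]

def transform(records):
    # Step 4: consolidate each record into a lowercase item set.
    return [frozenset(i.lower() for i in record) for record in records]

def mine(transactions):
    # Step 5: a deliberately simple "mining" step that counts item occurrences.
    return Counter(item for t in transactions for item in t)

def evaluate(counts, min_count):
    # Step 6: keep only patterns that meet the interestingness threshold.
    return {item: c for item, c in counts.items() if c >= min_count}

def present(patterns):
    # Step 7: render the surviving patterns for the user.
    return [f"{item}: {c}" for item, c in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, {"milk", "bread", "butter"}))
report = present(evaluate(mine(data), min_count=3))
print("\n".join(report))
```

In this sketch each source is cleaned before integration; a real pipeline may iterate over these steps several times, as the text notes.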
Figure 1.0 Knowledge Discovery from Data (KDD) steps
One of the popular descriptive data mining techniques is association rule mining (ARM), first introduced by Agrawal, Imielinski, and Swami (1993). It is popular owing to its extensive use in the marketing and retail communities, in addition to many other diverse fields. Mining association rules is particularly useful for discovering relationships among items in large databases.
The initial research was largely motivated by the analysis of market basket data, the results of which allowed companies to understand purchasing behavior more fully and, as a result, to better target market audiences. To obtain the association rules, the frequent itemsets are found first.
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known as sequential patterns), and frequent substructures. A frequent itemset typically refers to a set of items that often appear together in a transactional data set, for example milk and bread, which are frequently bought together in grocery stores by many customers. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
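To make the notion of a frequent itemset concrete, the following minimal sketch enumerates every itemset in a tiny, made-up transaction list and keeps those that appear in at least `min_count` transactions. The transactions, the function name `frequent_itemsets`, and the threshold are assumptions chosen for illustration; the exhaustive enumeration is a naive baseline, not one of the algorithms compared in this thesis.

```python
from itertools import combinations

# Hypothetical market basket data: one set of items per transaction.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
    {"eggs"},
]

def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for every itemset contained in at least
    min_count transactions, by checking all possible itemsets."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate_set = set(candidate)
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if candidate_set <= t)
            if count >= min_count:
                result[candidate] = count
    return result

frequent = frequent_itemsets(transactions, min_count=3)
for itemset, count in frequent.items():
    print(itemset, count)
```

With this data only bread, milk, and the pair bread and milk reach the threshold, which mirrors the milk-and-bread example in the text.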
Association rule mining deals with the analysis of market basket databases to find frequent itemsets and to generate valid and important rules. Various association rule mining algorithms were proposed by Agrawal et al. in 1993 [1,2].
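As a hedged illustration of how rules are generated from a frequent itemset, the sketch below uses the standard confidence measure, confidence(A → B) = count(A ∪ B) / count(A). The text above does not define this measure, so its use here is an assumption, as are the sample transactions, the helper names, and the `min_confidence` value.

```python
from itertools import combinations

# Hypothetical market basket data: one set of items per transaction.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
    {"eggs"},
]

def count(itemset, transactions):
    # Number of transactions that contain every item in itemset.
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, transactions, min_confidence):
    """Return (antecedent, consequent, confidence) for each rule
    antecedent -> consequent derived from itemset that meets min_confidence."""
    itemset = frozenset(itemset)
    whole = count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            a = frozenset(antecedent)
            a_count = count(a, transactions)
            if a_count == 0:
                continue  # the antecedent never occurs, so no rule can be formed
            confidence = whole / a_count
            if confidence >= min_confidence:
                rules.append((set(a), set(itemset - a), confidence))
    return rules

rules = rules_from_itemset({"milk", "bread"}, transactions, min_confidence=0.8)
for antecedent, consequent, confidence in rules:
    print(antecedent, "->", consequent, confidence)
```

Here the rule milk → bread is kept because every transaction containing milk also contains bread, while bread → milk holds in only three of four bread transactions and is dropped.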
Frequent itemset mining is a major field of study in data mining. Its applications range from association rule mining and correlation analysis to graph pattern determination and various other data mining tasks. The existence of several algorithms, both classical and recently developed, makes it interesting to determine the most suitable and efficient algorithm to use. The major challenge in frequent itemset mining is the generation of a large result set, which is accompanied by heavy memory consumption. If the threshold is set relatively low, an exponentially large number of itemsets is generated, and some algorithms consume much memory and take a long time to generate the frequent itemsets. The main motivation behind this thesis is to compare the classical algorithms, which provide a base for mining frequent itemsets, with the newly proposed, state-of-the-art algorithms to see whether there is a relative improvement in the field of frequent itemset mining.
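The effect of the threshold on the size of the result set can be seen even on a toy example. The sketch below assumes four made-up transactions over four items, so there are 2^4 - 1 = 15 possible non-empty itemsets, and counts how many of them are frequent as a hypothetical minimum count is lowered. It uses naive enumeration purely for illustration.

```python
from itertools import combinations

# Hypothetical transactions over the four items a, b, c, d.
transactions = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b"}, {"a"}]

def count_frequent(transactions, min_count):
    """Count the itemsets contained in at least min_count transactions."""
    items = sorted(set().union(*transactions))
    total = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            occurrences = sum(1 for t in transactions if set(candidate) <= t)
            if occurrences >= min_count:
                total += 1
    return total

# Lowering the threshold from 4 down to 1 grows the result set rapidly.
counts_by_threshold = {m: count_frequent(transactions, m) for m in (4, 3, 2, 1)}
print(counts_by_threshold)  # {4: 1, 3: 3, 2: 7, 1: 15}
```

At the lowest threshold every one of the 15 possible itemsets is frequent, and the number of possible itemsets doubles with each additional item, which is why memory use and running time become the central concerns.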
1.3 PROBLEM STATEMENT
Frequent itemset mining is an important task in data mining. The problem statement is defined as follows:
“To determine the most effective and efficient algorithm for finding frequent itemsets among the classical algorithms and the newly proposed algorithms”.
The study gives deeper insight by analyzing the algorithms on real data.
1.4 Organization of the Thesis
The thesis is divided into five chapters. Chapter one introduces the thesis work, including details of data mining activities and the motivation behind the main work. Chapter two is the literature review, which surveys related work in the same field. Chapter three presents the main work and contains the methodology and techniques used in the thesis. Chapter four is the results and discussion section. It contains graphical representations of the results obtained and a detailed account of how the results were obtained. Finally, chapter five gives the conclusion of the thesis and future work.