Over the last twenty years, the amount of private data collected about individuals has grown enormously. This data comes from many sources, including medical, financial, library, telephone, and shopping records. Rapid advances in database, networking, and computing technologies make it possible to integrate and analyze such data digitally. On the one hand, this has led to the development of data mining tools that aim to infer useful trends from the data. On the other hand, easy access to personal data poses a threat to individual privacy. In this chapter we give a brief introduction to data mining, its applications, and its main techniques. We also discuss the privacy issues that arise in data mining and the growing public concern about them. We then review the various data modification techniques, weigh the pros and cons of each, and compare them. Finally, we discuss the challenges of privacy preservation, which ultimately informed the development of the model presented at the end.
Data mining is the extraction of hidden predictive information from large databases. It uses sophisticated algorithms to sort through large data sets and pick out relevant information. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. With the amount of data doubling each year, data mining is becoming an increasingly important tool for transforming this data into information. Data mining is the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently produced technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. It is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
– Massive data collection
– Powerful multiprocessor computers
– Data mining algorithms
1.2 Comparison with Traditional Data Analyses
End users are often confused about the difference between traditional data analysis and data mining. A basic difference is that, unlike traditional data analysis, data mining does not require predefined assumptions. The following example illustrates the difference [1, 2, 16, 11].
A regional manager of a chain of electronics stores may use traditional data analysis tools to investigate the sales of air conditioners in different quarters of the year. The manager can also use such tools to analyze the relationship between the sales of computers and printers in the stores of the region. Both scenarios have one thing in common: an assumption. In the first scenario the manager assumes that the sales volume of air conditioners depends on the weather and temperature. In the second, the manager assumes a relationship between the sales of computers and printers. In traditional data analysis, an end user needs some such assumption in order to formulate a query. Data mining, by contrast, lifts this barrier and allows end users to find answers to questions that traditional data analysis cannot handle. For example, the regional manager could look for answers to questions such as: Why are computer sales in a few stores not as high as in other stores? How can I increase overall sales in all stores?
Among all traditional data analyses, statistical analysis is the most similar to data mining. Many data mining tasks, such as building predictive models and discovering associations, could also be done through statistical analysis. An advantage of data mining is its assumption-free approach, whereas statistical analysis still needs a predefined hypothesis. Additionally, statistical analysis is usually restricted to numerical attributes, while data mining can handle both numerical and categorical attributes. Moreover, data mining techniques are generally easy to use. A combination of data mining and statistical techniques can produce a more efficient data analysis.
1.3 Scope of Data Mining
Data mining gets its name from the similarity between searching for valuable business information in a huge database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through a large amount of material or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing the following capabilities:
– Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive analysis can now be answered quickly and directly from the data. A typical example is targeted marketing, which uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
– Automated discovery of previously unknown patterns. Data mining tools analyze databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
Data mining techniques can deliver the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products are developed. When data mining tools run on high-performance parallel processing systems, they can analyze huge databases in minutes. Faster processing lets users automatically experiment with more models to understand complex data, and high speed makes it practical to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
1.4 Common Technologies used in Data Mining
The most commonly used techniques in data mining are:
– Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
– Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
– Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
– Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique.
– Rule induction: The extraction of useful if-then rules from data based on statistical significance.
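As an illustration of the nearest neighbor method described in the list above, the following is a minimal sketch rather than a production implementation. It assumes numeric attributes, Euclidean distance, and a simple majority vote as the way of combining the classes of the k most similar records. The function name knn_classify, the customer attributes, and the segment labels are hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def knn_classify(query, history, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    `history` is a list of (features, label) pairs with numeric feature tuples.
    """
    # Order the historical records by Euclidean distance to the query record.
    by_distance = sorted(history, key=lambda rec: math.dist(query, rec[0]))
    # Combine the classes of the k most similar records by majority vote.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset:
# (monthly_visits, average_basket_value) -> customer segment
history = [
    ((1, 20.0), "low"),
    ((2, 25.0), "low"),
    ((3, 30.0), "low"),
    ((10, 120.0), "high"),
    ((12, 150.0), "high"),
    ((11, 130.0), "high"),
]

print(knn_classify((9, 110.0), history, k=3))  # the three nearest records are all "high"
```

In practice the attributes would usually be rescaled first, since an attribute with a large numeric range otherwise dominates the distance.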
1.5 Applications of Data Mining
Advances in information processing technology and storage capacity mean that huge amounts of data are being collected and processed in almost every sector of life. Business organizations collect consumer data for marketing and to improve business strategies, medical organizations collect medical records for better treatment and medical research, and national security agencies maintain criminal records for security purposes.
Supermarket chains and department stores typically capture every sales transaction made by their customers. For example, Wal-Mart Stores Inc. captures sales transactions from more than 2,900 stores in 6 countries and continuously transmits these data to its massive data warehouse, which is the biggest in the retail industry, if not the biggest in the world [3, 12, 14, 15]. According to Teradata, Wal-Mart plans to expand this warehouse further, reportedly to a capacity of 500 terabytes. Wal-Mart allows more than 3,500 suppliers to access its data set and perform various data analyses.
Organizations all over the world use a variety of data mining techniques to analyze such huge data sets successfully. For example, Wal-Mart uses its data set for trend analysis. Modern organizations depend heavily on data mining in their everyday activities. Data mining techniques extract useful information, which is in turn used for purposes such as marketing products and services, identifying the best target groups, and improving business strategies.
Some Applications of Data Mining Techniques
There is a wide range of data mining applications. A few of them are discussed below.
1.5.1 Medical Data Analysis
Medical data sets generally contain a wide variety of biomedical data distributed among several parties. Examples of such databases include genome and proteome databases. Data mining tasks such as data cleaning, data preprocessing, and semantic integration can be used to construct a warehouse for these medical databases and to analyze them usefully.
Data mining techniques can be used to analyze gene sequences in order to find the genetic factors of a disease and the mechanisms that protect the body from it. A disease can be caused by a disorder in a single gene; in most cases, however, disorders in a combination of genes are responsible. Data mining techniques can be used to identify such a combination of genes in a target sample. Data sets of patient records can also be analyzed through data mining for other purposes, such as predicting diseases in new patients. Data mining is also used in the composition of drugs tailored to an individual’s genetic structure.
1.5.2 Direct Marketing
In the direct marketing approach, a company delivers promotional material such as leaflets, catalogs, and brochures directly to potential consumers through direct mail, telephone marketing, door-to-door selling, or other direct means. Identifying potential consumers with reasonable precision is crucial to keeping the company’s marketing expenditure down. Many companies and organizations, including People’s Bank, Reader’s Digest, the Washington Post, and Equifax, use data mining techniques to identify potential consumers.
1.5.3 Trend Analysis
Trend analysis is commonly used in stock market studies, where the essential task is so-called bull and bear trend analysis. A bull market is a situation in which prices rise consistently for a prolonged period of time, whereas a bear market is the opposite situation. Financial institutions need to understand and predict customer deposit and withdrawal patterns. Supermarket chains need to identify customers’ buying trends and association rules (i.e., which items are likely to be sold together). Wal-Mart is one of the many organizations that use data mining for trend analysis.
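The idea of association rules (which items are likely to be sold together) can be sketched with a small example. The sketch below is an assumption-laden illustration, not the method used by any organization named above: it considers only pairs of items and scores each rule with the two standard measures of support (the fraction of transactions containing both items) and confidence (the fraction of transactions containing the first item that also contain the second). The function name pair_rules, the min_support threshold, and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for pairwise rules "a -> b"."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicate items within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Confidence of "a -> b" is P(b | a); the reverse rule uses P(a | b).
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical sales transactions from an electronics store
baskets = [
    ["computer", "printer", "paper"],
    ["computer", "printer"],
    ["computer", "mouse"],
    ["printer", "paper"],
]

rules = pair_rules(baskets, min_support=0.5)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Real association rule mining handles item sets of arbitrary size and far larger data, so it relies on pruning strategies that this pairwise sketch leaves out.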
1.5.4 Fraud Detection
Fraudulent credit card use costs the industry over a billion dollars a year [2, 8]. Almost all financial institutions, such as MasterCard, Visa, and Citibank, use data mining techniques to discover patterns of fraudulent credit card use. The use of data mining techniques has already started to reduce these losses. The Guardian (September 9, 2004) reported that losses due to credit card fraud fell by more than 5% in Great Britain in 2003. Mobile phone fraud is similarly common all over the world. According to Ericsson, more than 15,000 mobile phones are stolen every month in Britain alone. Data mining techniques are also used to prevent fraudulent users from stealing mobile phones and leaving bills unpaid.
1.5.5 Plagiarism Detection
Assignments submitted by students can be characterized by several attributes. For a programming assignment, for example, the attributes can include the run time of the program, the number of integer variables used, the number of instructions generated, and so on. Based on these attribute values, the submissions are analyzed through clustering, which groups similar submissions together. Submissions that are clustered together can be suspected of plagiarism.
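A rough sketch of this clustering idea follows. The text does not say which clustering algorithm is used, so the sketch assumes a simple distance-threshold grouping in the single-linkage style: two submissions fall into the same cluster if a chain of sufficiently similar submissions connects them. The function name cluster_submissions, the threshold value, the student names, and the attribute values (assumed to be already scaled to the range 0 to 1) are all hypothetical.

```python
import math


def cluster_submissions(submissions, threshold):
    """Group submissions whose attribute vectors are linked by distances <= threshold."""
    names = list(submissions)
    parent = {name: name for name in names}  # union-find forest over submissions

    def find(name):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Merge the clusters of every pair of submissions that are close enough.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(submissions[a], submissions[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in clusters.values())


# Hypothetical scaled attributes: (run time, integer variables, instructions generated)
submissions = {
    "alice": (0.42, 0.30, 0.55),
    "bob": (0.43, 0.30, 0.56),
    "carol": (0.90, 0.70, 0.20),
    "dave": (0.10, 0.95, 0.80),
}

clusters = cluster_submissions(submissions, threshold=0.05)
# Only clusters with more than one member point to possible plagiarism.
suspects = [group for group in clusters if len(group) > 1]
print(suspects)
```

A shared cluster is only a reason to look more closely at the submissions, since similar attribute values can also arise independently.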
There is a rapidly growing body of successful applications in areas as diverse as analysis of organic compounds, automatic abstracting, credit card fraud detection, financial forecasting, and medical diagnosis. Some examples of applications (potential or actual) are:
– A supermarket chain mines its customer transaction data to optimize targeting of high-value customers.
– A credit card company can use its data warehouse of customer transactions for fraud detection.
– A major hotel chain can use survey databases to identify attributes of a ‘high-value’ prospect.