Big Data


1.1 Problem Statement
The volume, variety and velocity of data are increasing day by day. This growth leads to the generation of big data, and with existing techniques it is not easy to process such large volumes of data and mine the frequent patterns that exist in it.

Big Data is huge in volume, is captured at a fast rate, and may be ordered, unordered or sometimes an amalgamation of the two. These factors make Big Data difficult to capture, manage and mine using conventional or traditional methods.

1.2 Aim/Objective
Perform association rule mining and FP-Growth on big data from the e-commerce market to find frequent patterns and association rules among the item sets present in the database, using a reduced Apriori algorithm and a reduced FP-Growth algorithm on top of Mahout (an open-source library / Java API) built on the Hadoop MapReduce framework.

1.3 Motivation
Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze. This definition is deliberately subjective: it does not define big data as being larger than a certain number of terabytes or thousands of gigabytes. We assume that as technology advances over time, the volume of datasets that qualify as big data will also rise. The definition can also differ from sector to sector, depending on which kinds of software tools are normally available and what sizes of dataset are common in a particular industry. According to studies, big data in many sectors today ranges from a few dozen terabytes to thousands of terabytes.
' The velocity, variety and volume of data are growing day by day, which makes large amounts of data hard to manage.
' According to one study, around 30 billion pieces of content are shared on Facebook every month.

Issues/Problems while analysing Big Data:
Volume:
' According to analysis, every day more than one billion shares are traded on the New York Stock Exchange.
' According to analysis, every day Facebook saves two billion comments and likes
' According to analysis, every minute Foursquare manages more than two thousand Check-ins
' According to analysis, every minute TransUnion makes nearly 70,000 updates to credit files
' According to analysis, every second banks process more than ten thousand credit card transactions

Velocity:
We are producing data more rapidly than ever:
' Processes are more and more automated
' People are interacting online more and more
' Systems are more and more interconnected

Variety:
We are producing a variety of data including:
' Social network connections
' Images
' Audio
' Video
' Log files
' Product rating comments

1.4 Background
Big data[5][6] is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Gartner, and now much of the industry, continues to use this "3Vs" model for describing big data [7]. In 2012, Gartner updated its definition as follows: big data is high-velocity, high-volume and high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization [8]. Additionally, some organizations add a fourth V, "Veracity", to describe it.
Big data has evolved into a very important factor of production in the economy and in technology. Like other important factors of production such as hard assets and human capital, much of present economic activity simply could not take place without it. Looking at the current position of departments in the US economy, companies with roughly one thousand workers store a minimum of 200 TB of data on average (twice the size of the US retailer Wal-Mart's data warehouse in 1999), and many departments store more than 1 petabyte (PB) per organization. The growth of big data will continue to reach new heights thanks to modern technologies, platforms, their analytical capabilities for handling large amounts of data, and the large number of upcoming users.

Utilization of Big Data Will Become a Key Basis of Competition and Growth for Individual Firms:
The usage of big data has become an important means for leading firms to improve their data handling. Taking a retail company as an example, the company can increase its operating margin by approximately 60% by embracing its big data. Leading retailers such as the UK's TESCO use big data to defend their market revenue share against local competitors.
The emergence of big data can also create new growth opportunities for companies that both combine and analyze industry data. Even companies that sit at the mid-point of large flows of information about the objectives and demands of their users, services, buyers, products and suppliers can capture and analyze this data easily using big data.
Deploying big data in a firm or company facilitates better and more enhanced analysis of data and its outcomes, leading to lower product prices, higher quality and a better match between the company's offerings and customers' needs. The step toward the acceptance of big data can therefore improve consumer surplus and accelerate performance across all companies.


Figure1.1: Types of data generated

Figure1.2: Impact of Big Data
Significance of Big Data:
Government sector:
' The Obama administration has announced a big data R&D initiative, which is very useful for handling several of the obstacles and problems the government faces nowadays. The initiative comprises 84 big data programs across 6 different departments.
' Big data analysis played a big role in Obama's successful 2012 re-election campaign.

Private sector:
' eBay.com uses two data warehouses of 7.5 petabytes and 40 petabytes, as well as a 40-petabyte Hadoop cluster, for search, recommendations and merchandising.
' Every day, Amazon.com also handles millions of back-end operations as well as queries from more than half a million third-party sellers.
' Walmart processes more than 1 million customer transactions every hour, which are imported into databases and then analyzed.
' Facebook stores and processes 50 billion photos from its users.
' FICO's Falcon Credit Card Fraud Detection System handles and protects 2.1 billion active accounts worldwide.
' According to estimates, the volume of business data stored by companies doubles every 1.2 years.
' Windermere Real Estate uses GPS signals from nearly 100 million drivers to help new home seekers determine their typical drive times to and from work at various times of the day.

What is Hadoop?
' Hadoop is an open-source (free) software framework or technology for processing huge datasets for certain kinds of problems on a distributed system.
' Hadoop is open-source software that mines or extracts sequential and non-sequential big data for a company and then integrates that big data with the existing business intelligence ecosystem.
' Hadoop works on the important principle called MapReduce (map tasks and reduce tasks): MapReduce divides the input dataset into a number of independent pieces which are then processed in parallel by the map tasks.
' The output generated by the map tasks is sorted by the framework and then becomes the input to the reduce tasks.
' A file system is used to store the input and the output of the jobs.
' Because some tasks fail during execution, the framework takes care of monitoring tasks, scheduling tasks and re-executing failed tasks.

History of Hadoop:
' The history of Hadoop began at none other than Google, one of the most important organizations in the world.
' Google published two academic research papers, on the Google File System (GFS) and on MapReduce, in 2003 and 2004.
' Combined, these two technologies provide a good platform for processing huge amounts of data in a well-organized and effective manner.
' Doug Cutting played a very important role in developing Hadoop, an open-source framework that provides implementations of MapReduce and of a distributed file system modeled on the Google File System.
' Doug Cutting had been working on the elements of an open-source web search engine called Nutch, which closely resembled the MapReduce and Google File System technologies published in Google's research papers.
' In this way Hadoop was born, although when it was first developed it was named as a subproject of Lucene.
' The Apache Software Foundation later made some changes and named it Apache's open-source framework Hadoop, which can be used for processing Big Data in less time.
' Doug Cutting was later hired by another big company, Yahoo. He and other Yahoo employees contributed a lot to Hadoop; after some time Doug Cutting moved to Cloudera, while others from his team were hired by an organization called Hortonworks.
' Still, we can say that Yahoo has made the biggest contribution to the development of Hadoop.
What is Apache Mahout?
' Mahout is an API, or rather a library, of scalable machine-learning or collective-intelligence algorithms (classification, clustering, collaborative filtering and frequent pattern mining) that is mainly used for mining frequent item sets: it takes a group of item sets and identifies which individual items usually appear together.
' When the size of the data is too large, Mahout is used as one of the best machine-learning tools, because algorithms such as clustering, pattern mining and collaborative filtering have been implemented in Mahout in a way that produces outputs quickly when run on top of Hadoop.

History of Mahout:
' The life of Mahout started in 2008, when it was treated as a sub-project of one of Apache's major projects, the Apache Lucene project.
' Techniques such as search, text mining and information retrieval were implemented by the Apache Lucene project.
' Some members of the Lucene project were working on the same machine-learning areas, so they contributed to a separate project named Mahout, which works on the principle of predicting the future on the basis of the past.
' The algorithms in Mahout were implemented not only in the conventional way but also in such a way that the Mahout framework and its algorithms can easily process large amounts of data while running on top of Hadoop.
The next section presents a brief introduction to each algorithm that has been implemented on Mahout.

Collaborative Filtering:
' Collaborative filtering is the technique of filtering out important data from the large amount of data that users browse, prefer and rate. In other words, collaborative filtering is the process of generating predictions on the basis of users' past behavior or history and recommending to users the top predicted items (the top 'N' recommendations) so that they may be helpful in their future decisions.
' Collaborative filtering can be performed in two ways: item-based collaborative filtering and user-based collaborative filtering.
' User-based collaborative filtering finds neighbors with tastes similar to the user's in the large database of user preferences and then generates recommendations for the user. However, users' likes and dislikes are not static, so the recommendations generated with this technique are not very effective, and a bottleneck problem also occurs. Item-based collaborative filtering is therefore used these days to generate recommendations for a user: it removes the bottleneck problem by first finding, in the large pool of items, the items most similar to those the user has liked, and then generating the recommendations.
' Item-based collaborative filtering works on the principle that similarity among items remains static while users' likes and dislikes may change, so this technique generates better-quality recommendations than the user-based collaborative filtering algorithm.

Association-Rule-Mining
' Association rule mining is the technique used to find rules on the basis of which the growth of an organization can be increased.
' There are a number of algorithms for finding frequent patterns in a large dataset; on the basis of these frequent patterns we can generate rules that are really helpful for increasing the turnover of an organization.

Architecture of Map-Reduce:
Google published a paper on an idea named MapReduce in 2004 that serves as an architecture. The MapReduce [9] architectural framework models parallel processing, and its implementations are used to process large amounts of stored data. Using this technology, the requested query is split into sub-queries which are distributed among several parallel sites and processed in parallel; this is the 'map' step. The results obtained are then combined and delivered, which is the 'reduce' step. This framework was extremely successful; in fact, others wanted to replicate it, and consequently an implementation of the MapReduce framework named Hadoop was adopted as an Apache open-source project.


Figure1.3: MAP-Reduce Flow

Existing Techniques and Technologies:
Several technologies have been developed and adapted to aggregate, manipulate, analyze and visualize huge quantities of data. These techniques and technologies draw on numerous areas, including computer science, applied mathematics, economics and statistics.

A number of these technologies and techniques were developed in a world with access to far smaller volumes and varieties of data, but they have been effectively adapted so that they can be applied to very big or more heterogeneous datasets.
Big data needs outstanding technologies to efficiently process large amounts of data within tolerable elapsed times. A 2011 report on big data suggests that suitable big data techniques include:
A/B Testing: A technique in which a control group is compared with different test groups to determine which changes improve a given objective.
Association Rules: A set of techniques used to find significant relationships, i.e. association rules, between variables in large data stores. A number of algorithms exist within this family to generate and test possible rules.
Classification: A technique mainly used to classify the items present in a dataset, usually in order to predict the class of an item using its other attributes.
For example: predicting the weather on the basis of the previous day's weather.
Cluster Analysis: A technique used to group objects so that objects with similar properties fall into the same cluster while dissimilar objects fall into different clusters. It is a type of unsupervised learning because no training data are used, in contrast to classification, a data mining technique known as supervised learning.
Data Fusion and Data Integration: Techniques that gather data from several sources and then analyze it jointly in order to produce insights in a way that is more efficient and potentially more accurate.
Machine Learning: A branch of computer science, generally considered part of artificial intelligence, that is concerned with the design and improvement of algorithms which allow computer systems to adapt their behavior on the basis of empirical data.
Natural Language Processing (NLP): A collection of techniques from computer science (artificial intelligence and linguistics) for processing natural language, comprising a number of algorithms for analyzing human language.
Sentiment Analysis: An application of natural language processing (NLP) and other analytic techniques to identify and extract subjective information from text inputs. Important aspects of this analysis include identifying the product, aspect or feature being discussed.
Big Data Technologies:
There are emerging technologies that can be applied to modify, analyze, aggregate and read big data.
Big Table: A proprietary distributed database system built on the Google File System (GFS); it was the inspiration for HBase.
Business Intelligence (BI): A type of application software built to report, analyze and present data. BI tools normally analyze data that has previously been stored in a data mart or data warehouse.
Cassandra: An open-source database management system (DBMS) specifically designed to handle large quantities of data on a distributed system.
Dynamo: A proprietary distributed data storage system (DDSS) developed by Amazon.
Google File System: A proprietary distributed file system developed by Google; part of the inspiration for Hadoop.
HBase: An open-source, non-relational, distributed database modeled on Google's Big Table. The project was initially developed by Powerset but is now managed by the Apache Software Foundation as part of Hadoop.
MapReduce: A software framework introduced by Google for processing large datasets for specific kinds of queries over data stored at distributed sites.
R: An open-source programming language and software environment for statistical computing and graphics.
Relational Database: A database made up of a collection of tables whose data is stored in rows (tuples) and columns. A Relational Database Management System (RDBMS) stores structured data in this form, and SQL is the standard language for managing relational databases.

CHAPTER 2
Literature Review:
1] A.Pradeepa, A.S.Thanamani. 'Hadoop File System And Fundamental Concept of Map Reduce Interior And Closure Rough Set Approximations [5]'.
In this paper the authors describe big data mining and knowledge discovery as huge challenges because the volume of data is growing at an unprecedented scale. MapReduce has been applied to achieve many large-scale computations, and the recently introduced MapReduce technique has received much consideration and attention from both industry, for its applicability, and the scientific community, for big data analysis. For mining and finding knowledge in big data, the authors present a MapReduce algorithm based on rough set theory, put forward to deal with massive amounts of data, and they measure its performance on large datasets to show that the proposed work can effectively process big data and find results in less time.

2] Md.R.Karim1, A.Hossain, Md.M.Rashid. 'An Efficient Market Basket Analysis Technique with Improved Map Reduce Framework on Hadoop [6]'.
In this paper the authors describe market basket analysis techniques as considerably important for everyday business decisions because of their ability to mine customers' purchase rules by discovering which items they buy frequently and together. Traditional single-processor, main-memory-based computing is not capable of handling ever-growing huge transactional data. In this paper an effort has been made to remove these limitations: the authors first eliminate null transactions and rare items from the segmented dataset and then apply their proposed HMBA algorithm, using the ComMapReduce framework on Hadoop, to generate the complete set of maximal frequent itemsets.

3] J.W. Woo, Yuhang Xu. 'Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing [7]'.
In this paper the authors explain the MapReduce approach, which has been very popular and effective for computing enormous volumes of data since Google implemented its platform on the Google Distributed File System (GDFS) and Amazon Web Services (AWS) began providing its services on a platform with Apache Hadoop.

4] J.W. Woo, S.Basopia, Yuhang Xu. 'Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop [8]'.
In this paper the authors present a new schema, based on HBase, which is used to process transaction data for the market basket analysis algorithm. The algorithm runs on Apache Hadoop MapReduce and reads data from HBase and HDFS; the transaction data is converted and sorted into (key, value) pairs, and after the whole process completes, the results are stored back to HBase or the Hadoop Distributed File System (HDFS).

5] D.V.S.Shalini, M.Shashi and A.M.Sowjanya. 'Mining Frequent Patterns of Stock Data Using Hybrid Clustering[9]'.
In this paper the authors describe how classification and pattern discovery in stock market or inventory data are really significant for business support and decision making. They also propose a new algorithm for mining patterns from large amounts of stock market data in order to guess the factors that affect or decrease a product's sales. To improve execution time, the proposed system uses two efficient clustering methods, PAM (Partitioning Around Medoids) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), along with MFP. PAM, a well-organized iterative clustering approach, is used to initialize the clustering and is then combined with a frequent pattern mining (FPM) algorithm.

6] W.Wei, S.Yu, Q.Guo, W.Ding and L.Bian. 'An Effective Algorithm for Simultaneously Mining Frequent Patterns and Association Rules[10]'.
According to the authors of this paper, algorithms such as Apriori and FP-Growth break the problem of mining association rules into two separate sub-problems: first find the frequent patterns, then generate the required rules. To address this, they take a deep look at the FP-Growth algorithm and propose an effective FP-tree-based algorithm called AR-Growth (Association Rule Growth), which can concurrently discover frequent itemsets and association rules (AR) in a large database.

7] J.W.Woo. 'Apriori-Map/Reduce Algorithm [11]'.
In this paper the authors present a number of methods for converting sequential algorithms into corresponding MapReduce algorithms. They also describe a MapReduce version of the legacy Apriori algorithm, which is commonly used to collect frequent itemsets and to compose association rules in data mining. They show theoretically that the proposed algorithm provides high-performance computing that depends on the number of MapReduce nodes.

8] L.Hualei, L.Shukuan, Q.Jianzhong, Y.Ge, L.Kaifu. 'An Efficient Frequent Pattern Mining Algorithm for Data Stream [12]'.
In this paper the authors propose a novel structure, the NC-Tree (New Compact Tree), which can re-code and filter the original data to compress the dataset. At the same time, a new frequent pattern mining algorithm based on it is introduced, which can update and adjust the tree more efficiently. There are mainly two kinds of algorithms used to mine frequent itemsets with the frequent pattern mining approach: Apriori, which is based on candidate generation and testing, and FP-Growth, which is based on divide and conquer; both have been widely used in static data mining. For data streams, frequent pattern mining algorithms must have a strong ability to update and adjust in order to further improve their efficiency.
9] S.K Vijayakumar, A. Bhargavi, U. Praseeda and S. A. Ahamed. 'Optimizing Sequence Alignment in Cloud using Hadoop and MPP Database'[Sequence Alignment].
In this paper the authors discuss sequence alignment for bioinformatics big data.
The size of data in the field of bioinformatics is growing day by day, so it is not easy to process it and find the important sequences present in it using existing techniques. The authors discuss two new technologies for storing and processing large amounts of data: Hadoop and Greenplum. Greenplum is a massively parallel processing technique used to store petabytes of data. Hadoop is also used to process huge amounts of data because it too is based on parallel processing and generates results in much less time than existing technologies. The authors also describe their proposed algorithm for sequence alignment, FASTA.

10] S.Mishra, D.Mishra and S.K.Satapathy. 'Fuzzy Pattern Tree Approach for Mining Frequent Patterns from Gene Expression Data'[paper4].
In this paper the authors focus on frequent pattern mining of gene expression data. Frequent pattern mining has become a widely discussed and active research area in the last few years, and a number of algorithms exist for mining frequent patterns from a dataset. In this paper the authors apply a fuzzification technique to the dataset and then apply a number of techniques to find more meaningful frequent patterns.

11] L.Chen, W.Liu. 'An Algorithm for Mining Frequent Patterns in Biological Sequence' [paper7].
In this paper the authors describe how the existing techniques used to mine frequent patterns from large amounts of biological data are inefficient and time-consuming. They propose a new technique called Frequent Biological Pattern Mining (FBPM) to mine frequent patterns from large amounts of biological data. They also compare the existing and proposed techniques in terms of execution time for finding frequent patterns and the number of patterns mined.
12] B.Sarwar, G.Karypis, J.Konstan, and J.Riedl. 'ItemBased Collaborative Filtering Recommendation Algorithms'.
In this paper the authors discuss recommendation systems and describe various techniques for developing a good recommendation system that can generate the best recommendations for users. Recommendation systems predict the future by applying collaborative filtering algorithms to users' past activities. The two most popular collaborative filtering techniques generally used to predict items that will be helpful for a user's next purchase are the item-based and user-based collaborative filtering algorithms. The item-based algorithm works on the principle of comparing the similarities between items and suggests to the user those items which are quite similar to his/her taste. The user-based algorithm, on the other hand, works on the principle of finding the nearest neighbors of the target user, i.e. users who agree on similar items in terms of ratings, and suggests to the target user those items which are liked by his/her nearest neighbors.

CHAPTER 3
Design and Implementation
3.1 Proposed Methodology:
In line with the dissertation title, the goal is to find frequent patterns and, on the basis of those frequent patterns, suggest recommendations to the user using a frequent pattern mining algorithm, Hadoop and Mahout.
1. First of all, collect a real-time dataset from an e-commerce website.
2. Once the dataset has been collected, the next step is to clean it. Cleaning the dataset means removing the unwanted fields and converting the dataset into the desired format.
3. After converting the dataset into a meaningful format, write a Java program that can read the dataset and generate frequent patterns and association rules from the data.
4. To find the frequent patterns in the dataset, apply the reduced Apriori algorithm and create a MapReduce program that implements it.
5. Combine the program with Hadoop to find the frequent patterns in less time than by executing the program in Eclipse alone.
6. Run the dataset through Mahout on top of Hadoop in a distributed environment to find recommendations using the collaborative filtering approach.
7. Compare the execution time for finding frequent patterns and association rules using Hadoop and Mahout against a simple Java program.

3.2 Proposed Architecture:

Figure3.1: Proposed Architecture
3.3 Implementation
Technology used:
What is Hadoop?
' Hadoop is an open-source (free) software framework or technology for processing huge datasets for certain kinds of problems on a distributed system.
' Hadoop is open-source software that mines or extracts sequential and non-sequential big data for a company and then integrates that big data with the existing business intelligence ecosystem.
' Hadoop works on the important principle called MapReduce (map tasks and reduce tasks): MapReduce divides the input dataset into a number of independent pieces which are then processed in parallel by the map tasks.
' The output generated by the map tasks is sorted by the framework and then becomes the input to the reduce tasks.
' A file system is used to store the input and the output of the jobs.
' Because some tasks fail during execution, the framework takes care of monitoring tasks, scheduling tasks and re-executing failed tasks.

Parts of Hadoop:
' Hadoop has a number of sub-components, but its two core components are MapReduce (used for processing) and the Hadoop Distributed File System (HDFS), modeled on the Google File System (GFS) (used for storage).
' HDFS and MapReduce are designed with the idea that both can be deployed on a single cluster, so that the processing system and the storage system can work together.

Hadoop Distributed File System (HDFS):
' HDFS stands for Hadoop Distributed File System; it is the file system used by Hadoop, has a distributed nature and provides high throughput.
' HDFS is a distributed file system because it distributes the data across a number of nodes so that, in case of failure, the data can easily be recovered.
' HDFS stores the data on a number of data nodes after dividing the whole dataset into data blocks.
' The default block size is 64 MB, but it is configurable according to the size of the data to be processed with Hadoop.
' HDFS maintains replicas of each data block across multiple data nodes so that in case of any failure the data can be recovered and processing does not stop. A minimal client-side sketch of interacting with HDFS from Java is shown below.
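The following is a minimal Java sketch of how a client program talks to HDFS. It assumes a Hadoop 1.x installation whose NameNode listens on hdfs://localhost:54310 (the address configured in core-site.xml later in this report); the file paths are hypothetical examples, not the project's actual dataset.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (address assumed from core-site.xml).
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:54310");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (both paths are illustrative).
        Path local = new Path("/tmp/wordcount.txt");
        Path remote = new Path("/user/hduser/wordcountexample/wordcount.txt");
        fs.copyFromLocalFile(local, remote);

        // Ask HDFS which replication factor and block size it assigned to the file.
        FileStatus status = fs.getFileStatus(remote);
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size : " + status.getBlockSize() + " bytes");

        fs.close();
    }
}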

Benefits of HDFS Data Block Replication:
' Availability: there is very little chance of losing data if a particular data node fails.
' Reliability: there are a number of replicas of the data, so if the data at a particular node becomes corrupted it can easily be corrected.
' Performance: data is always available to the reducers for processing because multiple copies exist, which also increases performance.

Nodes in HDFS:
' Name Node:
' The name node is the master node among all the nodes; it maintains and allocates the names of all the data blocks created by HDFS.
' The name node is also very helpful for managing the data blocks present on each data node.
' The name node also monitors all the other nodes throughout the processing of data, from HDFS through MapReduce to output generation.
' The name node also maintains information about the location of each file stored in HDFS.
' In other words, the name node maintains data about data, i.e. the metadata, in HDFS.
' For example:
File A is present on Data Node 1, Data Node 2, and Data Node 4.
File B is present on Data Node 3, Data Node 4, and Data Node 5.
File C is present on Data Node 1, Data Node 4, and Data Node 3.
' The name node is a single point of failure for the Hadoop cluster.

' Data Node:
' Data nodes are the slave nodes of the master node (the name node) and are generally used by HDFS to store the data blocks.
' Data nodes are basically responsible for serving read requests and write requests from the client's file system.
' Data nodes also perform the very important functions of creating data blocks and replicating them on the basis of instructions provided by the name node.

' Secondary Name Node:
' The secondary name node regularly clears out non-essential state held by the name node to help prevent it from failing.
' The secondary name node creates checkpoints of the file system metadata held by the name node.
' Despite its name, the secondary name node is a checkpointing helper for the name node rather than a standby backup, so it does not remove the name node as a point of failure.

Map-Reduce:
' MapReduce is a framework, or rather a programming concept, used by the Apache organization in its product Hadoop for processing large amounts of data.
' Map and reduce functions with almost the same functionality as in the Java language exist in nearly every programming language.
' MapReduce was designed to process large amounts of data by dividing the work into two parts: a map part and a reduce part.
' The map function is used to transform, parse and filter data, and it produces output that is treated as input for the reducer.
' The reduce function takes the output generated by the map function as its input and sorts or combines the data to reduce its complexity.
' Both the map and reduce functions work on the principle of (key, value) pairs.
' The map function takes input from a data node and, with the help of the mapper, divides the data into keys and values, e.g. (key1, value11), (key2, value12) and (key1, value21), (key2, value22).
' The combiner then summarizes the data and groups the values belonging to a particular key: (key1, <value11, value21>) and (key2, <value12, value22>).
' The reduce function then reduces the output generated by the combiner and produces the final output: (key1, value1), (key2, value2). A minimal code sketch of this flow is given after this list.
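To make the (key, value) flow concrete, here is a minimal word-count sketch written against the Hadoop Java MapReduce API (the org.apache.hadoop.mapreduce classes). The class names and input/output paths are illustrative only; this is not the reduced Apriori implementation described elsewhere in this report.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // e.g. (key1, value11), (key2, value12), ...
                }
            }
        }
    }

    // Reduce step: the framework groups the values by key, so just sum them up.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // final (key, value) output
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // combiner summarizes map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}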

Nodes in Map-Reduce:
The complete MapReduce operation comprises two important daemons: the Job Tracker, which is known as the master node, and the Task Tracker, which is known as the slave node.

' Job-Tracker:
' The Job Tracker usually takes requests or tasks from the client and assigns these tasks to Task Trackers running on top of data nodes; the Task Trackers then perform the tasks with the help of the data nodes.
' The Job Tracker always tries to assign a task to the Task Tracker on top of the data node where the data is already available in the local repository.
' When it is not possible for the Job Tracker to assign a task to the Task Tracker on that data node, the Job Tracker tries to assign the task to a Task Tracker in the same rack.
' If a node failure occurs, or the Task Tracker that was processing a task fails, the Job Tracker assigns the same task to another Task Tracker where a replica of the same data exists, because data blocks are replicated across multiple data nodes.
' In this way the Job Tracker guarantees that if a Task Tracker stops working, it does not mean that the job fails.

' Task-Tracker:
' The Task Tracker is the slave node of the master node (the Job Tracker); it takes task-processing requests from the Job Tracker and processes the tasks (map, reduce and shuffle) using the data present in the data blocks of the data nodes.
' Every Task Tracker is composed of a number of slots, which means it can process several tasks at the same time.
' Whenever a task has to be scheduled, the Job Tracker first checks for an empty slot on the server hosting the data node that holds the data; if such a slot exists, the Job Tracker gives the task to that Task Tracker, otherwise it looks for an empty slot on a machine in the same rack.
' Each Task Tracker sends a heartbeat message every few seconds to inform the Job Tracker that it is alive and processing tasks.
' Each Task Tracker runs its tasks in its own JVMs; if one Task Tracker stops working, the Job Tracker is informed and allocates another Task Tracker for that task, while all the other Task Trackers keep working simultaneously without any interruption.
' After a task has been completed, the Task Tracker informs the Job Tracker.

Figure3.2: Components of Hadoop
What is Apache Mahout?
' Mahout is an API, or rather a library, of scalable machine-learning or collective-intelligence algorithms (classification, clustering, collaborative filtering and frequent pattern mining) that is mainly used for mining frequent item sets: it takes a group of item sets and identifies which individual items usually appear together.
' When the size of the data is too large, Mahout is used as one of the best machine-learning tools, because algorithms such as clustering, pattern mining and collaborative filtering have been implemented in Mahout in a way that produces outputs quickly when run on top of Hadoop.
Collaborative Filtering:
' Collaborative filtering is the technique of filtering out important data from the large amount of data that users browse, prefer and rate. In other words, collaborative filtering is the process of generating predictions on the basis of users' past behavior or history and recommending to users the top predicted items (the top 'N' recommendations) so that they may be helpful in their future decisions.
' Users' preferences for different sets of items can come from explicit ratings or from implicit ratings.
' Explicit rating: the user states his/her preference by giving a rating to a particular product or item on a certain scale.
' Implicit rating: the user's preferences are inferred from the user's interactions with products.
' With the help of collaborative filtering we can predict or forecast the future on the basis of users' past activities or patterns.
' To predict the future on the basis of users' past activities, we first create a database of the users' preferences for items and then apply algorithms such as nearest neighborhood to predict a user's future preferences on the basis of neighbors who have the same perception or taste.
' As the size of the data is increasing on a daily basis, the main challenge is that we require algorithms that can process millions of records and match a user's preferences with all the other neighbors present in the database to obtain better predictions in less time.
' The second challenge usually noticed while implementing collaborative filtering is that the items or recommendations offered to a user should be of good quality, so that he/she is likely to like the recommended products.
' The two issues mentioned above are the biggest challenges to keep in mind while performing collaborative filtering; the recommendations suggested to the user must be of good quality.
' Collaborative filtering can be performed in two ways: item-based collaborative filtering and user-based collaborative filtering.
' User-based collaborative filtering finds neighbors with tastes similar to the user's in the large database of user preferences and then generates recommendations for the user. However, users' likes and dislikes are not static, so the recommendations generated with this technique are not very effective, and a bottleneck problem also occurs. Item-based collaborative filtering is therefore used these days to generate recommendations for a user: it removes the bottleneck problem by first finding, in the large pool of items, the items most similar to those the user has liked, and then generating the recommendations.
' Item-based collaborative filtering works on the principle that similarity among items remains static while users' likes and dislikes may change, so this technique generates better-quality recommendations than the user-based collaborative filtering algorithm. A short Mahout-based sketch of both approaches is given after this list.
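As an illustration, the sketch below uses Mahout's Taste recommender API to build both an item-based and a user-based recommender from a ratings file. The file name ratings.csv (with lines of the form userID,itemID,rating), the neighborhood size of 10 and the choice of Pearson correlation as the similarity measure are assumptions made for this example, not the settings used in this dissertation.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Ratings file with lines "userID,itemID,rating" (hypothetical path).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Pearson correlation can serve both as an item similarity and as a user similarity.
        PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Item-based recommender: similarity is computed between items.
        Recommender itemBased = new GenericItemBasedRecommender(model, similarity);

        // User-based recommender: first find the 10 nearest neighbors of the target user.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender userBased = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1 from each recommender.
        List<RecommendedItem> fromItems = itemBased.recommend(1, 3);
        List<RecommendedItem> fromUsers = userBased.recommend(1, 3);
        System.out.println("Item-based: " + fromItems);
        System.out.println("User-based: " + fromUsers);
    }
}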

Item-Based Collaborative-Filtering ('Model-Based' Collaborative Filtering)
' The item-based collaborative filtering algorithm is one of the best algorithms used by recommendation systems. To generate recommendations with this algorithm, we first take the set of items that the user has rated earlier; from this we find a set of n items {I1, I2, ..., In} most similar to the target item i, and the similarity {S1, S2, ..., Sn} of each item in this set is calculated. After computing the similarities, we calculate the weighted average of the user's ratings on the set of similar items to find the best recommendations for the target user.
' Prediction computation and similarity computation are thus the two techniques used to find future predictions and similarities among a number of items; the weighted-sum prediction is written out below.
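In standard notation (the usual weighted-sum form from the item-based collaborative filtering literature, e.g. Sarwar et al., rather than anything specific to this dissertation), the predicted rating of user u for the target item i over the set N of the n most similar items is:

P_{u,i} = \frac{\sum_{j \in N} s_{i,j} \, R_{u,j}}{\sum_{j \in N} |s_{i,j}|}

where s_{i,j} is the similarity between items i and j and R_{u,j} is user u's rating of item j.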

Similarity Computation for Items:
' Similarity computation is the technique used to find or compute the similarity between items in a large collection of items and to select the set of items having the highest similarity.
' The similarity between two items a and b is computed by first isolating the users who have rated both a and b, and then using a similarity measure to find the similarity S(a,b).
' Techniques that can be used for computing the similarity between items include cosine-based similarity, correlation-based similarity and adjusted cosine similarity.
Let us discuss all three techniques for finding the similarity between items:

Cosine-Based Similarity
' Cosine-based similarity is a technique used to find the similarity between two items; it treats the two items whose similarity is to be determined as two vectors in the n-dimensional user space.
' Similarity is measured as the cosine of the angle between the two vectors.
' The similarity between two items a and b can be denoted as:
Sim(a,b) = \cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert_2 \, \lVert \vec{b} \rVert_2}
Here '.' denotes the dot product of both the vectors

Correlation-Based Similarity:
' Correlation-based similarity is another technique used to find the similarity between two items a and b.
' To find the similarity between two items with this technique, we use the Pearson correlation method and compute the Pearson correlation between the two items, Corr(a,b).
' To make the value of the Pearson correlation more accurate, we first isolate the co-rated cases, i.e. the ratings of the users who rated both items a and b; the set of users who rated both items a and b is denoted by U.
Pearson correlation formula:
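In its standard form (the usual definition, not notation specific to this dissertation), the Pearson correlation between items a and b over the set U of users who rated both items is:

\operatorname{Corr}(a,b) = \frac{\sum_{u \in U} (R_{u,a} - \bar{R}_a)(R_{u,b} - \bar{R}_b)}{\sqrt{\sum_{u \in U} (R_{u,a} - \bar{R}_a)^2} \, \sqrt{\sum_{u \in U} (R_{u,b} - \bar{R}_b)^2}}

where R_{u,a} is the rating of user u for item a and \bar{R}_a is the average rating of item a.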

Adjusted-Cosine Similarity:
' Adjusted cosine similarity is another technique for calculating the similarity between items so that it can be used for prediction.
' The similarity between two items generated by the correlation-based technique does not take into account the difference in rating scale between different users when comparing co-rated pairs, so the results generated are not very accurate.
' Adjusted cosine similarity generates more accurate results than correlation-based similarity because it removes this drawback by subtracting the corresponding user's average rating from each co-rated pair, as written out below.
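Written out in the same notation (again the standard form from the item-based collaborative filtering literature), the adjusted cosine similarity replaces the item averages with each user's average rating \bar{R}_u:

\operatorname{Sim}(a,b) = \frac{\sum_{u \in U} (R_{u,a} - \bar{R}_u)(R_{u,b} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,a} - \bar{R}_u)^2} \, \sqrt{\sum_{u \in U} (R_{u,b} - \bar{R}_u)^2}}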

User-Based Collaborative-Filtering ('Memory-Based' Collaborative Filtering)
' User-based collaborative filtering is an algorithm that is basically used to generate future predictions for a user on the basis of his/her past history and of neighbors with similar tastes.
' User-based collaborative filtering works on the principle of generating recommendations for the user after finding, in the database of items and users, neighbor users who have assigned similar item ratings and have a similar purchase history.
' After the nearest neighbors with the same tastes have been found for the target user using the user-based collaborative filtering algorithm, further techniques are applied to find the top recommendations for the target user.
' In this way user-based collaborative filtering generates recommendations; it is also known as the memory-based or nearest-neighbor-based algorithm for finding or suggesting the best recommendations for the user.

User-Based Collaborative-Filtering challenges:
1. 'Scalability': as the number of users and items increases, the size of the user-item database also grows, so it takes a lot of time to find the nearest neighbors of a particular user in a large database containing millions of users and items. Scalability has therefore become a big challenge for generating recommendations.
2. 'Sparsity': a recommendation system working on the nearest-neighbor principle fails in certain circumstances; when the number of active users purchasing some product is very large, finding nearest neighbors for each active user is very difficult because of the sparsity of the rating data.

Association-Rule-Mining
' Association rule mining is the technique used to find rules on the basis of which the growth of an organization can be increased.
' There are a number of algorithms for finding frequent patterns in a large dataset; on the basis of these frequent patterns we can generate rules that are really helpful for increasing the turnover of an organization.
' Algorithms such as Apriori and FP-Growth are mostly used to find the frequent patterns and generate association rules, but if the size of the data is very large these two algorithms take more time to generate the rules, which decreases their efficiency.
' We therefore implemented both algorithms using the MapReduce technique and ran them on top of Hadoop to find frequent patterns and association rules.
' Apriori and FP-Growth [10] find the frequent patterns in a transactional dataset of the form {TID: itemset}, where TID is a transaction id and the itemset is the set of items bought in transaction TID. Alternatively, mining can also be performed on data presented in the format {item: TID set}.
' The Apriori algorithm significantly reduces the size of the candidate sets using the Apriori property, but it still suffers from two problems: (1) it generates a huge number of candidate sets, and (2) it repeatedly scans the database and checks the candidates by pattern matching.
' The FP-Growth algorithm mines the complete set of frequent itemsets without generating candidate sets.
' FP-Growth works on the divide-and-conquer principle.
' The first scan of the database derives a list of frequent items in which the items are ordered by descending frequency.
' According to this frequency-descending list, the database is compressed into a frequent-pattern tree (FP-tree), which retains the itemset association information.
' The FP-tree is mined by starting from each frequent length-1 pattern (as an initial suffix pattern), constructing its conditional pattern base (a sub-database consisting of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then constructing its conditional FP-tree and performing the mining recursively on that tree.
' Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree. The support and confidence measures used to evaluate the resulting rules are recalled after this list.
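For reference, these are the standard definitions, with support expressed as an absolute count of transactions (the convention used in the worked example that follows) and T denoting the set of transactions:

\operatorname{support}(X) = |\{\, t \in T : X \subseteq t \,\}|, \qquad \operatorname{confidence}(A \Rightarrow B) = \frac{\operatorname{support}(A \cup B)}{\operatorname{support}(A)}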
How Association Rule Mining Works
Consider the following small dataset of six transactions to see how association rule mining works:
T1: laptop, pen drive, speakers
T2: laptop, pen drive
T3: mobile, screen guard, mobile cover
T4: laptop, speakers, pen drive
T5: mobile, mobile cover, screen guard
T6: mobile, screen guard
From this small dataset we can find the frequent patterns, and on the basis of the frequent patterns we can generate some association rules using the FP-Growth algorithm.
A frequent pattern is a group of items that occurs frequently in the transactions of the dataset. While finding frequent patterns, we also record the support of each pattern. Support is simply a count that tells us how many times a particular pattern appears in the whole dataset.
Frequent patterns:
Laptop= 3
Pen drive = 3
Speakers = 2
Mobile = 3
Mobile cover = 2
Screen guard = 3
Laptop, pen drive = 3
Laptop, speakers = 2
Mobile, screen guard = 3
Mobile, mobile cover = 2
Laptop, pen drive, speakers = 2
Mobile, mobile cover, screen guard = 2
Here we assume that the minimum support is 2; this means all patterns with support equal to or greater than 2 are considered frequent patterns. On the basis of the above frequent patterns we can construct some association rules which satisfy the minimum support and a confidence >= 60%.
Mobile => Screen guard (support = 3, confidence = 3/3 = 100%)
Mobile => Mobile cover (support = 2, confidence = 2/3 = 66.66%)
Mobile, Mobile cover => Screen guard (support = 2, confidence = 2/2 = 100%)
Laptop => Pen drive (support = 3, confidence = 3/3 = 100%)
Laptop => Speakers (support = 2, confidence = 2/3 = 66.66%)
Laptop, Pen drive => Speakers (support = 2, confidence = 2/3 = 66.66%)
On the basis of this small dataset we are able to find association rules which tell us that screen guards and mobile covers are usually bought together with mobile phones, and that pen drives and speakers are usually bought together with laptops. A small Java sketch that reproduces these counts is given below.
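As a sanity check on the numbers above, the following small, self-contained Java sketch (not part of the dissertation's Hadoop implementation) counts the support of itemsets over the six example transactions and prints the confidence of a few of the rules listed:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportConfidenceDemo {

    // The six example transactions listed in the text.
    static final List<Set<String>> TRANSACTIONS = Arrays.asList(
            itemset("laptop", "pen drive", "speakers"),
            itemset("laptop", "pen drive"),
            itemset("mobile", "screen guard", "mobile cover"),
            itemset("laptop", "speakers", "pen drive"),
            itemset("mobile", "mobile cover", "screen guard"),
            itemset("mobile", "screen guard"));

    static Set<String> itemset(String... items) {
        return new HashSet<String>(Arrays.asList(items));
    }

    // Support = number of transactions containing every item of the pattern.
    static int support(String... items) {
        int count = 0;
        for (Set<String> t : TRANSACTIONS) {
            if (t.containsAll(Arrays.asList(items))) {
                count++;
            }
        }
        return count;
    }

    // Confidence of the rule "antecedent => consequent" = support(both) / support(antecedent).
    static double confidence(String[] antecedent, String consequent) {
        String[] both = Arrays.copyOf(antecedent, antecedent.length + 1);
        both[antecedent.length] = consequent;
        return 100.0 * support(both) / support(antecedent);
    }

    public static void main(String[] args) {
        System.out.println("support(mobile) = " + support("mobile"));                          // 3
        System.out.println("support(laptop, pen drive) = " + support("laptop", "pen drive"));  // 3
        System.out.printf("conf(mobile => screen guard) = %.2f%%%n",
                confidence(new String[] {"mobile"}, "screen guard"));                          // 100.00%
        System.out.printf("conf(laptop => speakers) = %.2f%%%n",
                confidence(new String[] {"laptop"}, "speakers"));                              // 66.67%
    }
}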
Chapter 4
Approach to Design
4.1 Flow of Implementation:

Figure4.1: Flow of Implementation
Chapter 5
Experimentation
5.1 Installation of Hadoop:
If you have already installed Ubuntu 12.04 or any other version, then please follow the steps mentioned below to install Hadoop in single-node pseudo-distributed mode, i.e. to install Hadoop on your local machine.

Step 1: Java Installation
To work with Hadoop, we first need to install Java on the local machine. Install the latest version of Java, i.e. Oracle Java 1.7, which is highly recommended for running Hadoop. Oracle Java 1.7 is used here for working with Hadoop because it is more stable, fast and has several new APIs.
The following commands are used for installing Java on Ubuntu:
Open the terminal using (Ctrl+Alt+T), then enter the following commands to install Java:
1) sudo apt-get install python-software-properties
2) sudo add-apt-repository ppa:webupd8team/java
3) sudo apt-get update
4) sudo apt-get install oracle-java7-installer
5) sudo update-java-alternatives -s java-7-oracle

The complete Java Development Kit is installed in /usr/lib/jvm/java-7-oracle.

When the installation has finished, check whether Java (the JDK) has been set up correctly by using the command shown below.


Figure5.1: Java Installed Successfully

Step 2: After successfully installing Java, add a separate (dedicated) user for Hadoop, called hduser.
Commands to create the hadoop group and the hduser user:
1) sudo addgroup hadoop
2) sudo adduser --ingroup hadoop hduser

Figure5.2: Create hadoop hduser and group
After the successful completion of the above steps, we have a separate user and group for Hadoop.

Step 3: How to configure ssh
To work with Hadoop on remote machines or on your local machine, Hadoop requires SSH access to manage its nodes.
Therefore we need to configure SSH access on localhost for the Hadoop user, i.e. the hduser that we created in the last step.
Commands to configure ssh access:
1) sudo apt-get install openssh-server
2) ssh-keygen -t rsa -P "" (run as hduser)

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost


Figure5.3: Configure ssh localhost
Step 4: Disabling IPv6
To work with Hadoop locally or on a distributed system, we need to disable IPv6 on Ubuntu 12.04.

First open the /etc/sysctl.conf file in any editor of your choice on Ubuntu, then add the lines mentioned below at the end of this file.
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Now restart your system so that the changes we have made take effect. After restarting the system, we can confirm whether IPv6 has been disabled on the local machine by using the command mentioned below:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6 (a return value of 1 shows that IPv6 has been disabled)

Figure5.4: Disabling IPv6
Step 5: After performing the initial steps, it is time to install Hadoop version 1.0.4 on your machine.

First download the hadoop-1.0.4 tar file, which is a stable release, from the Apache download mirrors, then extract (untar) the downloaded Hadoop file into a folder named hadoop at /usr/local/hadoop. Since this folder is common to all users, it is usually the preferred place to install Hadoop.

Use the commands mentioned below to untar the hadoop-1.0.4 tar file into the hadoop folder:
Move to the local folder using cd /usr/local, then use the commands mentioned below to untar hadoop-1.0.4:
1) sudo tar xzf hadoop-1.0.4.tar.gz
2) sudo mv hadoop-1.0.4 hadoop

Change the owner of all the files to the hadoop group and the hduser user using the command:
1) sudo chown -R hduser:hadoop hadoop

Now it is time to update the bash profile, which is present inside the home directory at $HOME/.bashrc.
To update the bash file, i.e. .bashrc, for hduser, you should be a root user; open it using the following command:
1) sudo gedit /home/hduser/.bashrc


Figure5.5: Open .bashrc file
Once .bashrc is open, add the following lines or settings at the end of the file:
# set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# add the Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin


Figure5.6: Update .bashrc file
To verify that the settings have been saved correctly, reload the bash profile by using the following commands:
1) source ~/.bashrc
2) echo $HADOOP_HOME
3) echo $JAVA_HOME

Figure5.7: Verify .bashrc file settings

Step 6: Changes in Hadoop Configuration
An XML file is used to configure each component of Hadoop: common properties go in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in mapred-site.xml. All of these XML files are located in the conf directory inside the hadoop folder.
1) Changes in 'hadoop-env.sh'
First open the conf/hadoop-env.sh file and set JAVA_HOME as:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

If an error such as "JAVA_HOME is not set" appears while starting the services, uncomment the JAVA_HOME line by removing the special symbol (#) in front of it.

Figure5.8: Changes in 'hadoop-env.sh'

2) Changes in 'conf/core-site.xml'
Open the core-site.xml file and add the following lines between the <configuration> ... </configuration> tags.
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description></description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description></description>
</property>

Figure5.9: Changes in 'conf/core-site.xml'

A directory named tmp is created, in which HDFS will store its temporary data. In the configuration mentioned above we used the hadoop.tmp.dir property to point to this temporary directory; on our local machine we use $HADOOP_HOME/tmp.
Commands to create the tmp directory and change its ownership and permissions:
1) sudo mkdir -p $HADOOP_HOME/tmp
2) sudo chown hduser:hadoop $HADOOP_HOME/tmp
3) sudo chmod 750 $HADOOP_HOME/tmp

3) Changes in 'conf/mapred-site.xml'
Open the mapred-site.xml file and add the following lines between the <configuration> and </configuration> tags.
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description> </description>
</property>


Figure5.10: Changes in 'conf/mapred-site.xml'

4) Changes in 'conf/hdfs-site.xml'
Open the hdfs-site.xml file and add the following lines between the <configuration> and </configuration> tags. (dfs.replication is set to 1 because this is a single-node cluster.)
<property>
<name>dfs.replication</name>
<value>1</value>
<description> </description>
</property>

Figure5.11: Changes in 'conf/hdfs-site.xml'
Step 7: Creating the Name Node Directory
mkdir -p $HADOOP_HOME/tmp/dfs/name
chown hduser:hadoop /usr/local/hadoop/tmp/dfs/name

Step 8: Format the name node
The Hadoop Distributed File System (HDFS) is implemented on top of the local file systems of your cluster, so the first step in starting up a new Hadoop installation is to format the Hadoop file system via the name node.
Command:
$HADOOP_HOME/bin/hadoop namenode -format

Figure5.12: Format the name node
Step 9: Starting the single-node hadoop cluster
Open a terminal, go to the bin directory inside the hadoop folder and start Hadoop using the command below:
$HADOOP_HOME/bin$ ./start-all.sh

Figure5.13: Starting single node hadoop cluster
Now, after successfully performing each of the steps above, it is time to check whether all the Hadoop daemons are running properly. Use the following command:
Command:
/usr/local/hadoop/bin$ jps

The output should look like the one below; if it does, Hadoop is running successfully.
4841 TaskTracker
4039 NameNode
4512 SecondaryNameNode
4275 DataNode
4596 JobTracker
4089 Jps
This confirms that Hadoop has been installed successfully and is working correctly.

Figure5.14: Hadoop Installed Successfully
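When the cluster later needs to be shut down, the matching stop script in the same bin directory can be used (noted here for completeness; it is not one of the numbered steps):
$HADOOP_HOME/bin/stop-all.sh    # stops the JobTracker, TaskTracker, NameNode, DataNode and SecondaryNameNode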
5.2 Installation of Maven:
Step 1:
Open the terminal, then enter the command below to download and install Maven:
sudo apt-get install maven2
Step 2:
Open the .bashrc file and add the lines below at the end of the file:
export M2_HOME=/usr/local/apache-maven-3.0.4
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
export JAVA_HOME=$HOME/programs/jdk
Step 3:
Set JAVA_HOME in the .bashrc file:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Add the Java jre directory to PATH:
export PATH=$PATH:$JAVA_HOME/jre
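As with the Hadoop variables earlier, the new settings can be verified by reloading the profile and echoing the variables (an optional check):
1) source ~/.bashrc
2) echo $M2_HOME
3) echo $JAVA_HOME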
Step 4:
Run mvn --version to verify that Maven is correctly installed. If a message like the one shown below is displayed, Maven has been installed successfully.


Figure5.15: Maven Installed Successfully

5.3 Installation of Mahout:
Step1:
Download the mahout source package in .zip format from the following link: http://www.apache.org/dyn/closer.cgi/lucene/mahout/
Step2:
Extract the archive into the /usr/local/mahout directory and check that a pom.xml file exists inside it.
Step3:
Open the terminal, move to the /usr/local/mahout directory and enter the following command:
mvn install (to build Mahout and install it on top of Hadoop)
If a message like the one shown below is displayed, Mahout has been installed successfully.

Figure5.16: Mahout Installed Successfully
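The build runs Mahout's unit tests, which can take a long time; if only the binaries are needed, the tests can be skipped with Maven's standard option (an optional variant, not part of the original steps):
mvn install -DskipTests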
5.4 How to Run a Simple Job on Hadoop:
1. Move to the /usr/local/hadoop/bin directory and start all the Hadoop daemons using the command:
./start-all.sh

Figure5.17: Start Hadoop
2. Create a word-count input text file inside the local /tmp directory using the command:
gedit /tmp/Wordcount.txt

Figure5.18: Create a text file in tmp directory
3. Copy the text file from the local /tmp directory to the Hadoop distributed file system using the following command (fs is the alias for hadoop fs defined earlier in .bashrc):
fs -copyFromLocal /tmp/Wordcount.txt /user/hduser/wordcountexample/Wordcount.txt


Figure5.19: Copy text file from tmp to hdfs

4. List the items present inside the wordcountexample directory using the command:
fs -ls /user/hduser/wordcountexample


Figure5.20: List of items present inside wordcount

5. Run the word-count example job on the wordcountexample directory using the following command (hadoop-examples-1.0.4.jar sits in the root of the Hadoop installation, so run this from $HADOOP_HOME or give the jar's full path):
hadoop jar hadoop-examples-1.0.4.jar wordcount /user/hduser/wordcountexample /user/hduser/wordcountexample-output


Figure5.21: Run the word-count map-reduce job
6. List the items present inside the wordcountexample-output directory using the following command:
fs -ls /user/hduser/wordcountexample-output

Figure5.22: List of items present word-count output directory

7. Print the generated output file to the console using the command (a consolidated sketch of the whole workflow follows Figure 5.23):
fs -cat /user/hduser/wordcountexample-output/part-r-00000


Figure5.23: Run the output file
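For convenience, the main steps above (copying the input, running the job and printing the result) can also be collected into one small shell script. The following is only a minimal sketch under the same assumptions as this section (the paths used above, the fs alias expanded to the full hadoop fs form, and $HADOOP_HOME/bin on the PATH); the script name is hypothetical:
#!/bin/bash
# run_wordcount.sh - hypothetical helper repeating the word-count steps above
hadoop fs -copyFromLocal /tmp/Wordcount.txt /user/hduser/wordcountexample/Wordcount.txt
hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar wordcount \
    /user/hduser/wordcountexample /user/hduser/wordcountexample-output
hadoop fs -cat /user/hduser/wordcountexample-output/part-r-00000    # print the word counts on the console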
Chapter 6
Discussion of Results
6.1 Find Frequent Patterns and Recommendations from Big Data:

1. Open a terminal and start Hadoop using the command ./start-all.sh.


Figure6.1: Start the hadoop nodes

2. Convert the data set into the .dat format that is required by the shell script.

Figure6.2: Convert file format into .dat
3. Add the path of the dataset to the shell script.
4. Run the dataset on top of Hadoop using Map-Reduce to find frequent patterns.
5. Start Mahout.


Figure6.3: Start Mahout
' Before running Mahout, set MAHOUT_LOCAL=true (note: when this variable is set, the mahout launcher runs its jobs against the local file system; leaving it unset makes Mahout submit the jobs to Hadoop).

Figure6.4: Set Mahout_Local=true
' After setting MAHOUT_LOCAL=true, go to the bin directory where the shell script is kept in order to generate recommendations from the dataset.

Figure6.5: Move to bin directory to run shell script

' Run the shell script, providing the path of the dataset (a sketch of such a script is given after Figure 6.6).

Figure6.6: Run the shell script to find recommendations
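The dissertation does not reproduce the shell script itself, so the following is only a minimal sketch of what such a script could look like. It assumes Mahout's standard recommenditembased driver (item-based collaborative filtering), a user,item,preference layout in the .dat file, and that the dataset path is passed as the first argument; the real script, paths and options may differ:
#!/bin/bash
# findrecommendations.sh - hypothetical name, not the actual script used in the dissertation
DATASET=$1    # path of the .dat dataset supplied on the command line
/usr/local/mahout/bin/mahout recommenditembased \
    --input $DATASET \
    --output /user/hduser/recommendations \
    --similarityClassname SIMILARITY_COOCCURRENCE \
    --numRecommendations 10 \
    --tempDir /user/hduser/recommend-tmp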


Chapter 7
Presentation of Results
After successfully running the script on Mahout, with Hadoop performing a number of Map-Reduce jobs, recommendations were generated in far less time than when the same data set was processed with a simple Java program in Eclipse.
1. When the size of the data was just 100 MB, Hadoop took 0.18705 minutes to process the data and generate recommendations.


Figure7.1: Data set is running and Map-Reduce took place

Figure7.2: Map-Reduce on data set


Figure7.3: Map-Reduce Job completed
Recommendations generated:

Figure7.4: Final result in terms of recommendations
2. When the size of the data was 200 MB, Hadoop took 0.24005 minutes to process the data and generate recommendations.

Figure7.5: Map-Reduce Job Completed
Recommendations generated:

Figure7.6: Final result in terms of recommendations
3. When the size of the data was 1 GB, Hadoop took 4.689 minutes to process the data and generate recommendations.

Figure7.7: Map-Reduce Job Completed
Recommendations generated:

Figure7.8: Final Result

Figure7.9: Item-Based Recommendations

Figure7.10: Item-Based Recommendations
Chapter 8
Conclusion
When I started the dissertation work, I noted that the size of data is growing day by day into gigabytes and terabytes, so it is not easy for an organization to handle such a large amount of data and to make predictions, find patterns and generate recommendations from it in a reasonable time using existing technologies. When Hadoop and Mahout are used together, however, the overhead for an organisation analysing big data becomes much smaller, and the execution time required to find patterns and recommendations drops considerably. We took around 500 GB of e-commerce website data and performed frequent pattern mining and collaborative filtering using Mahout and Hadoop. We then performed the same work using simple Java programs in Eclipse and compared the execution times. We found that the execution time to find patterns and recommendations using Mahout and Hadoop was far lower.

Chapter 9
Future Work
There is considerable scope for future work. At present, the data generated on e-commerce sites is first collected and stored in the desired form, the columns that are not required for analysis are removed, and the mining techniques are then applied to find patterns and products that can be recommended to users. In future, recommendations could be generated in a real-time environment: instead of storing the data in a database first, the techniques could be applied directly to the data that is generated daily and hourly, reducing overhead and increasing efficiency.
