With vast amounts of data now available, organizations in almost every industry are focused on exploiting data for competitive advantage. At the same time, the broad availability of data has led to increasing interest in new methods for extracting useful information and knowledge from it. The discipline called data science has arisen as a new paradigm for tackling this vast accumulation of data. Today it is applied in almost every field, including security, health care, business, agriculture, transport, education, prediction, and telecommunication. Each area will also obtain a different return on its data science investment. This review provides an overview of data science, focusing on how some of these fields currently use it and how they could leverage it in the future.
What is Data Science?
According to Dhar, V. (2012), the fact that we now have vast amounts of data should not in and of itself justify the need for a new term, "data science". The extraction of information from data has been done with statistics for decades. Nevertheless, there are several reasons to consider data science a new field. First, the raw material, the "data" part of data science, is increasingly diverse and unstructured (text, images, and video), frequently arising from networks with complex relationships among their entities. He further presents the relative expected volumes of unstructured and structured data between 2008 and 2015, projecting a difference of almost 200 petabytes in 2015 compared to a difference of 50 petabytes in 2012. Second, markup languages, tags, and similar devices are designed mainly to let computers interpret data automatically, making them active agents in decision making (Dhar, V. (2012)); that is, computers are increasingly doing the background work for each other.
In their proceedings paper, Zhu, Y. & Xiong, Y. (2015) state that the research objects, goals, and techniques of data science are essentially different from those of computer science, information science, or knowledge science. Throughout the paper they use a comparative method to discuss how data science differs from existing technologies and established sciences. According to them, data science supports both natural science and social science, and dealing with data is one of its driving forces. Hence, they refer to data science as a data-intensive science.
These arguments suggest that data science should be considered a new science, and that new techniques and methods are needed to deal with its vast amounts of data. Dhar, V. (2012) explains why traditional database methods are not suited for knowledge discovery: they are optimized for summarizing data given what the user wants to ask, not for discovering patterns in massive amounts of data when the user does not have a well-formulated query. Unlike database querying, which asks "what data satisfy this pattern?", discovery asks "what patterns satisfy the given data?". Specifically, the ultimate goal of data science is to find interesting and robust patterns that satisfy the data.
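The contrast between querying and discovery can be illustrated with a minimal sketch. The basket data, the function names `query` and `discover_pairs`, and the `min_support` threshold are all assumptions made for illustration; they do not come from Dhar's article. Simple pair counting stands in for the far more elaborate pattern-mining methods used in practice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set is one customer's shopping basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

def query(baskets, pattern):
    """Database-style question: which data satisfy this given pattern?"""
    return [b for b in baskets if pattern <= b]

def discover_pairs(baskets, min_support):
    """Discovery-style question: which item pairs recur in the data?"""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# The user already knows the pattern and asks for matching records.
matches = query(baskets, {"bread", "milk"})

# The user has no pattern in mind and asks the data which pairs are frequent.
frequent = discover_pairs(baskets, min_support=2)

print(len(matches))
print(frequent)
```

In the first call the pattern is supplied by the user; in the second the patterns are an output of the analysis, which is the shift in perspective described above.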
The emergence of new technologies led to research on the data themselves. Many fields, such as health care and business, took advantage of this, and researchers were able to discover growth patterns of data and predict the scale of data in cyberspace ten years into the future (Zhu, Y. & Xiong, Y. (2015)). This has led to new theories and inventions that had gone undiscovered for years. Many health-related issues have been identified and addressed using big data analytics.
Applications of Data Science
1) Health care
A key contemporary trend emerging in big data science is the quantified self (QS): individuals engaged in the self-tracking of any kind of biological, physical, behavioral, or environmental information, as n = 1 individuals or in groups (Swan, M. (2013)). In this article the author emphasizes that quantified self projects are becoming an interesting data management and manipulation challenge for big data science in the areas of data collection, integration, and analysis. At the same time, she notes that as much larger QS data sets are generated, the quantified self, and health and biology more generally, are becoming full-fledged big data problems in many ways. A variety of self-tracking projects have been conducted recently, including food visualization, investigation of media consumption and reading habits, a multilayer investigation into diabetes and heart disease risk, and idea tracking. These projects demonstrate the range of topics, depth of problem solving, and variety of methodologies of QS projects. Big health data streams are the main data streams in QS, and the most difficult task is integrating them, especially blending genomic and environmental data. Genetics is reported to contribute about one third of the outcome in diseases such as cancer and heart disease. Projects such as DIY genomics studies, 4P personalized medicine, Crohn's disease tracking via microbiomic sequencing, the lactoferrin analysis project, and the thyroid hormone testing project are well-known examples of QS applications in genomics. Findings like these would not have been possible without data science or the newly developed tools that can handle large volumes of data.
Swan, M. (2013) further suggests that QS data streams need to be linked to longitudinal self-tracking of healthy populations more generally, since these are the healthy cohorts corresponding to the patient cohorts in clinical trials, and she predicts that eye tracking and emotion measurement could follow in the future.
X. Shi and S. Wang (2015) provide an overview of the theoretical background for applying the cyberGIS (geographic information science) approach to spatial analysis in health studies. (cyberGIS is defined as geographic information systems and science based on advanced cyberinfrastructure.) Because spatial analysis is a tool for analyzing big data, it is widely used in medical fields. According to them, many literature reviews find that a majority of the methods use only geographically local information and generate non-parametric measurements. Multiple related cases can be found in which computational and data sciences are central to solving challenging problems in the framework of health-GIS. Disease mapping is one of the major areas; it is used to measure the intensity of a disease in a particular area. Data aggregation is a method that was developed to deal with cancer registry databases and birth defect databases. These applications are not limited to disease-based assessments; they also assess environmental factors associated with health, such as disparities in geographic access to health care. X. Shi and S. Wang (2015) mention a study that estimated the distances or travel times from patients' locations represented by polygon-level data.
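As a rough illustration of disease mapping with aggregated data, the sketch below sums case counts and populations by region and reports cases per 100,000 people. The records, region names, and the function `rates_per_100k` are hypothetical, and a crude rate is used only as the simplest possible measure of intensity; it is not the method described by Shi and Wang.

```python
# Hypothetical registry extracts: (region, cases, population).
records = [
    ("North", 12, 40_000),
    ("North", 8, 10_000),
    ("South", 5, 50_000),
    ("East", 0, 20_000),
]

def rates_per_100k(records):
    """Aggregate cases and population by region, then compute a crude rate."""
    totals = {}
    for region, cases, population in records:
        c, p = totals.get(region, (0, 0))
        totals[region] = (c + cases, p + population)
    # Skip regions with no recorded population to avoid dividing by zero.
    return {r: 100_000 * c / p for r, (c, p) in totals.items() if p > 0}

rates = rates_per_100k(records)
for region, rate in rates.items():
    print(f"{region}: {rate:.1f} cases per 100,000")
```

Aggregating before dividing matters here: the two North records are combined into a single regional rate instead of being averaged as separate rates.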
In conclusion, the two articles reviewed above show that health care is a field with a lot of untapped potential, and that the use of big data or data science is not limited to finding remedies for diseases but also extends to assessing other factors, such as those that affect the efficiency of health care.
2) Social media and networks
Swan, M. (2013) points out that having large quantities of data continues to allow for new methods and discovery. As Google has shown, finally having large enough data sets was the key moment for progress in many venues, where simple machine-learning algorithms could then be run over large amounts of data to produce significant results. She illustrates this with Google's spelling correction and language translation, image recognition, and cultural anthropology via word searches on a database of 5 million digitally scanned books. Dhar, V. (2012) also addresses this matter, explaining that Google's language translator doesn't "understand" language, nor do its algorithms know the contents of webpages. Such efficient and accurate systems were built with machine learning algorithms, not by tackling the problem through an extensive enumeration of possibilities but by "training" a computer to interpret questions correctly based on large numbers of examples. In addition, he emphasizes that knowledge of text processing, or "text mining", is becoming essential in light of the explosion of text and other unstructured data in health care systems, social networks, and other sectors.
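The idea of learning from examples, as opposed to writing rules by hand, can be sketched with a small frequency-based spelling corrector. This is a common teaching example and not Google's actual system; the miniature corpus and the function names `one_edit_away` and `correct` are assumptions for illustration. Candidate words are generated mechanically, but the choice among them comes entirely from word counts observed in example text.

```python
from collections import Counter
import string

# Hypothetical miniature corpus; a real system learns from vastly more text.
corpus = "the data set is large the data is noisy science uses data".split()
word_counts = Counter(corpus)

def one_edit_away(word):
    """All strings reachable by one deletion, swap, replacement, or insertion."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    if word in word_counts:
        return word
    candidates = [w for w in one_edit_away(word) if w in word_counts]
    return max(candidates, key=word_counts.get) if candidates else word

print(correct("dats"))
print(correct("teh"))
```

The corrector has no notion of what any word means; with a larger corpus its behavior changes without any change to the code, which is the sense in which such systems are trained, not programmed.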
Data science has been attracting a great deal of attention in academia and in settings that deal with theories and formulas. It improves current methods of scientific research in order to form new methods and improve specific theories, methods, and technologies in various fields (Zhu, Y. & Xiong, Y. (2015)). The vast accumulation of data makes it possible to filter out a considerably large portion that is relevant to a particular object of study, which provides a strong platform for researching rare and important matters in any field. At the same time, they argue that data science itself requires more fundamental theories and new methods and techniques, concerning, for example, the existence of data, the measurement of data, time in cyberspace, data algebra, data similarities and the theory of clusters, and data classification. New action plans, conferences, workshops, data science journals, institutes dedicated to data science, and university courses that teach data science as a subject will increase awareness and understanding of this new science.
3) Business
Business is one of the major sectors that benefit from data science principles and data mining techniques. Data mining techniques are widely used in marketing for tasks such as targeted marketing, online advertising, and recommendations for cross-selling (Provost, F. & Fawcett, T. (2013)). According to them, data science is used in business mainly with the objective of improving decision making. They distinguish two types of decisions: (1) decisions for which "discoveries" need to be made within data, and (2) decisions that repeat, especially at massive scale. Provost, F. & Fawcett, T. (2013) explain how companies try to increase their customer base with a data science approach, using the example of the retailer Target, which sells, among other things, baby-related products. To increase its number of customers, Target was interested in whether it could predict in advance that people are expecting a baby. If so, it could make offers to them before its competitors did. Most birth records are public, so retailers obtain that information and send offers to new parents. A retailer that could obtain the information before the baby was born would gain an advantage in its marketing campaign. Using data science techniques, Target analyzed historical data on customers and identified a group of customers who were later revealed to have been pregnant. Such a prediction can draw on signals such as changes in an expectant mother's diet and vitamin regimen.
According to Provost, F. & Fawcett, T. (2013), the banking sector has also gained from data science: banks have been able to do more sophisticated predictive modeling of pricing, credit limits, low-initial-rate balance transfers, cash back, loyalty points, and so on. The modern credit card system in particular is described as an outcome of big data analytics. They further claim that banks with bigger data assets may have an important strategic advantage over their smaller competitors, so the net result will be increased acceptance of the bank's products, decreased cost of customer acquisition, or both.
Customer churn, that is, customers switching from one company to another, is a critical problem that service providers face (Provost, F. & Fawcett, T. (2013)). They state that attracting new customers is much more expensive than retaining existing ones, so every service provider tries to prevent churn by giving at-risk customers a retention offer. Data mining techniques are widely used to identify customers who are likely to churn.
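One way to identify customers who are likely to churn is to fit a classifier on past customers and score current ones. The sketch below uses logistic regression trained with plain gradient descent; the two features (support calls and years as a customer), the toy records, and the function names are assumptions for illustration, and Provost and Fawcett do not prescribe this particular model.

```python
import math

# Hypothetical history: (support_calls, years_as_customer) -> 1 churned, 0 stayed.
customers = [
    ((5.0, 0.5), 1),
    ((4.0, 1.0), 1),
    ((6.0, 1.5), 1),
    ((1.0, 4.0), 0),
    ((0.0, 5.0), 0),
    ((2.0, 6.0), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.1, epochs=1000):
    """Fit logistic-regression weights with batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for x, y in data:
            # Prediction error drives the gradient of the log loss.
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

def churn_probability(model, x):
    """Estimated probability that a customer with features x will churn."""
    w, b = model
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

model = train(customers)
at_risk = churn_probability(model, (5.0, 1.0))
loyal = churn_probability(model, (1.0, 5.0))
print(f"at-risk customer: {at_risk:.2f}, loyal customer: {loyal:.2f}")
```

A provider could then direct retention offers to customers whose estimated probability exceeds some threshold, which would be chosen by weighing the cost of the offer against the cost of losing the customer.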
Challenges and barriers
This process stores a lot of personal data about individuals in every sector. With such vast amounts of personal data, there are boundaries and issues that researchers and data scientists should consider. In health care in particular, many patients are not comfortable with sharing their data publicly (Swan, M. (2013)). In her opinion, it is necessary to think proactively about personal data privacy rights and neural data privacy rights to facilitate humanity's future directions in a mature, comfortable, and empowering way. Dhar, V. (2012) also addresses this matter. He explains that, with rapid technological development, the computer has become the decision maker, unaided by humans, which raises a multitude of issues such as the cost of incorrect decisions and ethical concerns.
Data science is evidently a newly emerged science that requires broad knowledge, mainly of computational science, statistics, and mathematics. New technologies have emerged to deal with massive amounts of data in any field, and the benefits are many, ranging from health care to telecommunication. At the same time, these technologies should be handled cautiously to ensure that respondents' information is not exploited. In the near future, data science is likely to produce many discoveries that help people improve their lives in every aspect.