Aggregate Functions (e.g. SUM, AVG, MIN, MAX, and COUNT) in MashQL to Retrieve RDF Data in Structured Format

Abstract - Many business applications now rely on web data for running their operations. The need to retrieve web data in a structured format is driven by the sheer volume of data on the web. Traditional keyword-based tools such as Yahoo! Pipes and Google retrieve data by keyword search, but often return ambiguous results and unnecessary data that the user does not want.
In this paper we present aggregate functions (e.g. SUM, AVG, MIN, MAX, and COUNT) in MashQL to retrieve RDF data in a structured format. Using these functions, the user can perform a calculation on a set of values and return a single aggregated value from the retrieved RDF results. This helps the user understand statistical properties of huge web data sources in RDF format. The SPARQL query language is used to retrieve the data from multiple web data sources in RDF format. We also describe an approach to keyword search over RDF web data by introducing a search box on top of MashQL: the user issues a keyword search and then uses MashQL to filter the retrieved results. We illustrate how response time decreases when the search box is applied on top of MashQL, and we evaluate response time using huge web datasets: the DBPedia and DBLP data sets.
In this paper we describe a method to retrieve web data in a structured format using the MashQL approach with the help of the SPARQL language, which allows the user to query, navigate, and mash up data sources without knowing the schema or how the data is stored in the database. We assume the web data given as input is in RDF format, which is used to represent the metadata of web applications in a structured manner [1].

Keywords: Query Formulation, Data Mashup, Web Data, RDF, SPARQL.

I. INTRODUCTION

To retrieve the huge amount of web data and allow people to combine data from different sources into one application, several mashup editors have been introduced, such as DERI Pipes, Google Mashup Editor, Yahoo! Pipes and others. Of these, Yahoo! Pipes has received the most attention because of its user-friendly environment, and it is considered a milestone in the history of the internet. On the other hand, Yahoo! Pipes is limited, since its mashups focus only on web feeds. In the MashQL approach, mashups are treated as data queries: the web is viewed as a database, each source is seen as a table, and combining data from various sources is a query. We assume the website data is represented in RDF format.
A further challenge to consider in the implementation is that a query might involve multiple sources. From this point of view, users should be able to query structured data flexibly across many sources, which requires a mechanism that allows multiple sources to be involved in an easy and efficient way. Building data mashups to ease access to and expose structured data on the web is a challenge, and an implementation should satisfy the following requirements [2].

1. The user should be able to perform aggregate functions on RDF data to obtain statistical information.
2. The user should be able to search the data by using keyword search.
3. The end user does not know the schema (i.e., how the data is stored in the database).
4. The data might be schema-free (the user can formulate queries without knowing the underlying data structure, e.g. as in Google's search engine).
5. A query may involve multiple data sources.
6. The query language is sufficiently expressive (i.e., not merely a single-purpose user interface).
7. It does not require programming skills.


II. RELATED WORK

This section reviews the main approaches to query formulation and how they relate to the MashQL.

Query-by-Form: In this approach the user creates a form in which query criteria are entered. The form contains blank text boxes; each text box represents a field in the table to be queried, and entries are made only in the text boxes for which search criteria are specified [7]. The QBF form resembles a data-entry form whose fields match the fields in a table. A table with fields for Name, Address, City, State, and ZIP Code can have a QBF form with corresponding fields. To select certain records, the user can enter any combination of search criteria in the form: a city only, a city and a state, a ZIP code only, or any other combination; fields left blank are ignored. This is the simplest approach defined, but it is neither flexible nor expressive, and it fails assumptions 2-4.

Query-by-Example: This approach is another query language. It has a graphical user interface that allows users to write queries by creating example tables on the screen. It is especially suited for queries that are not too complex and can be expressed in terms of a few tables. It provides a simple interface in which, instead of writing an entire SQL command, the user can just fill in blanks or select items to define a query. For example, to select an entry with an ID of 123 from a table called 'Table1' using SQL, the user would need to input the command 'select * from Table1 where ID=123'; a Query-by-Example interface lets the user just click on Table1, type '123' in the ID field and click on search [6].

The motivation behind QBE is that a parser can convert the user's actions into statements expressed in a database manipulation language such as SQL; behind the scenes, it is this statement that is actually executed. A suitably comprehensive front end can minimize the burden on the user of remembering the finer details of SQL, and it is easier and more productive for end users (and even programmers) to choose tables and columns by selecting them rather than typing in their names. However, the data must have a schema and the user must be aware of that schema, so this approach fails assumptions 1, 2 and 4.

Conceptual Queries: This approach allows users to formulate queries directly on the conceptual schema itself. There are four levels at which a user may communicate with information: the external, conceptual, logical and physical levels of the schema. The external level deals with the interface and input/output representations (for example screen forms and printed reports). At the logical level a data model is used for storage (for example the Entity-Relationship model or an object-oriented model) and information is expressed using the logical constructs of that model, such as tables and various keys. At the physical level a specific database management system is chosen, such as MS Access or DB2 [9].

At the conceptual level, the information is expressed in its most fundamental form, using concepts and language familiar to the users (for example, 'Employee drives car') and ignoring implementation and external presentation. However, the data must have a schema, and the user must have a good knowledge of that schema [11]. So this approach fails assumptions 1, 2, 3 and partly 4.

Natural Language Queries: In this approach the user does not use any special syntax, but enters the search in the form of a statement or a question, e.g. 'what is aspirin?' [4]. A special feature supporting natural language queries is that a trailing question mark ('?', which would normally be treated as a wildcard character) is automatically stripped from the query if there are at least three terms in the query and the initial term is a question word: who, what, when, where, why, or how. The problem is that this approach is fundamentally limited by language ambiguity and by the free mapping between sentences and data schemas, so it fails assumptions 2, 3 and, to a degree, 4.

Visualized Queries: This approach allows a user to represent queries as a graph, communicating the essence of a query through a visual semantics. The user can read data from a data source that supports the API, read an entire table, or run a query on the data source using the visual query language. The query response is an easy-to-process data table designed to simplify the data for visualization; it includes both the underlying values and the formatted values, column types, headers and more. This approach requires little programming skill to formulate queries, but it fails assumptions 1 and 4.

Interactive Queries: This approach allows querying of schema-free XML sources, without the assumption that the user knows the schema. A generic interactive query interface to an XML database is used for semi-structured data [8]. This is the closest approach to MashQL, but it does not support multiple sources and fails assumption 3.

Mashup Editors: This approach allows querying without any prior knowledge of the data's structure or schema. The user formulates a query by making selections from drop-down lists; as the user interacts with the editor, the editor queries the data set in the background to generate structured data [2]. A mashup, in web development, is a web page or web application that uses and combines data, presentation or functionality from two or more sources to create new services. The term implies easy, fast integration, frequently using open application programming interfaces and data sources to produce enriched results that were not necessarily the original reason for producing the raw source data [1].

The main characteristics of a mashup are combination, visualization, and aggregation. Mashups make existing data more useful, for personal and professional use. To be able to permanently access the data of other services, mashups are generally client applications or hosted online. Mashup editors have greatly simplified the creation of mashups, significantly increasing the productivity of mashup developers and even opening mashup development to end users and non-IT experts. Standard components and connectors enable designers to combine mashup resources in all sorts of complex ways with ease [10][14].
MashQL can be used anywhere to retrieve structured data on the web, provided the web data is in RDF format. For example, suppose www.books.com is a website whose book data is stored in RDF format, and the user wants to display the names and authors of all books whose publishing year is greater than 2008. The user performs these steps from drop-down lists: first, select the book title as the Subject; all the properties of books are then displayed as Predicates, from which the user selects author; next, the user selects an Object filter (less than, more than, equals, contains, and so on), here 'more than'; finally, the user enters the object value 2008. The query then retrieves only the book names and corresponding authors for books published after 2008; unrelated book data is not displayed in the result.
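The paper does not give the generated query, but a MashQL diagram like the one above would plausibly translate into a SPARQL query along the following lines (the prefix ex: and the property names ex:title, ex:author and ex:publishYear are hypothetical, assumed here purely for illustration):

```sparql
PREFIX ex: <http://www.books.com/schema#>

# Subject: the book; Predicates: author, publishing year;
# Object filter: publishing year more than 2008.
SELECT ?title ?author
WHERE {
  ?book ex:title       ?title .
  ?book ex:author      ?author .
  ?book ex:publishYear ?year .
  FILTER (?year > 2008)
}
```

Each drop-down selection in MashQL corresponds to one triple pattern, and the object filtration criterion becomes a FILTER clause.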

III. PROBLEM STATEMENT

In many business applications web data is used for running business operations. The need to retrieve web data in a structured format is driven by the sheer volume of data on the web. Traditional keyword-based tools such as Yahoo! Pipes and Google retrieve data by keyword search, but often return ambiguous results and unnecessary data that the user does not want. Users often need statistical information from huge RDF datasets, obtained through aggregate functions (SUM, AVG, MAX, MIN, COUNT), which do not exist in MashQL. Users also often need to retrieve only the data related to a keyword from huge RDF datasets, which is likewise not possible in MashQL [1].
In this paper we propose aggregate functions (e.g. SUM, AVG, MAX, MIN, COUNT) in MashQL, so that the user can perform a calculation on a set of values and return a single value.
We also propose a search box on top of MashQL that allows keyword search, after which MashQL is used to filter the retrieved results. With the search box on top of MashQL, the query is formulated over the retrieved results, so the result is returned faster.

IV. SOLUTION APPROACH AND METHODOLOGY

For structured data on the web we use the DBPedia and DBLP datasets, where all the data is stored in RDF format. The following technologies are used for structured data retrieval.

Resource Description Framework (RDF)
The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications [1] originally designed as a metadata data model. It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax notations and data serialization formats. The RDF data model[2] is similar to classic conceptual modeling approaches such as entity relationship or class diagrams, as it is based upon the idea of making statements about resources (in particular web resources) in the form of subject, predicate and object expressions. These expressions are known as triples in RDF terminology.

The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion 'The sky has the color blue' in RDF is as the triple: a subject denoting 'the sky', a predicate denoting 'has the color', and an object denoting 'blue'. In object-oriented design terms, these correspond to object (sky), attribute (color) and value (blue). RDF is an abstract model with several serialization formats (i.e., file formats), and so the particular way in which a resource or triple is encoded varies from format to format.
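As a minimal sketch, the 'sky' triple above can be written in the Turtle serialization of RDF (the namespace ex: and the names ex:theSky and ex:hasColor are assumptions made for illustration, not part of any standard vocabulary):

```turtle
@prefix ex: <http://example.org/> .

# Subject    Predicate      Object
ex:theSky    ex:hasColor    "blue" .
```

The same triple would look different in RDF/XML or N-Triples, but the subject-predicate-object structure is identical across serializations.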

Simple Protocol and RDF Query Language (SPARQL)
SPARQL, a recursive acronym for SPARQL Protocol and RDF Query Language, is an RDF query language, that is, a query language for databases able to retrieve and manipulate data stored in Resource Description Framework format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium. SPARQL allows a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. The SPARQL query processor searches for sets of triples that match these patterns, binding the variables in the query to the corresponding parts of each triple. Important to note here is the 'property orientation' [12].

The SPARQL language specifies four different query variations for different purposes.
1. SELECT query: used to extract raw values from a SPARQL endpoint; the results are returned in table format.
2. CONSTRUCT query: used to extract information from the SPARQL endpoint and transform the results into valid RDF.
3. ASK query: used to provide a simple true/false answer to a query on a SPARQL endpoint.
4. DESCRIBE query: used to extract an RDF graph from the SPARQL endpoint, the contents of which are left to the endpoint to decide based on what the maintainer deems useful information. Each of these query forms takes a WHERE block to restrict the query, although in the case of the DESCRIBE query the WHERE block is optional [14][15].
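For instance, a minimal query of the first (SELECT) form matches a single triple pattern and returns the variable bindings as a table; the rdfs:label property used here is just one common choice of predicate:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Return every resource that has an rdfs:label, together
# with that label, as a two-column table of bindings.
SELECT ?resource ?label
WHERE {
  ?resource rdfs:label ?label .
}
LIMIT 10
```

The ?resource and ?label variables in the SELECT clause become the column headers of the result table.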

Proposed Architecture

The proposed architecture, which specifies the various modules in the system, is shown in Figure 1 below.

Figure 1. Proposed Architecture for System

Mathematical Model

A mathematical model is a description of a system using mathematical concepts such as set theory. The process of developing a mathematical model is termed mathematical modeling, which describes the problem, the input data, the output data, the functionality and the constraints [9][2].

1. Let S be a system that describes the MashQL editor, which retrieves web data in structured format from the given web data/websites:
S = {...}
2. Identify the input as A:
S = {A, N, ...}
Ai = the i-th input web dataset/website from which data is to be retrieved in structured format, where 1 <= i <= N.
N = the number of web datasets/websites given as input.
3. Identify the keyword and the output as Ks and O:
S = {A, N, Ks, O, ...}
Ks = the keyword provided in the search box.
O = the retrieved structured data.
4. Identify the process as P:
S = {A, N, Ks, O, P, ...}
P = {Pd, Pf, Ps, Pe, Ag}, where:
Pd = retrieve the Subject, Predicate and Object from the web data and provide the aggregate functions.
Pf = formulate the MashQL query based on the input.
Ps = translate the MashQL query into a SPARQL query.
Pe = execute the SPARQL query on the web data and retrieve the data in structured format.
Ag = perform aggregate functions on the retrieved result.
5. Identify success as SU:
S = {A, N, Ks, O, P, SU, ...}
SU = success: the web data is retrieved successfully.
6. Identify the initial condition as SI:
S = {A, N, Ks, O, P, SU, SI}
The data is retrieved successfully in structured format; a single value is returned for the chosen aggregate function, and response time is reduced by providing keyword search.
System Model for MashQL

The editor uses a 3-tier architecture. The client, from a browser, interacts with the editor, which resides on the application server. The system model in the following figure presents the interaction that the client has with the editor and that the editor has with the database [7].


Figure 2. System Model for MashQL

When the user initializes a new source in RDF format, the loader mechanism of the editor, executed on the application server, is responsible for sending a request to the database and checking whether the source already exists there. If not, the loader signals the database to download the source and then informs the user that the source is ready to use. The loader is implemented on the application server, since it has to interact with the database frequently [13].
Once the source is ready, the user can formulate a query. The background-queries mechanism is implemented on the application server. When this mechanism receives a request, it formulates the appropriate query and sends it to the database for execution. The results are fetched back from the database and passed to the client.
When query formulation is finished, the user can execute the query by calling the parser mechanism, which parses the MashQL query diagram into an Oracle SPARQL query and sends it to the query mechanism; this interacts with the database and passes the results to the results renderer for visualization in structured format.
When the user wants to save the retrieved structured data into the database, the pipe generator mechanism is used.

V. RESULTS AND DISCUSSION

This section presents two types of evaluations:
1. Accuracy of RDF data retrieval using aggregate functions such as SUM, AVG, MIN, MAX and COUNT.
2. The time cost, i.e. the response time required to retrieve RDF data in structured format, with a search box (keyword search) and without a search box on top of MashQL.

Accuracy of aggregate functions

By using an aggregate function (SUM, AVG, MIN, MAX or COUNT) in MashQL, the user can perform a calculation on a set of values and return a single value. The aggregate functions were tested using the public datasets DBPedia and DBLP to ensure they work as expected. For example, the Book.rdf dataset contains the data shown below in tabular form, after the data has been retrieved and converted to relational format.

Table 1: Sample Data for RDF in Relational Format

Here, the Computer Graphics book has three different prices (518, 900 and 1500); after performing the aggregate functions on the price column, the results are as shown below:

Table 2: Result for Aggregate Functions

Note that the above results are shown for illustration purposes; the aggregate functions were also tested on huge datasets such as DBPedia and DBLP, which have thousands of records.
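As a sketch of how such aggregates are expressed, SPARQL 1.1 supports COUNT, SUM, AVG, MIN and MAX together with GROUP BY; a query over book data like the sample above might look as follows (the vocabulary ex:title and ex:price is assumed for illustration, not taken from the actual Book.rdf):

```sparql
PREFIX ex: <http://www.books.com/schema#>

# Group the price values by book title and compute one
# aggregated value per title for each aggregate function.
SELECT ?title
       (COUNT(?price) AS ?numPrices)
       (SUM(?price)   AS ?totalPrice)
       (AVG(?price)   AS ?avgPrice)
       (MIN(?price)   AS ?minPrice)
       (MAX(?price)   AS ?maxPrice)
WHERE {
  ?book ex:title ?title .
  ?book ex:price ?price .
}
GROUP BY ?title
```

For the Computer Graphics group with prices 518, 900 and 1500, such a query would yield a count of 3, a sum of 2918, an average of about 972.67, a minimum of 518 and a maximum of 1500.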

Response Time Evaluation with Keyword Search

This section evaluates the response time of MashQL with a search box (keyword search) and without a search box on top of MashQL. The following two cases were evaluated.
1. A search box on top of MashQL allows keyword search, after which MashQL is used to filter the retrieved results. With the search box on top of MashQL, the query is formulated over the retrieved results, so the result is returned faster.
2. Queries are formulated in MashQL without providing a keyword in the search box. We verified the response time in both cases; as per our algorithm, the response time in case 1 is much lower than in case 2, because MashQL runs on the retrieved keyword-search results.

Response time from input to output was tested with RDF datasets of different sizes, and we found that response time is lower when a keyword is provided in the search box.

For example, the RDF dataset
http://data.semanticweb.org/dumps/conferences/eswc-2007-complete.rdf has ten thousand records in total, of which one thousand match the keyword 'paper'. So, when the keyword 'paper' is provided in the MashQL search box, every subject, predicate and object matching 'paper' is retrieved from the web data (i.e., one thousand records), and the user can then perform MashQL operations on these records instead of on all ten thousand. In this way the time cost is reduced.
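The paper does not show the query behind the search box, but one plausible SPARQL sketch of this keyword pre-filter is a generic triple pattern with a case-insensitive regex filter over subject, predicate and object:

```sparql
# Keep only the triples in which the subject, predicate or
# object mentions the keyword "paper" (case-insensitive).
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  FILTER (
    regex(str(?s), "paper", "i") ||
    regex(str(?p), "paper", "i") ||
    regex(str(?o), "paper", "i")
  )
}
```

MashQL queries would then be formulated over the (much smaller) set of matching triples rather than the full dataset.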
Following are the results as shown in table.

Table 3: Response Time Without and With Keyword Search

We also tested the response time of keyword search when multiple datasets were provided as input at the same time.

VI. CONCLUSION AND FUTURE WORK

We proposed aggregate functions (e.g. SUM, AVG, MIN, MAX, and COUNT) in MashQL to retrieve RDF data in structured format, which helps the user understand statistical data from huge web data sources in RDF format. We also illustrated how much less response time is required when a keyword is given in the search box on top of MashQL, and we evaluated response time using huge public web datasets.

There is considerable scope for improving the developed system in various ways. MashQL could be used as a business-rules language, which would require several additional operators. The graph signature could also be extended into a general-purpose query optimizer, for which the query model would need to be extended to retrieve leaf nodes as well as any node in a tree [1].
