In today's world, real-time or streaming data can be conceived as a continuous, ever-changing sequence of data that arrives at a system to be stored or processed. Big Data is also one of the hottest research topics in computing, and it requires different approaches: techniques, tools and architectures. Big data security faces the need to enforce security policies effectively to protect sensitive data. To satisfy this need, we propose a secure big data pipeline architecture for scalability and security. Throughout our work, we emphasize the security of messages. We use Apache Kafka and Apache Storm for the real-time streaming pipeline, together with sticky policies and an encryption/decryption algorithm for security.
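The sticky-policy idea — usage policy attached to, and cryptographically bound with, the message it governs — can be sketched as below. This is an illustrative envelope format of our own (the function names, the XOR keystream stand-in for a real cipher, and the key handling are all assumptions, not the paper's design); in a real pipeline the producer would publish the envelope to a Kafka topic and a Storm bolt would verify and decrypt it.

```python
import base64
import hashlib
import hmac
import json

def seal_message(payload: bytes, policy: dict, key: bytes) -> dict:
    """Bind a sticky policy to an encrypted payload (illustrative sketch)."""
    policy_json = json.dumps(policy, sort_keys=True).encode()
    # Keystream derived from the key -- a stand-in for a real cipher such as AES.
    stream = hashlib.sha256(key + b"enc").digest()
    cipher = bytes(b ^ stream[i % len(stream)] for i, b in enumerate(payload))
    # The HMAC covers policy AND ciphertext, so neither can be swapped alone.
    tag = hmac.new(key, policy_json + cipher, hashlib.sha256).hexdigest()
    return {
        "policy": policy,
        "ciphertext": base64.b64encode(cipher).decode(),
        "tag": tag,
    }

def open_message(envelope: dict, key: bytes) -> bytes:
    """Verify the policy/payload binding, then decrypt."""
    policy_json = json.dumps(envelope["policy"], sort_keys=True).encode()
    cipher = base64.b64decode(envelope["ciphertext"])
    expected = hmac.new(key, policy_json + cipher, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["tag"]):
        raise ValueError("policy or payload was tampered with")
    stream = hashlib.sha256(key + b"enc").digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(cipher))

env = seal_message(b"patient-record-42",
                   {"allow": ["analytics"], "ttl_days": 30}, b"secret")
assert open_message(env, b"secret") == b"patient-record-42"
```

A consumer that fails the HMAC check rejects the message before ever applying the policy, which is the point of making the policy "sticky".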
Replication plays an important role in cloud storage: it improves data availability, fault tolerance and throughput for users while controlling storage cost. Because data access patterns change over time, the set of popular files is unpredictable and unstable; data popularity is therefore taken into account as an important factor in replication. Considering popularity makes storage more efficient, since it reduces the storage wasted on unpopular files. Data locality is also a key issue in storage systems, and poor locality causes performance overhead. This paper therefore introduces a replication strategy for cloud storage. The proposed strategy consists of two parts: replica popularity and replica placement. First, for replica popularity, popularity is estimated by analyzing changes in the data access pattern. Second, for replica placement, replicas are placed on dedicated, assigned nodes in order to enhance data locality. The proposed ...
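One common way to estimate popularity from a changing access pattern — a sketch under our own assumptions, not necessarily the paper's formula — is exponential time decay, so recent accesses count more and stale popularity fades:

```python
# Illustrative popularity score: each past access contributes a weight that
# halves every `half_life` hours, so a shifting access pattern is tracked
# automatically. Names and the half-life value are our assumptions.

def popularity(access_times, now, half_life=24.0):
    """Sum of per-access weights that halve every `half_life` hours."""
    return sum(0.5 ** ((now - t) / half_life) for t in access_times)

recent = popularity([95.0, 97.0, 99.0], now=100.0)  # accessed in the last 5 hours
stale = popularity([10.0, 12.0, 14.0], now=100.0)   # accessed ~90 hours ago
assert recent > stale  # recent activity dominates the score
```

A file whose score falls below a threshold can then have surplus replicas reclaimed, which is how popularity-aware replication saves storage on unpopular files.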
Nowadays, replication is widely used in data center storage systems to prevent data loss. Data popularity is a key factor in data replication: popular files are accessed most frequently, which makes them unstable and unpredictable. Moreover, replica placement is one of the key issues that affect system performance, such as load balancing and data locality. Data locality is a fundamental problem for data-parallel applications (i.e., a data block must be copied to a processing node when that node does not hold the block in its local storage), and this problem decreases performance. To address these challenges, this paper proposes a dynamic replication management scheme based on data popularity and data locality; it includes replica allocation and replica placement algorithms. Data locality, disk bandwidth, CPU processing speed and storage utilization are considered in the proposed data placement algorithm in order to ...
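A placement decision combining the four listed factors can be sketched as a weighted score per candidate node; the weights, the node fields and the normalization below are illustrative assumptions of ours, not values from the paper:

```python
# Score each candidate node for a block; the highest-scoring node wins.
# Locality is binary (does the node already hold the block?), the other
# factors are fractions in [0, 1]. Weights are illustrative.

def placement_score(node, block_id, weights=(0.4, 0.2, 0.2, 0.2)):
    w_loc, w_disk, w_cpu, w_store = weights
    locality = 1.0 if block_id in node["local_blocks"] else 0.0
    return (w_loc * locality
            + w_disk * node["disk_bw_free"]            # free disk bandwidth
            + w_cpu * node["cpu_free"]                 # free CPU
            + w_store * (1.0 - node["storage_used"]))  # prefer emptier disks

nodes = [
    {"name": "n1", "local_blocks": {"b7"}, "disk_bw_free": 0.2,
     "cpu_free": 0.3, "storage_used": 0.8},
    {"name": "n2", "local_blocks": set(), "disk_bw_free": 0.9,
     "cpu_free": 0.8, "storage_used": 0.1},
]
best = max(nodes, key=lambda n: placement_score(n, "b7"))
assert best["name"] == "n1"  # locality outweighs n2's idle resources here
```

The weight on locality controls the trade-off: a busy node that already holds the block can still beat an idle node that would have to fetch it over the network.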
The proliferation of unstructured data continues within organizations of all types. This data growth raises the key question of how to effectively find and manage it in the growing sea of information, and it has created increasing demand for efficient search over such data. Providing effective indexing and search on unstructured data is not a simple task: unstructured data include documents, images, audio, video and so on. In this paper, we propose an efficient indexing and searching framework for unstructured data, in which text-based and content-based approaches are incorporated for retrieval. Our retrieval framework supports various types of queries and accepts both multimedia examples and metadata-based documents. The aim of this paper is to exploit the various features of multimedia data and to make content-based multimedia retrieval more efficient.
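The text-based side of such a framework can be sketched as an inverted index over extracted text and metadata, filterable by media type; content-based features (e.g. image descriptors) would be indexed separately. The class and its fields below are illustrative, not the paper's implementation:

```python
from collections import defaultdict

class TextIndex:
    """Minimal inverted index with a media-type metadata filter (sketch)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> metadata

    def add(self, doc_id, text, media_type):
        self.docs[doc_id] = {"type": media_type}
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query, media_type=None):
        terms = query.lower().split()
        # Conjunctive query: a document must contain every term.
        hits = set.intersection(*(self.postings[t] for t in terms)) if terms else set()
        if media_type:
            hits = {d for d in hits if self.docs[d]["type"] == media_type}
        return sorted(hits)

idx = TextIndex()
idx.add("v1", "lecture on distributed storage", "video")
idx.add("d1", "notes on distributed storage systems", "document")
assert idx.search("distributed storage") == ["d1", "v1"]
assert idx.search("distributed storage", media_type="video") == ["v1"]
```

Query-by-example over multimedia would follow the same shape, with feature vectors and a similarity measure in place of term postings.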
Replication plays an important role in storage systems: it improves data availability, throughput and response time for users while controlling storage cost. Because access patterns differ across data, popularity matters in replication: popular files are unstable and unpredictable by nature. Replica placement is likewise important for system performance; in data-parallel applications, data locality is a key issue, and poor locality degrades performance. Therefore, this paper proposes data locality-based replication for the Hadoop Distributed File System (HDFS). In replica allocation, data popularity is considered so that fewer replicas are maintained for unpopular data; disk bandwidth, CPU utilization and disk utilization are considered in the proposed replica placement algorithm in order to achieve better data locality and more effective storage utilization. We expect the proposed scheme to be effective for HDFS.
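The allocation step — fewer replicas for unpopular data, the HDFS default (3) otherwise, more for hot files — can be sketched as below. The thresholds, bounds and scaling rule are our illustrative assumptions, not values from the paper:

```python
# Map a popularity score to a replication factor. HDFS's default factor is 3;
# cold files drop to a single replica to save storage, hot files scale up to
# a cap. Threshold values are illustrative.

def replica_count(popularity, default=3, min_r=1, max_r=6,
                  cold=0.5, hot=10.0):
    if popularity < cold:
        return min_r  # unpopular data wastes less storage with one replica
    if popularity > hot:
        # Add one replica per `hot` units of popularity, capped at max_r.
        return min(max_r, default + int(popularity // hot))
    return default

assert replica_count(0.1) == 1   # cold file: minimum replicas
assert replica_count(2.0) == 3   # ordinary file: HDFS default
assert replica_count(25.0) == 5  # hot file: extra replicas, under the cap
```

Keeping `min_r` at 1 (never 0) preserves durability for cold data while still reclaiming most of the storage the default factor would spend on it.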
Data continue a massive expansion in scale, diversity, and complexity, and underpin activities in all sectors of society. Achieving the full transformative potential of this massive data in an increasingly digital world requires not only new data analysis algorithms but also a new generation of distributed computing platforms. Big data analytics is an area of rapidly growing diversity, and it demands massive performance, scalability and fault tolerance.
Existing big data platforms cannot scale to big data volumes, cannot handle mixed workloads, cannot respond to queries quickly, load data too slowly and lack processing capacity for analytics. Traditional data warehousing is a large but relatively slow producer of information for analytics users, and is mostly suited to analyzing structured data from various systems. Distributed scale-out storage meets the needs of big data challenges. A Hadoop-based platform is well suited to dealing not only with structured data but also with semi-structured and unstructured data, and it provides scalability and fault tolerance. Therefore, Hadoop-based platforms built on distributed scale-out storage have emerged to deal with big data. However, the NameNode in Hadoop stores all metadata in a single system's memory, which is a performance bottleneck for scale-out. The Gluster file system has no metadata-related performance bottleneck because it uses an elastic hashing algorithm to place data across the nodes, and it runs across all of those nodes.
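The key property of hash-based placement is that any node can compute a file's location from its name alone, with no central metadata server. The sketch below is a generic consistent-hash ring in that spirit, not GlusterFS's actual elastic hashing algorithm; all names are illustrative:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    """Deterministic 64-bit hash of a string."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=64):
        # Each server appears `vnodes` times on the ring to smooth the load.
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    def locate(self, filename: str) -> str:
        # First ring position at or after the file's hash, wrapping around.
        i = bisect.bisect(self.keys, _h(filename)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["server-a", "server-b", "server-c"])
# Every client computes the same placement with no metadata lookup.
assert ring.locate("logs/2024-01-01.gz") == ring.locate("logs/2024-01-01.gz")
```

Because placement is a pure function of the file name and the node set, there is nothing analogous to the NameNode to become a bottleneck; the trade-off is that changing the node set moves some data, which is where rebalancing comes in.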
The aim of this research is to propose a big data analytics platform on a distributed scale-out storage system that achieves massive performance, scalability and fault tolerance. It consists of two parts: big data processing and big data storage. For big data processing, Hadoop MapReduce is applied to handle mixed workloads, respond to analytical queries rapidly and support various high-level query languages. For big data storage, the Gluster file system is used to achieve better scalability, fault tolerance and faster data loading. The main issue in the Gluster file system is inefficient data rebalancing; therefore, a data rebalancing mechanism for the Gluster file system is proposed to achieve efficient storage utilization, reduce the number of file migrations and save file migration time.
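A rebalancing mechanism with those goals can be sketched as a greedy policy: when a server exceeds a utilization threshold, migrate its largest files to the emptiest server, so that fewer (but larger) migrations bring it back under the threshold. The threshold, capacity model and greedy rule below are our illustrative assumptions, not the proposed mechanism itself:

```python
def rebalance(servers, threshold=0.8, capacity=100):
    """servers: name -> list of (file, size). Returns the migration list."""
    moves = []
    used = lambda name: sum(size for _, size in servers[name])
    for src in list(servers):
        # Largest files first: each migration frees the most space,
        # minimizing the number of migrations needed.
        files = sorted(servers[src], key=lambda fs: fs[1], reverse=True)
        while used(src) > threshold * capacity and files:
            f, size = files.pop(0)
            dst = min(servers, key=used)  # emptiest server
            if dst == src:
                break  # nowhere better to put it
            servers[src].remove((f, size))
            servers[dst].append((f, size))
            moves.append((f, src, dst))
    return moves

cluster = {"s1": [("a", 50), ("b", 40), ("c", 5)], "s2": [("d", 10)]}
moves = rebalance(cluster)
assert moves == [("a", "s1", "s2")]  # one large migration suffices
```

Moving the single 50-unit file brings `s1` from 95% to 45% utilization in one migration, whereas a smallest-first policy would have needed several.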
The Hadoop big data platform (MapReduce and the Hadoop Distributed File System) and the proposed big data platform (MapReduce and the Gluster File System) are implemented on clusters of commodity Linux virtual machines, and performance evaluations are conducted. According to the evaluation analysis, the proposed big data platform provides better scalability, fault tolerance and faster query response time than the Hadoop platform. According to the simulation results, the proposed data rebalancing mechanism achieves 82% storage fullness while requiring only 20% of the file migrations, 20% of the file migration time, and 73% of the storage servers needed by the current mechanism of the Gluster file system.
The latest studies have shown a growing demand among patients for the ability to book their healthcare appointments online. In this day and age, booking everything online, from hotels and flights to restaurant reservations, is commonplace, not to mention convenient. Patients can now gain the same benefits when booking their medical appointments. Online booking systems are also highly favored by hospitals, doctors, dentists and their staff, as they save substantial time on scheduling appointments and allow resources to be allocated to other, more pertinent areas.
To save patients' time and energy, an online clinic reservation system is proposed. The proposed system offers health professionals a more efficient and convenient way for patients to reserve clinic appointments. Users can search by doctor specialty, gender, day and/or doctor name, browse doctors' profiles and view their specialty information. A registered user can choose and book an appointment at a flexible time and date, and can confirm or cancel the reservation. The system is implemented in PHP, and MySQL is used to store data.
Previously, people wishing to visit places had to search for available accommodation at their destinations and make reservations themselves, and they hardly had any knowledge of which places were worth seeing or of their history. Such a procedure was time-consuming and energy-wasting. To save their time and energy, an online tour reservation system is proposed. The proposed system provides information on tourist attractions and tour packages for ASEAN countries and India. Users can search tour packages by various criteria such as country, city, style, duration and price; search hotels by country, city and star rating; search flights by country, departure city and arrival city; and view embassy information. A registered user can reserve a desired tour package and confirm or cancel the reservation, request a tour suggestion, share experiences, write comments on tours and places, and rate tour guides. The system is implemented in Java (J2EE) with the Struts and Tiles frameworks, and Oracle 11g is used to store data.
Big data analytics is the process of examining large amounts of data of various types to uncover hidden patterns, unknown correlations and other useful information. Existing big data platforms such as data warehouses cannot scale to big data volumes, cannot handle mixed workloads, and cannot respond to queries quickly. Hadoop-based platforms built on distributed scale-out storage have emerged to deal with big data. In Hadoop, the NameNode stores all metadata in a single system's memory, which is a performance bottleneck for scale-out. The Gluster file system has no metadata-related performance bottleneck because it uses an elastic hashing algorithm to place data across the nodes, and it runs across all of those nodes. To achieve massive performance, scalability and fault tolerance for big data analytics, a big data platform is proposed. The proposed platform consists of big data storage and big data processing: the Gluster file system is used for storage, and Hadoop MapReduce is applied for processing. The Hadoop big data platform and the proposed big data platform are implemented on clusters of commodity Linux virtual machines, and performance evaluations are conducted. According to the evaluation analysis, the proposed big data platform provides better scalability, fault tolerance and faster query response time than the Hadoop platform.
Data continue a massive expansion in scale, diversity, and complexity, and underpin activities in all sectors of society. Achieving the full transformative potential of this massive data in an increasingly digital world requires not only new data analysis algorithms but also a new generation of systems and distributed computing environments. These systems must handle the dramatic growth in the volume of data, the lack of structure in much of it, and the increasing computational needs of massive-scale analytics. In this paper, a highly scalable big data platform for massive-scale analytics is proposed, built on Hadoop MapReduce, the Gluster file system and user-friendly query languages such as Apache Hive, Apache Pig and Jaql. The Hadoop big data platform and the proposed big data platform are implemented on clusters of commodity Linux virtual machines. The two platforms are evaluated with a large census data set, and we find that the proposed platform provides better scalability than the Hadoop platform.
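The MapReduce programming model that underlies the platform (and that Hive, Pig and Jaql queries compile down to) can be illustrated with the canonical word count. Real Hadoop jobs are written in Java or generated by those query languages; this Python version is only a sketch of the model:

```python
from collections import Counter
from itertools import chain

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in the input record."""
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each key (the framework groups by key)."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

records = ["big data platform", "big data analytics"]
result = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
assert result == {"big": 2, "data": 2, "platform": 1, "analytics": 1}
```

Because map calls are independent and reduce only needs all pairs sharing a key, both phases parallelize across the cluster, which is what makes the model scale with the underlying storage.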
In today's world, almost every enterprise is seeing an explosion of data, with huge amounts of digital data generated daily that must be stored for various reasons. The important question that arises is how to store, manage, process and analyze such huge amounts of data, most of which are semi-structured or unstructured, in a scalable, fault-tolerant and efficient manner. The challenges of big data are that most of it is semi-structured or unstructured, that complex computations must be carried out over it, and that the time required to process it must be as low as possible. In this paper, we propose a big data platform based on the Hadoop MapReduce framework and the Gluster file system over a large-scale shared storage system to address these challenges. Our big data platform can support large-scale data analysis efficiently and effectively.
Data continue a massive expansion in scale, diversity, and complexity, and underpin activities in all sectors of society. Achieving the full transformative potential of data in this increasingly digital world requires not only new data analysis algorithms but also a new generation of systems and distributed computing environments to handle the dramatic growth in the volume of data, the lack of structure in much of it, and the increasing computational needs of massive-scale analytics. In this paper, we propose an open-source big data platform built on Hadoop MapReduce, the Gluster File System, Apache Pig, Apache Hive and Jaql, and compare our platform with two other big data platforms, the IBM big data platform and Splunk. Our big data platform can support large-scale data analysis efficiently and effectively.