Big Data Technologies (Assignment - 2)


Big Data 

Big data refers to data that is so large, fast, or complex that it is difficult or impossible to process using traditional methods. Handling such data therefore requires a different class of technologies, collectively known as big data technologies.

Big Data technologies

Big data technologies are the software tools used to manage all types of datasets and transform them into business insights. In data science careers such as big data engineering, sophisticated analytics are used to evaluate and process huge volumes of data.

Examples: Hadoop, Cassandra, Spark, Kafka, MongoDB, etc.

Cassandra

Cassandra is a free and open-source, distributed, wide-column store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients. Cassandra was designed to combine Amazon's Dynamo distributed storage and replication techniques with Google's Bigtable data and storage engine model.

Features of Cassandra

Distributed

Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request.
Supports replication and multi-data-center replication
Replication strategies are configurable. Cassandra is designed as a distributed system for deployment of large numbers of nodes across multiple data centers. Key features of its distributed architecture are specifically tailored for multi-data-center deployment: redundancy, failover, and disaster recovery.
Scalability
Designed to have read and write throughput both increase linearly as new machines are added, with the aim of no downtime or interruption to applications.
Fault-tolerant
Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

Tunable consistency

Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered more important than consistency. Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle.
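
For example, the consistency level can be chosen per query. Below is a minimal sketch using the DataStax Java driver (4.x); the keyspace, table, and data are hypothetical, and a node is assumed to be reachable on the driver's default 127.0.0.1:9042.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class TunableConsistencySketch {
        public static void main(String[] args) {
            // Connects to 127.0.0.1:9042 by default.
            try (CqlSession session = CqlSession.builder().build()) {
                // QUORUM: a majority of replicas must respond to the read.
                SimpleStatement read = SimpleStatement
                        .builder("SELECT * FROM mykeyspace.users WHERE id = ?")
                        .addPositionalValue("1")
                        .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
                        .build();
                session.execute(read);

                // ONE: a single replica acknowledgement suffices (faster, weaker).
                SimpleStatement write = SimpleStatement
                        .builder("INSERT INTO mykeyspace.users (id, name) VALUES (?, ?)")
                        .addPositionalValue("2")
                        .addPositionalValue("Jane")
                        .setConsistencyLevel(DefaultConsistencyLevel.ONE)
                        .build();
                session.execute(write);
            }
        }
    }
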
MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is also support for Apache Pig and Apache Hive.
Query language
Cassandra introduced the Cassandra Query Language (CQL). CQL is a simple interface for accessing Cassandra, as an alternative to the traditional Structured Query Language (SQL).
Eventual consistency
Cassandra manages the eventual consistency of reads, upserts, and deletes through tombstones.

Cassandra Query Language

CQL adds an abstraction layer that hides the implementation details of Cassandra's storage structure and provides native syntaxes for collections and other common encodings. Language drivers are available for Java (JDBC), Python (DBAPI2), Node.js (DataStax), Go (gocql) and C++.

The keyspace in Cassandra is a namespace that defines data replication across nodes; replication is therefore defined at the keyspace level. Below is an example of keyspace creation, including a column family, in CQL 3.0:
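
The names in this sketch are illustrative; the final DELETE is included to show the tombstone mechanism mentioned under eventual consistency.

    -- Replication is set per keyspace. SimpleStrategy with a replication
    -- factor of 3 keeps three copies of each row; NetworkTopologyStrategy
    -- would be used for multi-data-center replication.
    CREATE KEYSPACE MyKeySpace
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

    USE MyKeySpace;

    -- In CQL 3.0, COLUMNFAMILY and TABLE are synonyms.
    CREATE COLUMNFAMILY MyColumns (
      id text,
      lastName text,
      firstName text,
      PRIMARY KEY (id)
    );

    INSERT INTO MyColumns (id, lastName, firstName) VALUES ('1', 'Doe', 'John');

    SELECT * FROM MyColumns;

    -- A DELETE does not remove data in place; it writes a tombstone that
    -- suppresses older values until compaction removes them.
    DELETE FROM MyColumns WHERE id = '1';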

History of Cassandra

  • Cassandra was developed at Facebook for inbox search.
  • It was open-sourced by Facebook in July 2008.
  • Cassandra was accepted into Apache Incubator in March 2009.
  • It became an Apache top-level project in February 2010.



Kafka

Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation, written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."


Architecture

Kafka stores key-value messages that come from arbitrarily many processes called producers. The data can be partitioned into different "partitions" within different "topics". Within a partition, messages are strictly ordered by their offsets (the position of a message within a partition), and indexed and stored together with a timestamp. Other processes called "consumers" can read messages from partitions. For stream processing, Kafka offers the Streams API that allows writing Java applications that consume data from Kafka and write results back to Kafka. Apache Kafka also works with external stream processing systems such as Apache Apex, Apache Beam, Apache Flink, Apache Spark, Apache Storm, and Apache NiFi.

Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers. This architecture allows Kafka to deliver massive streams of messages in a fault-tolerant fashion and has allowed it to replace some of the conventional messaging systems like Java Message Service (JMS), Advanced Message Queuing Protocol (AMQP), etc. Since the 0.11.0.0 release, Kafka offers transactional writes, which provide exactly-once stream processing using the Streams API.
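
To make the producer and consumer roles concrete, here is a minimal sketch using Kafka's Java clients; the broker address localhost:9092, the topic name "events", and the group id are illustrative assumptions.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerConsumerSketch {
        public static void main(String[] args) {
            // Producer: publishes key-value messages. Records sharing a key
            // go to the same partition, so their relative order is preserved.
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("events", "user-42", "logged_in"));
            }

            // Consumer: reads messages from the topic's partitions in offset order.
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "example-group");
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(List.of("events"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
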

Kafka supports two types of topics: regular and compacted. Regular topics can be configured with a retention time or a space bound. If there are records older than the specified retention time, or if the space bound is exceeded for a partition, Kafka is allowed to delete old data to free storage space. By default, topics are configured with a retention time of 7 days, but it is also possible to store data indefinitely. For compacted topics, records do not expire based on time or space bounds. Instead, Kafka treats later messages as updates to older messages with the same key and guarantees never to delete the latest message per key. Users can delete a key entirely by writing a so-called tombstone message with a null value for that key.
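
As an illustration of both ideas, the sketch below creates a compacted topic through the Admin API and then deletes a key by producing a null-value tombstone; the broker address, topic name, and key are hypothetical.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CompactionSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            // Admin API: create a compacted topic, where Kafka keeps at least
            // the latest record per key instead of expiring data by time/size.
            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("user-profiles", 3, (short) 1)
                        .configs(Map.of("cleanup.policy", "compact"));
                admin.createTopics(List.of(topic)).all().get();
            }

            // A record with a null value is a tombstone: on compaction, it
            // removes all earlier records with the same key.
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("user-profiles", "user-42", null));
            }
        }
    }
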

There are five major APIs in Kafka:

  • Producer API – Permits an application to publish streams of records.
  • Consumer API – Permits an application to subscribe to topics and process streams of records.
  • Connector API – Supports building reusable connectors that link Kafka topics to existing applications or data systems.
  • Streams API – Converts input streams into output streams and produces the result.
  • Admin API – Used to manage Kafka topics, brokers, and other Kafka objects.

The consumer and producer APIs are decoupled from the core functionality of Kafka through an underlying messaging protocol. This allows writing compatible API layers in any programming language that are as efficient as the Java APIs bundled with Kafka. The Apache Kafka project maintains a list of such third party APIs.


Kafka APIs

Connect API

Kafka Connect (or Connect API) is a framework to import/export data from/to other systems. It was added in the Kafka 0.9.0.0 release and uses the Producer and Consumer APIs internally. The Connect framework itself executes so-called "connectors" that implement the actual logic to read/write data from other systems. The Connect API defines the programming interface that must be implemented to build a custom connector. Many open-source and commercial connectors for popular data systems are already available. However, Apache Kafka itself does not include production-ready connectors.
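
To give a sense of that programming interface, here is a bare-bones custom source connector; everything in it (class names, topic, the constant record it emits) is hypothetical, and a real connector would track offsets and handle configuration properly.

    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.common.config.ConfigDef;
    import org.apache.kafka.connect.connector.Task;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceConnector;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    public class DemoSourceConnector extends SourceConnector {
        private Map<String, String> config;

        @Override public void start(Map<String, String> props) { this.config = props; }
        @Override public Class<? extends Task> taskClass() { return DemoSourceTask.class; }
        @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
            return List.of(config);  // a single task reusing the connector config
        }
        @Override public void stop() { }
        @Override public ConfigDef config() { return new ConfigDef(); }
        @Override public String version() { return "0.1"; }

        // The task does the actual reading from the external system.
        public static class DemoSourceTask extends SourceTask {
            @Override public void start(Map<String, String> props) { }
            @Override public List<SourceRecord> poll() throws InterruptedException {
                Thread.sleep(1000);  // pretend to wait for new data
                return List.of(new SourceRecord(
                        Map.of("source", "demo"),  // source partition
                        Map.of("position", 0L),    // source offset
                        "connect-demo-topic", Schema.STRING_SCHEMA, "hello"));
            }
            @Override public void stop() { }
            @Override public String version() { return "0.1"; }
        }
    }
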

Streams API

Kafka Streams (or Streams API) is a stream-processing library written in Java. It was added in the Kafka 0.10.0.0 release. The library allows for the development of stateful stream-processing applications that are scalable, elastic, and fully fault-tolerant. The main API is a stream-processing domain-specific language (DSL) that offers high-level operators like filter, map, grouping, windowing, aggregation, joins, and the notion of tables. Additionally, the Processor API can be used to implement custom operators for a lower-level development approach. The DSL and Processor API can be mixed, too. For stateful stream processing, Kafka Streams uses RocksDB to maintain local operator state. Because RocksDB can write to disk, the maintained state can be larger than the available main memory. For fault tolerance, all updates to local state stores are also written into a topic in the Kafka cluster. This allows recreating state by reading those topics and feeding all data into RocksDB. The latest version of the Streams API is 2.8.0.
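
The canonical word-count example gives a feel for the DSL; the following is a minimal sketch, assuming a broker at localhost:9092 and input/output topic names chosen here for illustration.

    import java.util.Arrays;
    import java.util.Locale;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class WordCountSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("text-input");
            // Split each line into words, group by word, count per word.
            // count() materializes a local state store (RocksDB) backed by a
            // changelog topic in the cluster, as described above.
            KTable<String, Long> counts = lines
                    .flatMapValues(line -> Arrays.asList(line.toLowerCase(Locale.ROOT).split("\\W+")))
                    .groupBy((key, word) -> word)
                    .count();
            counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }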

