Big Data Ingestion and Streaming
What is data Ingestion?
Data ingestion is the process of obtaining, importing, and processing data for later use or storage in a database. This process often involves altering individual files by editing their content and/or formatting them to fit into a larger document.
An effective data ingestion methodology begins by validating the individual files, then prioritizes the sources for optimum processing, and finally validates the results. When numerous data sources exist in diverse formats (the sources may number in the hundreds and the formats in the dozens), maintaining reasonable speed and efficiency can become a major challenge. To that end, several vendors offer programs tailored to the task of data ingestion in specific applications or environments.
Data Ingestion Frameworks
- Amazon Kinesis – real-time processing of streaming data at massive scale.
- Apache Chukwa – data collection system.
- Apache Flume – service to manage large amount of log data.
- Apache Kafka – distributed publish-subscribe messaging system.
- Apache Sqoop – tool to transfer data between Hadoop and a structured datastore.
- Cloudera Morphlines – framework that help ETL to Solr, HBase and HDFS.
- Facebook Scribe – streamed log data aggregator.
- Fluentd – tool to collect events and logs.
- Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- Heka – open source stream processing software system.
- HIHO – framework for connecting disparate data sources with Hadoop.
- Kestrel – distributed message queue system.
- LinkedIn Databus – stream of change capture events for a database.
- LinkedIn Kamikaze – utility package for compressing sorted integer arrays.
- LinkedIn White Elephant – log aggregator and dashboard.
- Logstash – a tool for managing events and logs.
- Netflix Suro – log agregattor like Storm and Samza based on Chukwa.
- Pinterest Secor – is a service implementing Kafka log persistance.
- Linkedin Gobblin – linkedin’s universal data ingestion framework.
A Flume agent is a Java virtual machine (JVM) process that hosts the components through which events flow. Each agent contains at the minimum a source, a channel and a sink. An agent can also run multiple sets of channels and sinks through a flow multiplexer that either replicates or selectively routes an event. Agents can cascade to form a multihop tiered collection topology until the final datastore is reached.
A Flume event is a unit of data flow that contains a payload and an optional set of string attributes. An event is transmitted from its point of origination, normally called a client, to the source of an agent. When the source receives the event, it sends it to a channel that is a transient store for events within the agent. The associated sink can then remove the event from the channel and deliver it to the next agent or the event’s final destination.
Flume is distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data stream in situ. Flume and Chukwa share similar goals and features. However, there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in Zookeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop distributed filesystem (HDFS) and MapReduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a ﬂexible and powerful toolkit for displaying monitoring and analyzing results, in order to make the best use of this collected data.
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS stil does not support appending to existing files. Chukwa is a Hadoop subproject that bridges that gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Some of the durability features include agent-side replying of data to recover from errors.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates.
Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high throughput persistent messaging that’s scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize IO performance and mirroring to improve availability, scalability and to optimize performance in multiple-cluster scenarios.
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose, event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
How Storm Works
A storm cluster has three sets of nodes:
- Nimbus node (master node, similar to the Hadoop JobTracker):
- Uploads computations for execution
- Distributes code across the cluster
- Launches workers across the cluster
- Monitors computation and reallocates workers as needed
- ZooKeeper nodes – coordinates the Storm cluster
- Supervisor nodes – communicates with Nimbus through Zookeeper, starts and stops workers according to signals from Nimbus
Five key abstractions help to understand how Storm processes data:
- Tuples– an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
- Streams – an unbounded sequence of tuples.
- Spouts –sources of streams in a computation (e.g. a Twitter API)
- Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases.
- Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram)
Storm users define topologies for how to process the data when it comes streaming in from the spout. When the data comes in, it is processed and the results are passed into Hadoop.
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.
Here are a few sample use-cases that Elasticsearch could be used for:
- You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
- You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
- You run a price alerting platform which allows price-savvy customers to specify a rule like “I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month”. In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
- You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.
Spring XD is a unified, distributed, and extensible service for data ingestion, real time analytics, batch processing, and data export. The Spring XD project is an open source Apache 2 License licenced project whose goal is to tackle big data complexity. Much of the complexity in building real-world big data applications is related to integrating many disparate systems into one cohesive solution across a range of use-cases. Common use-cases encountered in creating a comprehensive big data solution are
- High throughput distributed data ingestion from a variety of input sources into big data store such as HDFS or Splunk
- Real-time analytics at ingestion time, e.g. gathering metrics and counting values.
- Workflow management via batch jobs. The jobs combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or HBase).
- High throughput data export, e.g. from HDFS to a RDBMS or NoSQL database.
Apache Solr is a powerful search server, which supports REST like API. Solr is powered by Lucene which enables powerful matching capabilities like phrases, wildcards, joins, grouping and many more across various data types. It is highly optimized for high traffic using Apache Zookeeper. Apache Solr comes with a wide set of features and we have listed a subset of high impact features.
- Advanced Full-Text search capabilities.
- Standards based on Open Interfaces – XML, JSON and Http.
- Highly scalable and fault tolerant.
- Supports both Schema and Schemaless configuration.
- Faceted Search and Filtering.
- Support major languages like English, German, Chinese, Japanese, French and many more
- Rich Document Parsing.