Hadoop Data Ingestion

As an open source solution that runs on clusters of commodity hardware, Hadoop has emerged as a powerful and cost-effective platform for big data analytics. To tap into the value of big data and Hadoop, businesses must first solve the thorny problem of Hadoop data ingestion—the process of migrating data from source systems into a Hadoop cluster. Many of today's leading enterprises, across a range of industries, are finding Qlik Replicate® to be the ideal solution for meeting the challenges of Hadoop data ingestion.

Hadoop Data Ingestion Challenges: Taming the 3 V's

The main challenges for Hadoop data ingestion revolve around the oft-cited "3 V's" of big data: volume, variety, and velocity.

Volume. The first difficulty in implementing Hadoop data ingestion is the sheer volume of data involved—Hadoop clusters commonly span dozens, hundreds, or even thousands of nodes, and hundreds of terabytes or even petabytes of data. Qlik Replicate is an enterprise data integration platform, purpose-built for moving and managing big data. With a modular, multi-threaded, multi-server architecture, Replicate easily scales out to meet any organization's high-volume data ingestion needs, enabling users to configure and manage thousands of replication tasks across hundreds of sources through a single pane of glass.

Variety. A distinctive quality of a Hadoop data warehouse—sometimes called a Hadoop data lake—is that it brings together a wide range of data types from a wide range of source systems. As a unified solution for Hadoop data ingestion, Qlik Replicate has the broadest source system support in the industry. Through a single solution, Replicate supports loading data into Hadoop from any major RDBMS, mainframe, data warehouse, SAP application, or flat file. And because Replicate empowers data managers and analysts to configure and execute Hadoop data ingestion jobs and processes without any manual coding, it's easy and fast to add new sources at any time.

Velocity. Today's enterprise data keeps coming with no let-up. For database and data warehouse sources, Qlik Replicate supports change data capture (CDC) to enable real-time data ingestion that feeds live data to your Hadoop cluster and your big data analytics. Replicate even integrates with Apache Kafka to stream data to multiple big data targets concurrently, such as Hadoop, Cassandra, and MongoDB.
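To make the change data capture concept concrete, here is a minimal, self-contained Python sketch. It derives insert, update, and delete events by diffing two snapshots of a table keyed by primary key. This is a toy illustration only—production CDC tools such as Qlik Replicate read database transaction logs rather than comparing snapshots—but the change events it emits have the same basic shape as those a CDC pipeline would stream to a target like Hadoop or Kafka. All function and field names here are illustrative, not part of any real product API.

```python
# Toy illustration of change data capture (CDC): derive insert/update/delete
# events by diffing two snapshots of a table, keyed by primary key.
# Log-based CDC tools read the database transaction log instead of diffing,
# but the resulting change events have a similar shape.

def capture_changes(before, after):
    """Return change events between two {pk: row} snapshots."""
    events = []
    for pk, row in after.items():
        if pk not in before:
            events.append({"op": "insert", "key": pk, "data": row})
        elif before[pk] != row:
            events.append({"op": "update", "key": pk, "data": row})
    for pk in before:
        if pk not in after:
            events.append({"op": "delete", "key": pk})
    return events

# Two snapshots of a small "customers" table.
before = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
after = {1: {"name": "Ada Lovelace"}, 3: {"name": "Alan"}}

events = capture_changes(before, after)
# Yields one update (key 1), one insert (key 3), and one delete (key 2).
```

In a real pipeline, each such event would be serialized (e.g., as JSON or Avro) and streamed to the target cluster as it occurs, rather than computed in batch.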
