Got a Hairball of Data Pipelines?

Real-World Best Practices for Big Data Ingestion


Keeping the data lake stocked with complete, current and high-quality data is often the first problem an enterprise encounters with its brand-new Hadoop or Spark cluster. Incomplete, inaccurate or late data leads to false positives, missed insights and negative business impact. To address data ingestion, you must solve for three things: the growing variety of data sources, the need to ingest continuously to meet real-time demands, and the insidious problem of data drift (unexpected changes to schema or semantics), which silently corrodes data quality.

The traditional approach has been to on-board data sources using custom code and low-level Apache frameworks like Sqoop, Flume and Kafka. The “hairball” of point-to-point pipelines this spawns is constantly under duress: the pipelines break easily, they must be rewritten whenever operational or business requirements change, and they lack the instrumentation needed to monitor the availability and accuracy of the data. This leads to delayed delivery of data to applications, an endless cycle of fire-fighting and maintenance, and dangerous pollution of the data lake that damages analytical integrity.

In this webinar, we will show you how to take a structured approach to big data ingestion that solves these problems and ensures your architecture will thrive over the long term. Drawing from real-world enterprise examples, you will learn how to implement an efficient and effective operation built on top of a reliable, continuous and fully automated data ingestion infrastructure.

Specifically, Kirit Basu from StreamSets and Mike Ferguson from Intelligent Business Strategies will cover how to:

  • Create a process for quickly on-boarding new batch and streaming data sources with minimal code but enhanced control.
  • Manage data movement as a continuous operation.
  • Improve data quality by solving the problems caused by data drift.
  • Set and enforce “Data SLAs” around data availability and accuracy.
  • Build an agile data movement architecture that can adapt to infrastructure changes and new use cases. 

Webinar Registration

Featured Speakers


Moderator: Michael Ferguson

Managing Director, Intelligent Business Strategies, Ltd.

As an analyst and consultant, Mike Ferguson specialises in business intelligence, big data, data management and enterprise business integration. He has over 34 years of IT experience, and has spoken at events all over the world and written numerous articles.


Kirit Basu

Director of Product Management, StreamSets

Kirit Basu is the Director of Product Management at StreamSets. In the past he has worked at startups, edtech and healthcare organizations on a wide-ranging set of technologies dealing with data big and small.