Introduction to Apache Flume | Apache Flume User Guide

What is Apache Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Flume Architecture

What is a Flume Agent?

A Flume agent is a process (JVM) that hosts the components that allow Events to flow from an external source to an external destination. An Event is a unit of data that flows through a Flume agent. The Event flows from Source to Channel to Sink, and is represented by an implementation of the Event interface. An Event carries a payload (byte array) that is accompanied by an optional set of headers (string attributes).

What is a Flume Source?

A Source consumes Events having a specific format, and those Events are delivered to the Source by an external source like a web server. When a Source receives an Event, it stores it into one or more Channels.

What is a Flume Channel?

The Channel is a passive store that holds the Event until that Event is consumed by a Sink. One type of Channel available in Flume is the FileChannel which uses the local filesystem as its backing store.
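
As a rough illustration of the FileChannel mentioned above, a file-backed channel could be configured along these lines. This is a minimal sketch: the agent name matches the example later in this post, while the channel name `file_channel` and both directory paths are assumptions made up for illustration.

```properties
# Hypothetical file-backed channel (name and paths are assumptions)
flume_agent1.channels = file_channel
flume_agent1.channels.file_channel.type = file
# Directory where the channel keeps its checkpoint
flume_agent1.channels.file_channel.checkpointDir = /home/cloudera/flume/checkpoint
# Directory (or comma-separated directories) where event data is stored
flume_agent1.channels.file_channel.dataDirs = /home/cloudera/flume/data
```

Because Events are written to disk, a file channel is generally slower than a memory channel, but Events staged in it are not lost when the agent process restarts.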

What is a Flume Sink?

A Sink is responsible for removing an Event from the Channel and putting it into an external repository like HDFS (in the case of an HDFSEventSink) or forwarding it to the Source at the next hop of the flow. The Source and Sink within the given agent run asynchronously with the Events staged in the Channel.
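
The "next hop" case described above is commonly wired with an Avro sink on one agent and an Avro source on the next. The fragment below is only a sketch: the agent names `agent_a` and `agent_b`, the component names, the host name, and the port are all assumptions, and the usual `sources`/`channels`/`sinks` declaration lines are omitted.

```properties
# First agent: an Avro sink that forwards Events to the next hop
agent_a.sinks.avro_sink.type = avro
agent_a.sinks.avro_sink.hostname = collector.example.com
agent_a.sinks.avro_sink.port = 4545
agent_a.sinks.avro_sink.channel = mem_channel

# Second agent: an Avro source listening on the same port
agent_b.sources.avro_source.type = avro
agent_b.sources.avro_source.bind = 0.0.0.0
agent_b.sources.avro_source.port = 4545
agent_b.sources.avro_source.channels = mem_channel
```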

What is an Interceptor?

An Interceptor can modify or even drop Events in flight, based on any criteria chosen by the developer.
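
To make this concrete, the fragment below sketches two of Flume's built-in interceptors attached to a source: one that modifies Events (adds a timestamp header) and one that drops them (filters by regular expression). The interceptor names `ts_interceptor` and `filter_interceptor` and the regex are assumptions chosen for illustration; the source name is taken from the example later in this post.

```properties
# Interceptors run in the order listed here
flume_agent1.sources.spooldir_source.interceptors = ts_interceptor filter_interceptor

# Modify: add a "timestamp" header to every Event
flume_agent1.sources.spooldir_source.interceptors.ts_interceptor.type = timestamp

# Drop: discard Events whose body matches the regex (pattern is an assumption)
flume_agent1.sources.spooldir_source.interceptors.filter_interceptor.type = regex_filter
flume_agent1.sources.spooldir_source.interceptors.filter_interceptor.regex = .*DEBUG.*
flume_agent1.sources.spooldir_source.interceptors.filter_interceptor.excludeEvents = true
```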

What is a Channel Selector?

A Channel Selector is the component of Flume that determines which channel a particular Event should go into when a Source is connected to a group of channels.

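Built-in channel selectors include:

1. Replicating channel selector (the default): copies each Event to every channel.
2. Multiplexing channel selector: routes each Event to a channel based on the value of an Event header.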
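
A multiplexing selector might be configured roughly as follows. This is a sketch under assumptions: the header name `region`, its values, and the three channel names are hypothetical, and the channels themselves would still need to be declared and configured elsewhere in the file.

```properties
# The source writes to several channels (channel names are assumptions)
flume_agent1.sources.spooldir_source.channels = channel_us channel_eu channel_default

# Route each Event by the value of its "region" header (header name is an assumption)
flume_agent1.sources.spooldir_source.selector.type = multiplexing
flume_agent1.sources.spooldir_source.selector.header = region
flume_agent1.sources.spooldir_source.selector.mapping.US = channel_us
flume_agent1.sources.spooldir_source.selector.mapping.EU = channel_eu
# Events with any other (or no) header value go here
flume_agent1.sources.spooldir_source.selector.default = channel_default
```

If no selector is configured, the replicating selector is used and every Event goes to all listed channels.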

What are Sink Processors?

A Sink Processor is the mechanism that invokes a particular sink from a selected group of sinks. Sink processors are used to create failover paths for sinks or to load balance Events across multiple sinks from a channel.
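
A failover setup could look something like the fragment below. It is a sketch only: the group name `sink_group1` and the second sink `backup_sink` are assumptions, and `backup_sink` would have to be defined as a regular sink elsewhere in the configuration.

```properties
# Group two sinks so that a sink processor can choose between them
flume_agent1.sinkgroups = sink_group1
flume_agent1.sinkgroups.sink_group1.sinks = hdfs_sink backup_sink

# Failover: the sink with the higher priority value is tried first
flume_agent1.sinkgroups.sink_group1.processor.type = failover
flume_agent1.sinkgroups.sink_group1.processor.priority.hdfs_sink = 10
flume_agent1.sinkgroups.sink_group1.processor.priority.backup_sink = 5
# Upper limit, in milliseconds, on the back-off applied to a failed sink
flume_agent1.sinkgroups.sink_group1.processor.maxpenalty = 10000
```

For load balancing instead of failover, the processor type would be `load_balance`, with `processor.selector` set to `round_robin` or `random`.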

What are Event Serializers?

An Event Serializer is the mechanism by which a FlumeEvent is converted into another format for output. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body.
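
As a small illustration, switching the HDFS sink from the default text serializer to header_and_text might look like this. The sketch assumes the sink is named `hdfs_sink`, as in the example later in this post, and that the serializer is set through the sink's `serializer` property.

```properties
# Write both the headers and the body of each Event to the output file
flume_agent1.sinks.hdfs_sink.serializer = header_and_text
```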

Simple Flume Agent Example

Flume Agent Definition Configuration

# Define a source, a channel, and a sink
flume_agent1.sources = spooldir_source
flume_agent1.channels = mem_channel
flume_agent1.sinks = hdfs_sink

# Set the source type to Spooling Directory and set the directory
# location to /home/cloudera/flume_agent1_data

flume_agent1.sources.spooldir_source.type = spooldir
flume_agent1.sources.spooldir_source.spoolDir = /home/cloudera/flume_agent1_data
flume_agent1.sources.spooldir_source.basenameHeader = true

# Configure the channel as simple in-memory queue
flume_agent1.channels.mem_channel.type = memory
flume_agent1.channels.mem_channel.capacity = 1000

# Define the HDFS sink and set its path to your target HDFS directory
flume_agent1.sinks.hdfs_sink.type = hdfs
flume_agent1.sinks.hdfs_sink.hdfs.path = hdfs://quickstart.cloudera:8020/user/cloudera/data/flume/flume_agent1_data
flume_agent1.sinks.hdfs_sink.hdfs.fileType = DataStream
flume_agent1.sinks.hdfs_sink.hdfs.writeFormat = Text

# Disable rollover functionality as we want to keep the original files
flume_agent1.sinks.hdfs_sink.hdfs.rollCount = 0
flume_agent1.sinks.hdfs_sink.hdfs.rollInterval = 0
flume_agent1.sinks.hdfs_sink.hdfs.rollSize = 0
flume_agent1.sinks.hdfs_sink.hdfs.idleTimeout = 0

# Set the files to their original name
flume_agent1.sinks.hdfs_sink.hdfs.filePrefix = %{basename}

# Connect the source and the sink to the channel
flume_agent1.sources.spooldir_source.channels = mem_channel
flume_agent1.sinks.hdfs_sink.channel = mem_channel

Run Flume Agent

flume-ng agent -n flume_agent1 -f /home/cloudera/flume_example/flume-agent-spooldir.conf


Happy Learning !!!
