SamzaContainer
The SamzaContainer is responsible for managing the startup, execution, and shutdown of one or more StreamTask instances. Each SamzaContainer typically runs as an independent Java virtual machine. A Samza job can consist of several SamzaContainers, potentially running on different machines.
When a SamzaContainer starts up, it does the following:
- Get last checkpointed offset for each input stream partition that it consumes
- Create a “reader” thread for every input stream partition that it consumes
- Start metrics reporters to report metrics
- Start a checkpoint timer to periodically save your task’s input stream offsets
- Start a window timer to trigger your task’s window method, if it is defined
- Instantiate and initialize your StreamTask once for each input stream partition
- Start an event loop that takes messages from the input stream reader threads, and gives them to your StreamTasks
- Notify lifecycle listeners during each one of these steps
Let’s start in the middle, with the instantiation of a StreamTask. The following sections of the documentation cover the other steps.
Tasks and Partitions
When the container starts, it creates instances of the task class that you’ve written. If the task class implements the InitableTask interface, the SamzaContainer will also call the init() method.
/** Implement this if you want a callback when your task starts up. */
public interface InitableTask {
  void init(Config config, TaskContext context);
}
The number of instances of your task class that are created depends on the number of partitions in the job’s input streams. If the input stream of your Samza job has ten partitions, there will be ten instances of your task class: one for each partition. Partitions are numbered from zero, so the first task instance receives all messages for partition 0, the second instance receives all messages for partition 1, and so on.
The number of partitions in the input streams is determined by the systems from which you are consuming. For example, if your input system is Kafka, you can specify the number of partitions when you create a topic.
If a Samza job has more than one input stream, the number of task instances for the Samza job is the maximum number of partitions across all input streams. For example, if a Samza job is reading from PageViewEvent (12 partitions), and ServiceMetricEvent (14 partitions), then the Samza job would have 14 task instances (numbered 0 through 13). Task instances 12 and 13 only receive events from ServiceMetricEvent, because there is no corresponding PageViewEvent partition.
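The rule above can be checked with a few lines of plain Java. This is only an illustrative sketch that does not use the Samza API; the class and method names (TaskCountSketch, taskCount, streamsForTask) are hypothetical, and the stream names and partition counts are the ones from the example in the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Samza: models the "maximum partition count" rule.
class TaskCountSketch {
    /** Number of task instances = largest partition count among the input streams. */
    static int taskCount(Map<String, Integer> partitionsPerStream) {
        int max = 0;
        for (int partitions : partitionsPerStream.values()) {
            max = Math.max(max, partitions);
        }
        return max;
    }

    /** Streams that have a partition numbered taskId, i.e. the streams that task reads. */
    static List<String> streamsForTask(Map<String, Integer> partitionsPerStream, int taskId) {
        List<String> streams = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : partitionsPerStream.entrySet()) {
            if (taskId < entry.getValue()) {
                streams.add(entry.getKey());
            }
        }
        return streams;
    }

    public static void main(String[] args) {
        Map<String, Integer> inputs = new LinkedHashMap<>();
        inputs.put("PageViewEvent", 12);
        inputs.put("ServiceMetricEvent", 14);

        System.out.println(taskCount(inputs));           // 14
        System.out.println(streamsForTask(inputs, 0));   // [PageViewEvent, ServiceMetricEvent]
        System.out.println(streamsForTask(inputs, 13));  // [ServiceMetricEvent]
    }
}
```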
There is work underway to make the assignment of partitions to tasks more flexible in future versions of Samza.
Containers and resource allocation
Although the number of task instances is fixed — determined by the number of input partitions — you can configure how many containers you want to use for your job. If you are using YARN, the number of containers determines what CPU and memory resources are allocated to your job.
If the data volume on your input streams is small, it might be sufficient to use just one SamzaContainer. In that case, Samza still creates one task instance per input partition, but all those tasks run within the same container. At the other extreme, you can create as many containers as you have partitions, and Samza will assign one task instance to each container.
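To make the two extremes concrete, the sketch below spreads a fixed set of task instances over a configurable number of containers. The text does not say which assignment strategy Samza uses, so the round-robin rule here is an assumption chosen for illustration, and ContainerAssignmentSketch and assign are hypothetical names rather than Samza classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Samza code: round-robin is an assumed strategy.
class ContainerAssignmentSketch {
    /** Spreads task instances 0..taskCount-1 over containerCount containers. */
    static List<List<Integer>> assign(int taskCount, int containerCount) {
        if (containerCount < 1 || containerCount > taskCount) {
            throw new IllegalArgumentException(
                "container count must be between 1 and the number of tasks");
        }
        List<List<Integer>> containers = new ArrayList<>();
        for (int c = 0; c < containerCount; c++) {
            containers.add(new ArrayList<>());
        }
        for (int task = 0; task < taskCount; task++) {
            containers.get(task % containerCount).add(task);
        }
        return containers;
    }

    public static void main(String[] args) {
        System.out.println(assign(4, 1)); // [[0, 1, 2, 3]]   one container runs every task
        System.out.println(assign(4, 2)); // [[0, 2], [1, 3]]
        System.out.println(assign(4, 4)); // [[0], [1], [2], [3]]   one task per container
    }
}
```

In every case the number of task instances stays at four; only their grouping into containers changes.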
Each SamzaContainer is designed to use one CPU core, so it uses a single-threaded event loop for execution. It’s not advisable to create your own threads within a SamzaContainer. If you need more parallelism, please configure your job to use more containers.
Any state in your job belongs to a task instance, not to a container. This is a key design decision for Samza’s scalability: as your job’s resource requirements grow and shrink, you can simply increase or decrease the number of containers, but the number of task instances remains unchanged. As you scale up or down, the same state remains attached to each task instance. Task instances may be moved from one container to another, and any persistent state managed by Samza will be moved with them. This allows the job’s processing semantics to remain unchanged, even as you change the job’s parallelism.
Joining multiple input streams
If your job has multiple input streams, Samza provides a simple but powerful mechanism for joining data from different streams: each task instance receives messages from one partition of each of the input streams. For example, say you have two input streams, A and B, each with four partitions. Samza creates four task instances to process them, and assigns the partitions as follows:
Task instance | Consumes stream partitions |
---|---|
0 | stream A partition 0, stream B partition 0 |
1 | stream A partition 1, stream B partition 1 |
2 | stream A partition 2, stream B partition 2 |
3 | stream A partition 3, stream B partition 3 |
Thus, if you want two events in different streams to be processed by the same task instance, you need to ensure they are sent to the same partition number. You can achieve this by using the same partitioning key when sending the messages. Joining streams is discussed in detail in the state management section.
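As a rough illustration of why a shared partitioning key works, the sketch below maps a key to a partition number with a simple hash. The hash rule is an assumption for illustration only: the partitioner your producer actually uses depends on the messaging system and its configuration, and CoPartitionSketch and partitionFor are hypothetical names. The point is that both streams must use the same rule and the same partition count.

```java
// Hypothetical model of a key-based partitioner; real systems may hash differently.
class CoPartitionSketch {
    /** Maps a key to a partition number in the range 0..partitionCount-1. */
    static int partitionFor(String key, int partitionCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitionCount = 4;      // streams A and B both have four partitions
        String userId = "user-42";   // hypothetical join key present in both streams

        int partitionInA = partitionFor(userId, partitionCount);
        int partitionInB = partitionFor(userId, partitionCount);

        // Same key, same rule, same partition count: both messages land on the
        // same partition number and so reach the same task instance.
        System.out.println(partitionInA == partitionInB); // true
    }
}
```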
There is one caveat in all of this: Samza currently assumes that a stream’s partition count will never change. Partition splitting or repartitioning is not supported. If an input stream has N partitions, it is expected that it has always had, and will always have, N partitions. If you want to repartition a stream, you can write a job that reads messages from the stream and writes them out to a new stream with the required number of partitions. For example, you could read messages from PageViewEvent, and write them to PageViewEventRepartition.
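The core of such a repartitioning job is small. The sketch below models it in plain Java with in-memory lists standing in for the source and target streams. It does not use the Samza API: the Message type, the class name, and the hash-based routing rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a repartitioning job, not Samza code.
class RepartitionSketch {
    /** Stand-in for a keyed message; a real job would use its own envelope type. */
    static final class Message {
        final String key;
        final String value;

        Message(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Reads every message from the source partitions and routes it by key to a new partition. */
    static List<List<Message>> repartition(List<List<Message>> source, int newPartitionCount) {
        List<List<Message>> target = new ArrayList<>();
        for (int p = 0; p < newPartitionCount; p++) {
            target.add(new ArrayList<>());
        }
        for (List<Message> partition : source) {
            for (Message message : partition) {
                int newPartition = Math.floorMod(message.key.hashCode(), newPartitionCount);
                target.get(newPartition).add(message);
            }
        }
        return target;
    }

    public static void main(String[] args) {
        List<List<Message>> pageViewEvent = new ArrayList<>();
        pageViewEvent.add(List.of(new Message("alice", "view-1"), new Message("bob", "view-2")));
        pageViewEvent.add(List.of(new Message("alice", "view-3")));

        // Write the two source partitions out to a stream with three partitions.
        List<List<Message>> pageViewEventRepartition = repartition(pageViewEvent, 3);
        System.out.println(pageViewEventRepartition.size()); // 3
    }
}
```

The original stream is left untouched, so its partition count never changes; downstream jobs simply consume the new stream instead.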