The following table lists all the standard properties that can be included in a Samza job configuration file.
Placeholder words such as system-name, stream-name, store-name and serde-name stand for names that you choose yourself.
Name | Default | Description |
---|---|---|
Samza job configuration | ||
job.factory.class |
Required: The job factory to use for running this job.
The value is a fully-qualified Java classname, which must implement
StreamJobFactory.
Samza ships with two implementations:
|
|
job.name | Required: The name of your job. This name appears on the Samza dashboard, and it is used to distinguish this job's checkpoints from those of other jobs. | |
job.id | 1 |
If you run several instances of your job at the same time, you need to give each execution a
different job.id. This is important, since otherwise the jobs will overwrite each
other's checkpoints, and perhaps interfere with each other in other ways.
|
job.config.rewriter.rewriter-name.class |
You can optionally define configuration rewriters, which have the opportunity to dynamically
modify the job configuration before the job is started. For example, this can be useful for
pulling configuration from an external configuration management system, or for determining
the set of input streams dynamically at runtime. The value of this property is a
fully-qualified Java classname which must implement
ConfigRewriter.
Samza ships with one rewriter by default:
|
|
job.config.rewriters | If you have defined configuration rewriters, you need to list them here, in the order in which they should be applied. The value of this property is a comma-separated list of rewriter-name tokens. | |
job.systemstreampartition.grouper.factory |
org.apache.samza.container.grouper.stream.GroupByPartitionFactory |
A factory class that is used to determine how input SystemStreamPartitions are grouped together for processing in individual StreamTask instances. The factory must implement the SystemStreamPartitionGrouperFactory interface. Once this configuration is set, it can't be changed, since doing so could violate state semantics, and lead to a loss of data.
|
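As an illustration of the job-level properties above, the following fragment sketches a job that is started as a second, concurrent execution. The job name `page-view-counter` is hypothetical; the factory class is the YARN factory referred to later in this table.

```properties
# Hypothetical job: the name and ID below are placeholders, not defaults.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=page-view-counter
# A second concurrent execution of the same job gets its own ID (the default is 1),
# so that the two executions do not overwrite each other's checkpoints.
job.id=2
```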
Task configuration | ||
task.class | Required: The fully-qualified name of the Java class which processes incoming messages from input streams. The class must implement StreamTask, and may optionally implement InitableTask, ClosableTask and/or WindowableTask. The class will be instantiated several times, once for every input stream partition. | |
task.inputs |
Required: A comma-separated list of streams that are consumed by this job.
Each stream is given in the format
system-name.stream-name.
For example, if you have one input system called my-kafka, and want to consume two
Kafka topics called PageViewEvent and UserActivityEvent, then you would set
task.inputs=my-kafka.PageViewEvent, my-kafka.UserActivityEvent .
|
|
task.window.ms | -1 | If task.class implements WindowableTask, it can receive a windowing callback at regular intervals. This property specifies the time between window() calls, in milliseconds. If the number is negative (the default), window() is never called. Note that Samza is single-threaded, so a window() call will never occur concurrently with the processing of a message. If a message is being processed at the time when a window() call is due, the window() call occurs after the processing of the current message has completed. |
task.checkpoint.factory |
To enable checkpointing, you must set
this property to the fully-qualified name of a Java class that implements
CheckpointManagerFactory.
This is not required, but recommended for most jobs. If you don't configure checkpointing,
and a job or container restarts, it does not remember which messages it has already processed.
Without checkpointing, consumer behavior is determined by the
...samza.offset.default
setting, which by default skips any messages that were published while the container was
restarting. Checkpointing allows a job to start up where it previously left off.
Samza ships with two checkpoint managers by default:
|
|
task.commit.ms | 60000 | If task.checkpoint.factory is configured, this property determines how often a checkpoint is written. The value is the time between checkpoints, in milliseconds. The frequency of checkpointing affects failure recovery: if a container fails unexpectedly (e.g. due to crash or machine failure) and is restarted, it resumes processing at the last checkpoint. Any messages processed since the last checkpoint on the failed container are processed again. Checkpointing more frequently reduces the number of messages that may be processed twice, but also uses more resources. |
task.command.class | org.apache.samza.job.ShellCommandBuilder |
The fully-qualified name of the Java class which determines the command line and environment
variables for a container. It must be a subclass of
CommandBuilder.
This defaults to task.command.class=org.apache.samza.job.ShellCommandBuilder .
|
task.opts |
Any JVM options to include in the command line when executing Samza containers. For example,
this can be used to set the JVM heap size, to tune the garbage collector, or to enable
remote debugging.
This cannot be used when running with ThreadJobFactory. Anything you put in
task.opts is forwarded directly to the command line as part of the JVM invocation.
|
|
task.java.home |
The JAVA_HOME path for Samza containers. By setting this property, you can use a Java version that is
different from your cluster's Java version. Remember to set yarn.am.java.home as well.
|
|
task.execute | bin/run-container.sh | The command that starts a Samza container. The script must be included in the job package. There is usually no need to customize this. |
task.chooser.class | org.apache.samza.system.chooser.RoundRobinChooserFactory |
This property can be optionally set to override the default message chooser, which determines the order in which messages from multiple input streams are processed. The value of this property is the fully-qualified name of a Java class that implements MessageChooserFactory. |
task.lifecycle.listener.listener-name.class |
Use this property to register a lifecycle listener, which can receive a notification when a container starts up or shuts down, or when a message is processed. The value is the fully-qualified name of a Java class that implements TaskLifecycleListenerFactory. You can define multiple lifecycle listeners, each with a different listener-name, and reference them in task.lifecycle.listeners. | |
task.lifecycle.listeners | If you have defined lifecycle listeners with task.lifecycle.listener.*.class, you need to list them here in order to enable them. The value of this property is a comma-separated list of listener-name tokens. | |
task.drop.deserialization.errors | This property defines how the system handles deserialization failures. If set to true, the system skips the messages that fail to deserialize and keeps running. If set to false, the system throws an exception and fails the container. The default is false. | |
task.drop.serialization.errors | This property defines how the system handles serialization failures. If set to true, the system drops the messages that fail to serialize and keeps running. If set to false, the system throws an exception and fails the container. The default is false. | |
task.poll.interval.ms | Samza's container polls for more messages under two conditions. The first condition arises when there are simply no remaining buffered messages to process for any input SystemStreamPartition. The second condition arises when some input SystemStreamPartitions have empty buffers, but some do not. In the latter case, a polling interval is defined to determine how often to refresh the empty SystemStreamPartition buffers. By default, this interval is 50ms, which means that any empty SystemStreamPartition buffer will be refreshed at least every 50ms. A higher value here means that empty SystemStreamPartitions will be refreshed less often, which means more latency is introduced, but less CPU and network will be used. Decreasing this value means that empty SystemStreamPartitions are refreshed more frequently, thereby introducing less latency, but increasing CPU and network utilization. | |
task.ignored.exceptions |
This property specifies which exceptions should be ignored if thrown in a task's process or window
methods. The value is a comma-separated list of fully-qualified exception class names, or
* to ignore all exceptions.
|
|
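The task properties above might be combined as in the following sketch. The class `com.example.PageViewCounterTask` and the system name `my-kafka` are hypothetical; the checkpoint factory is the Kafka checkpoint manager named later in this table, and the numeric values are illustrative rather than recommended.

```properties
# com.example.PageViewCounterTask is a hypothetical class implementing StreamTask
# (and WindowableTask, since task.window.ms is set to a positive value).
task.class=com.example.PageViewCounterTask
task.inputs=my-kafka.PageViewEvent,my-kafka.UserActivityEvent
# Call window() roughly every 10 seconds; a negative value (the default) disables it.
task.window.ms=10000
# Enable checkpointing and write a checkpoint every 30 seconds (the default is 60000 ms).
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=my-kafka
task.commit.ms=30000
# JVM options are forwarded to the container's command line.
task.opts=-Xmx512m
# Skip messages that cannot be deserialized instead of failing the container.
task.drop.deserialization.errors=true
```

A shorter task.commit.ms reduces the number of messages reprocessed after a failure, at the cost of more frequent checkpoint writes.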
Systems (input and output streams) | ||
systems.system-name.samza.factory |
Required: The fully-qualified name of a Java class which provides a
system. A system can provide input streams which you can consume in your Samza job,
or output streams to which you can write, or both. The requirements on a system are very
flexible — it may connect to a message broker, or read and write files, or use a database,
or anything else. The class must implement
SystemFactory.
Samza ships with the following implementations:
|
|
systems.system-name.samza.key.serde |
The serde which will be used to deserialize the key of messages on input streams, and to serialize the key of messages on output streams. This property can be defined either for an individual stream, or for all streams within a system (if both are defined, the stream-level definition takes precedence). The value of this property must be a serde-name that is registered with serializers.registry.*.class. If this property is not set, messages are passed unmodified between the input stream consumer, the task and the output stream producer. | |
systems.system-name.streams.stream-name.samza.key.serde |
||
systems.system-name.samza.msg.serde |
The serde which will be used to deserialize the value of messages on input streams, and to serialize the value of messages on output streams. This property can be defined either for an individual stream, or for all streams within a system (if both are defined, the stream-level definition takes precedence). The value of this property must be a serde-name that is registered with serializers.registry.*.class. If this property is not set, messages are passed unmodified between the input stream consumer, the task and the output stream producer. | |
systems.system-name.streams.stream-name.samza.msg.serde |
||
systems.system-name.samza.offset.default |
upcoming |
If a container starts up without a checkpoint,
this property determines where in the input stream we should start consuming. The value must be an
OffsetType,
one of the following:
|
systems.system-name.streams.stream-name.samza.offset.default |
||
systems.system-name.streams.stream-name.samza.reset.offset |
false |
If set to true, when a Samza container starts up, it ignores any
checkpointed offset for this particular input
stream. Its behavior is thus determined by the samza.offset.default setting.
Note that the reset takes effect every time a container is started, which may be
every time you restart your job, or more frequently if a container fails and is restarted
by the framework.
|
systems.system-name.streams.stream-name.samza.priority |
-1 | If one or more streams have a priority set (any positive integer), they will be processed with higher priority than the other streams. You can set several streams to the same priority, or define multiple priority levels by assigning a higher number to the higher-priority streams. If a higher-priority stream has any messages available, they will always be processed first; messages from lower-priority streams are only processed when there are no new messages on higher-priority inputs. |
systems.system-name.streams.stream-name.samza.bootstrap |
false |
If set to true, this stream will be processed as a
bootstrap stream. This means that every time
a Samza container starts up, this stream will be fully consumed before messages from any
other stream are processed.
|
task.consumer.batch.size | 1 | If set to a positive integer, the task will try to consume batches with the given number of messages from each input stream, rather than consuming round-robin from all the input streams on each individual message. Setting this property can improve performance in some cases. |
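The per-stream settings above can be sketched as follows. All stream names are hypothetical, and `oldest` is assumed here to be a valid OffsetType value alongside the default `upcoming`.

```properties
# UserProfile is treated as a bootstrap stream: it is fully consumed at container
# start-up before messages from any other stream are processed.
systems.my-kafka.streams.UserProfile.samza.bootstrap=true
# Ignore any checkpointed offset for this stream every time a container starts,
systems.my-kafka.streams.UserProfile.samza.reset.offset=true
# so that the starting position is decided by samza.offset.default instead
# ("oldest" is an assumed OffsetType value).
systems.my-kafka.streams.UserProfile.samza.offset.default=oldest
# A higher number means a higher priority: available PageViewEvent messages are
# always processed before UserActivityEvent messages.
systems.my-kafka.streams.PageViewEvent.samza.priority=2
systems.my-kafka.streams.UserActivityEvent.samza.priority=1
```

Because the reset applies on every container start, including restarts after a failure, this combination is only suitable for streams that are safe to re-read.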
Serializers/Deserializers (Serdes) | ||
serializers.registry.serde-name.class |
Use this property to register a serializer/deserializer,
which defines a way of encoding application objects as an array of bytes (used for messages
in streams, and for data in persistent storage). You can give a serde any
serde-name you want, and reference that name in properties like
systems.*.samza.key.serde,
systems.*.samza.msg.serde,
stores.*.key.serde and
stores.*.msg.serde.
The value of this property is the fully-qualified name of a Java class that implements
SerdeFactory.
Samza ships with several serdes:
|
|
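A minimal sketch of registering serdes and referencing them by name is shown below. The serde names `string` and `json` are arbitrary, the stream name `RawLogLines` is hypothetical, and the two factory classes are assumed to be among the serdes that ship with Samza.

```properties
# Register two serdes under names of your choosing (factory classes are assumptions).
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
# System-level defaults for every stream in the my-kafka system.
systems.my-kafka.samza.key.serde=string
systems.my-kafka.samza.msg.serde=json
# A stream-level definition takes precedence over the system-level one.
systems.my-kafka.streams.RawLogLines.samza.msg.serde=string
```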
Using the filesystem for checkpoints (This section applies if you have set task.checkpoint.factory = org.apache.samza.checkpoint.file.FileSystemCheckpointManagerFactory)
|
||
task.checkpoint.path | Required if you are using the filesystem for checkpoints. Set this to the path on your local filesystem where checkpoint files should be stored. | |
Using Kafka for input streams, output streams and checkpoints (This section applies if you have set systems.*.samza.factory = org.apache.samza.system.kafka.KafkaSystemFactory)
|
||
systems.system-name.consumer.zookeeper.connect |
The hostname and port of one or more ZooKeeper nodes where information about the
Kafka cluster can be found. This is given as a comma-separated list of
hostname:port pairs, such as
zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181 .
If the cluster information is at some sub-path of the ZooKeeper namespace, you need to
include the path at the end of the list of hostnames, for example:
zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/clusters/my-kafka
|
|
systems.system-name.consumer.auto.offset.reset |
largest |
This setting determines what happens if a consumer attempts to read an offset that is
outside of the current valid range. This could happen if the topic does not exist, or
if a checkpoint is older than the maximum message history retained by the brokers.
This property is not to be confused with
systems.*.samza.offset.default,
which determines what happens if there is no checkpoint. The following are valid
values for auto.offset.reset:
|
systems.system-name.consumer.* |
Any Kafka consumer configuration
can be included here. For example, to change the socket timeout, you can set
systems.system-name.consumer.socket.timeout.ms.
(There is no need to configure group.id or client.id,
as they are automatically configured by Samza. Also, there is no need to set
auto.commit.enable because Samza has its own checkpointing mechanism.)
|
|
systems.system-name.producer.metadata.broker.list |
A list of network endpoints where the Kafka brokers are running. This is given as
a comma-separated list of hostname:port pairs, for example
kafka1.example.com:9092,kafka2.example.com:9092,kafka3.example.com:9092 .
It's not necessary to list every single Kafka node in the cluster: Samza uses this
property in order to discover which topics and partitions are hosted on which broker.
This property is needed even if you are only consuming from Kafka, and not writing
to it, because Samza uses it to discover metadata about streams being consumed.
|
|
systems.system-name.producer.producer.type |
sync |
Controls whether messages emitted from a stream processor should be buffered before
they are sent to Kafka. The options are:
|
systems.system-name.producer.* |
Any Kafka producer configuration
can be included here. For example, to change the request timeout, you can set
systems.system-name.producer.request.timeout.ms.
(There is no need to configure client.id as it is automatically
configured by Samza.)
|
|
systems.system-name.samza.fetch.threshold |
50000 | When consuming streams from Kafka, a Samza container maintains an in-memory buffer for incoming messages in order to increase throughput (the stream task can continue processing buffered messages while new messages are fetched from Kafka). This parameter determines the number of messages we aim to buffer across all stream partitions consumed by a container. For example, if a container consumes 50 partitions, it will try to buffer 1000 messages per partition by default. When the number of buffered messages falls below that threshold, Samza fetches more messages from the Kafka broker to replenish the buffer. Increasing this parameter can increase a job's processing throughput, but also increases the amount of memory used. |
task.checkpoint.system |
This property is required if you are using Kafka for checkpoints
(task.checkpoint.factory
= org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory ).
You must set it to the system-name of a Kafka system. The stream
name (topic name) within that system is automatically determined from the job name and ID:
__samza_checkpoint_${job.name}_${job.id}
(with underscores in the job name and ID replaced by hyphens).
|
|
task.checkpoint.replication.factor |
3 | If you are using Kafka for checkpoints, this is the number of Kafka nodes to which you want the checkpoint topic replicated for durability. |
task.checkpoint.segment.bytes |
26214400 | If you are using Kafka for checkpoints, this is the segment size to be used for the checkpoint topic's log segments. Keeping this number small is useful because it increases the frequency with which Kafka garbage-collects old checkpoints. |
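The Kafka-related properties in this section might fit together as in the sketch below. The system name `my-kafka`, the job name and the host names are placeholders; the checkpoint topic name in the final comment is derived from the naming rule described for task.checkpoint.system.

```properties
job.name=page_view_counter
job.id=1
systems.my-kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
# ZooKeeper nodes, with the cluster information stored under a sub-path.
systems.my-kafka.consumer.zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181/clusters/my-kafka
# Needed even for a consume-only job, since it is used to discover stream metadata.
systems.my-kafka.producer.metadata.broker.list=kafka1.example.com:9092,kafka2.example.com:9092
# 50000 buffered messages (the default) across 50 partitions is about 1000 per partition.
systems.my-kafka.samza.fetch.threshold=50000
# Store checkpoints in the same Kafka system.
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=my-kafka
task.checkpoint.replication.factor=3
# With the job name and ID above, the checkpoint topic would be
# __samza_checkpoint_page-view-counter_1 (underscores in the job name become hyphens).
```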
Consuming all Kafka topics matching a regular expression (This section applies if you have set job.config.rewriter.*.class = org.apache.samza.config.RegExTopicGenerator)
|
||
job.config.rewriter.rewriter-name.system |
Set this property to the system-name of the Kafka system from which you want to consume all matching topics. | |
job.config.rewriter.rewriter-name.regex |
A regular expression specifying which topics you want to consume within the Kafka system job.config.rewriter.*.system. Any topics matched by this regular expression will be consumed in addition to any topics you specify with task.inputs. | |
job.config.rewriter.rewriter-name.config.* |
Any properties specified within this namespace are applied to the configuration of streams
that match the regex in
job.config.rewriter.*.regex.
For example, you can set job.config.rewriter.*.config.samza.msg.serde to configure
the deserializer for messages in the matching streams, which is equivalent to setting
systems.*.streams.*.samza.msg.serde
for each topic that matches the regex.
|
|
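As a sketch of the regex rewriter properties above, the fragment below consumes every topic whose name starts with `PageView`. The rewriter name `regex-topics`, the system name `my-kafka` and the serde name `json` are hypothetical, and the serde is assumed to be registered elsewhere under serializers.registry.json.class.

```properties
# Rewriters must be listed here to be applied, in the order given.
job.config.rewriters=regex-topics
job.config.rewriter.regex-topics.class=org.apache.samza.config.RegExTopicGenerator
job.config.rewriter.regex-topics.system=my-kafka
# Matching topics are consumed in addition to anything listed in task.inputs.
job.config.rewriter.regex-topics.regex=PageView.*
# Applied to each matching stream, as if systems.my-kafka.streams.<topic>.samza.msg.serde
# had been set for every matching topic.
job.config.rewriter.regex-topics.config.samza.msg.serde=json
```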
Storage and State Management | ||
stores.store-name.factory |
This property defines a store, Samza's mechanism for efficient
stateful stream processing. You can give a
store any store-name, and use that name to get a reference to the
store in your stream task (call
TaskContext.getStore()
in your task's
init()
method). The value of this property is the fully-qualified name of a Java class that implements
StorageEngineFactory.
Samza currently ships with two storage engine implementations:
|
|
stores.store-name.key.serde | If the storage engine expects keys in the store to be simple byte arrays, this serde allows the stream task to access the store using another object type as key. The value of this property must be a serde-name that is registered with serializers.registry.*.class. If this property is not set, keys are passed unmodified to the storage engine (and the changelog stream, if appropriate). | |
stores.store-name.msg.serde | If the storage engine expects values in the store to be simple byte arrays, this serde allows the stream task to access the store using another object type as value. The value of this property must be a serde-name that is registered with serializers.registry.*.class. If this property is not set, values are passed unmodified to the storage engine (and the changelog stream, if appropriate). | |
stores.store-name.changelog | Samza stores are local to a container. If the container fails, the contents of the store are lost. To prevent loss of data, you need to set this property to configure a changelog stream: Samza then ensures that writes to the store are replicated to this stream, and the store is restored from this stream after a failure. The value of this property is given in the form system-name.stream-name. Any output stream can be used as changelog, but you must ensure that only one job ever writes to a given changelog stream (each instance of a job and each store needs its own changelog stream). | |
Using RocksDB for key-value storage (This section applies if you have set stores.*.factory = org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory)
|
||
stores.store-name.write.batch.size |
500 | For better write performance, the storage engine buffers writes and applies them to the underlying store in a batch. If the same key is written multiple times in quick succession, this buffer also deduplicates writes to the same key. This property is set to the number of key/value pairs that should be kept in this in-memory buffer, per task instance. The number cannot be greater than stores.*.object.cache.size. |
stores.store-name.object.cache.size |
1000 | Samza maintains an additional cache in front of RocksDB for frequently-accessed objects. This cache contains deserialized objects (avoiding the deserialization overhead on cache hits), in contrast to the RocksDB block cache (stores.*.container.cache.size.bytes), which caches serialized objects. This property determines the number of objects to keep in Samza's cache, per task instance. This same cache is also used for write buffering (see stores.*.write.batch.size). A value of 0 disables all caching and batching. |
stores.store-name.container.cache.size.bytes |
104857600 | The size of RocksDB's block cache in bytes, per container. If there are several task instances within one container, each is given a proportional share of this cache. Note that this is an off-heap memory allocation, so the container's total memory use is the maximum JVM heap size plus the size of this cache. |
stores.store-name.container.write.buffer.size.bytes |
33554432 | The amount of memory (in bytes) that RocksDB uses for buffering writes before they are written to disk, per container. If there are several task instances within one container, each is given a proportional share of this buffer. This setting also determines the size of RocksDB's segment files. |
stores.store-name.rocksdb.compression |
snappy | This property controls whether RocksDB should compress data on disk and in the block cache. The following values are valid: |
stores.store-name.rocksdb.block.size.bytes |
4096 | If compression is enabled, RocksDB groups approximately this many uncompressed bytes into one compressed block. You probably don't need to change this property. |
stores.store-name.rocksdb.bloomfilter.bits |
10 | In RocksDB, every SST file contains a Bloom filter, which is used to determine if the file may contain a given key. Setting the Bloom filter bit size lets developers trade off the accuracy of the Bloom filter against its memory usage. |
stores.store-name.rocksdb.compaction.style |
universal | This property controls the compaction style that RocksDB will employ when compacting its levels. The following values are valid: |
stores.store-name.rocksdb.num.write.buffers |
3 | Configures the number of write buffers that a RocksDB store uses. This allows RocksDB to continue taking writes to other buffers even while a given write buffer is being flushed to disk. |
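A RocksDB-backed store with a changelog could be declared roughly as follows. The store name `page-counts`, the changelog stream and the serde names are hypothetical, and the serdes are assumed to be registered under serializers.registry.*.class; the numeric values are the defaults listed above.

```properties
stores.page-counts.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.page-counts.key.serde=string
stores.page-counts.msg.serde=json
# Writes are replicated to this stream; only this job instance and this store may write to it.
stores.page-counts.changelog=my-kafka.page-counts-changelog
# The write batch size cannot be greater than the object cache size.
stores.page-counts.write.batch.size=500
stores.page-counts.object.cache.size=1000
# 104857600 bytes = 100 MB of off-heap block cache per container.
stores.page-counts.container.cache.size.bytes=104857600
```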
Using LevelDB for key-value storage (This section applies if you have set stores.*.factory = org.apache.samza.storage.kv.LevelDbKeyValueStorageEngineFactory)
|
||
stores.store-name.write.batch.size |
500 | For better write performance, the storage engine buffers writes and applies them to the underlying store in a batch. If the same key is written multiple times in quick succession, this buffer also deduplicates writes to the same key. This property is set to the number of key/value pairs that should be kept in this in-memory buffer, per task instance. The number cannot be greater than stores.*.object.cache.size. |
stores.store-name.object.cache.size |
1000 | Samza maintains an additional cache in front of LevelDB for frequently-accessed objects. This cache contains deserialized objects (avoiding the deserialization overhead on cache hits), in contrast to the LevelDB block cache (stores.*.container.cache.size.bytes), which caches serialized objects. This property determines the number of objects to keep in Samza's cache, per task instance. This same cache is also used for write buffering (see stores.*.write.batch.size). A value of 0 disables all caching and batching. |
stores.store-name.container.cache.size.bytes |
104857600 | The size of LevelDB's block cache in bytes, per container. If there are several task instances within one container, each is given a proportional share of this cache. Note that this is an off-heap memory allocation, so the container's total memory use is the maximum JVM heap size plus the size of this cache. |
stores.store-name.container.write.buffer.size.bytes |
33554432 | The amount of memory (in bytes) that LevelDB uses for buffering writes before they are written to disk, per container. If there are several task instances within one container, each is given a proportional share of this buffer. This setting also determines the size of LevelDB's segment files. |
stores.store-name.compaction.delete.threshold |
-1 | Setting this property forces a LevelDB compaction to be performed after a certain number of keys have been deleted from the store. This is used to work around performance issues in certain workloads. |
stores.store-name.leveldb.compression |
snappy |
This property controls whether LevelDB should compress data on disk and in the
block cache. The following values are valid:
|
stores.store-name.leveldb.block.size.bytes |
4096 | If compression is enabled, LevelDB groups approximately this many uncompressed bytes into one compressed block. You probably don't need to change this property. |
Running your job on a YARN cluster (This section applies if you have set job.factory.class = org.apache.samza.job.yarn.YarnJobFactory)
|
||
yarn.package.path |
Required for YARN jobs: The URL from which the job package can
be downloaded, for example a http:// or hdfs:// URL.
The job package is a .tar.gz file with a
specific directory structure.
|
|
yarn.container.count | 1 | The number of YARN containers to request for running your job. This is the main parameter for controlling the scale (allocated computing resources) of your job: to increase the parallelism of processing, you need to increase the number of containers. The minimum is one container, and the maximum number of containers is the number of task instances (usually the number of input stream partitions). Task instances are evenly distributed across the number of containers that you specify. |
yarn.container.memory.mb | 1024 | How much memory, in megabytes, to request from YARN per container of your job. Along with yarn.container.cpu.cores, this property determines how many containers YARN will run on one machine. If the container exceeds this limit, YARN will kill it, so it is important that the container's actual memory use remains below the limit. The amount of memory used is normally the JVM heap size (configured with task.opts), plus the size of any off-heap memory allocation (for example stores.*.container.cache.size.bytes), plus a safety margin to allow for JVM overheads. |
yarn.container.cpu.cores | 1 | The number of CPU cores to request from YARN per container of your job. Each node in the YARN cluster has a certain number of CPU cores available, so this number (along with yarn.container.memory.mb) determines how many containers can be run on one machine. Samza is single-threaded and designed to run on one CPU core, so you shouldn't normally need to change this property. |
yarn.container.retry.count |
8 | If a container fails, it is automatically restarted by YARN. However, if a container keeps failing shortly after startup, that indicates a deeper problem, so we should kill the job rather than retrying indefinitely. This property determines the maximum number of times we are willing to restart a failed container in quick succession (the time period is configured with yarn.container.retry.window.ms). Each container in the job is counted separately. If this property is set to 0, any failed container immediately causes the whole job to fail. If it is set to a negative number, there is no limit on the number of retries. |
yarn.container.retry.window.ms |
300000 |
This property determines how frequently a container is allowed to fail before we give up and
fail the job. If the same container has failed more than
yarn.container.retry.count
times, and the time between failures was less than
yarn.container.retry.window.ms (in milliseconds), then we fail the job.
There is no limit to the number of times we will restart a container if the time between
failures is greater than yarn.container.retry.window.ms.
|
yarn.am.container.memory.mb |
1024 | Each Samza job has one special container, the ApplicationMaster (AM), which manages the execution of the job. This property determines how much memory, in megabytes, to request from YARN for running the ApplicationMaster. |
yarn.am.opts | Any JVM options to include in the command line when executing the Samza ApplicationMaster. For example, this can be used to set the JVM heap size, to tune the garbage collector, or to enable remote debugging. | |
yarn.am.java.home |
The JAVA_HOME path for the Samza AM. By setting this property, you can use a Java version that is
different from your cluster's Java version. Remember to set task.java.home as well.
|
|
yarn.am.poll.interval.ms | 1000 | The Samza ApplicationMaster sends regular heartbeats to the YARN ResourceManager to confirm that it is alive. This property determines the time (in milliseconds) between heartbeats. |
yarn.am.jmx.enabled | true |
Determines whether a JMX server should be started on this job's YARN ApplicationMaster
(true or false).
|
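The memory guidance for yarn.container.memory.mb can be made concrete with a small budget. The sketch below assumes a single store using the default block cache size and a hypothetical package location; the numbers are illustrative, not a sizing recommendation.

```properties
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
# Hypothetical location of the job package.
yarn.package.path=hdfs://namenode.example.com/samza/page-view-counter-dist.tar.gz
yarn.container.count=4
# Memory budget per container: 512 MB heap (task.opts) plus 100 MB of off-heap block
# cache (stores.*.container.cache.size.bytes = 104857600) is 612 MB, which leaves
# roughly 400 MB of the 1024 MB request as a margin for JVM overheads.
task.opts=-Xmx512m
yarn.container.memory.mb=1024
yarn.container.cpu.cores=1
# Fail the job once a container has failed more than 8 times with less than
# 5 minutes (300000 ms) between failures.
yarn.container.retry.count=8
yarn.container.retry.window.ms=300000
```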
Metrics | ||
metrics.reporter.reporter-name.class |
Samza automatically tracks various metrics which are useful for monitoring the health
of a job, and you can also track your own metrics.
With this property, you can define any number of metrics reporters which send
the metrics to a system of your choice (for graphing, alerting etc). You give each reporter
an arbitrary reporter-name. To enable the reporter, you need
to reference the reporter-name in
metrics.reporters.
The value of this property is the fully-qualified name of a Java class that implements
MetricsReporterFactory.
Samza ships with these implementations by default:
|
|
metrics.reporters | If you have defined any metrics reporters with metrics.reporter.*.class, you need to list them here in order to enable them. The value of this property is a comma-separated list of reporter-name tokens. | |
metrics.reporter.reporter-name.stream |
If you have registered the metrics reporter
metrics.reporter.*.class
= org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory,
you need to set this property to configure the output stream to which the metrics data
should be sent. The stream is given in the form
system-name.stream-name,
and the system must be defined in the job configuration. It's fine for many different jobs
to publish their metrics to the same metrics stream. Samza defines a simple
JSON encoding for metrics; in order to use this
encoding, you also need to configure a serde for the metrics stream:
|
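A metrics reporter that publishes snapshots to a stream might be configured along these lines. The reporter name `snapshot`, the stream `my-kafka.metrics` and the serde name `metrics` are hypothetical, and the serde factory class is an assumption about which class provides Samza's JSON metrics encoding.

```properties
# Reporters must be listed here to be enabled.
metrics.reporters=snapshot
metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
# The my-kafka system must be defined elsewhere in the job configuration.
metrics.reporter.snapshot.stream=my-kafka.metrics
# Serde for the metrics stream (the factory class is an assumption).
serializers.registry.metrics.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory
systems.my-kafka.streams.metrics.samza.msg.serde=metrics
```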