A very basic configuration file looks like this:
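As a minimal sketch of such a file: the system name `example-system`, the stream, store, and reporter names, the `samza.task.example.MyJavaStreamerTask` class, and the `samza.stream.example.ExampleConsumerFactory` class are placeholders assumed for illustration, not names defined in this section. The comment headers only roughly follow the sections described here, and the exact keys may vary between Samza versions.

```properties
# Application: job name, job factory, task class, and serde registry.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=hello-world
# Placeholder class name; replace with your own StreamTask implementation.
task.class=samza.task.example.MyJavaStreamerTask
task.inputs=example-system.example-stream
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory

# Systems & Streams: a placeholder system and the serdes for its keys and messages.
systems.example-system.samza.factory=samza.stream.example.ExampleConsumerFactory
systems.example-system.samza.key.serde=string
systems.example-system.samza.msg.serde=json

# Checkpointing: where the message-processing state is saved.
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory

# State Storage: a placeholder key-value store.
stores.example-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.example-store.key.serde=string
stores.example-store.msg.serde=json

# Metrics: a placeholder reporter name bound to a reporter factory.
metrics.reporters=example-reporter
metrics.reporter.example-reporter.class=org.apache.samza.metrics.reporter.JmxReporterFactory
```

Comments sit on their own lines because a trailing `#` on a value line would become part of the value in a properties file.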
There are six sections in a configuration file:
- The Application section defines things like the name of the job, the job factory (see the job.factory.class property in the Configuration Table), the class name of your StreamTask, and the serialization and deserialization of specific objects that are received from and sent to different streams.
- The Systems & Streams section defines the systems that your StreamTask can read from, along with the types of serdes used for the keys and messages of each system. You may use any of the predefined systems that Samza ships with, or you can specify your own self-implemented, Samza-compatible systems. See the Wikipedia system in the hello-samza example project for a good example of a self-implemented system.
- The Checkpointing section defines how the message-processing state is saved, which provides fault-tolerant processing of streams (see Checkpointing for more details).
- The State Storage section defines the stateful stream processing settings for Samza.
- The Deployment section defines how the Samza application will be deployed (either to a cluster manager (YARN) or as a standalone library), as well as the settings for each option. See Deployment Models for more details.
- The Metrics section defines how the Samza application's metrics will be monitored and collected. See Monitoring for more details.
Note that configuration keys prefixed with `sensitive.` are treated specially: the values associated with such keys
are masked in logs and in Samza’s YARN ApplicationMaster UI. This only guards against accidental disclosure; no
encryption is done.
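For illustration, a key using this prefix might look like the sketch below. The key name `sensitive.example-service.password` and its value are hypothetical and are not properties defined by Samza.

```properties
# Hypothetical key: the "sensitive." prefix marks this value for masking
# in logs and in the YARN ApplicationMaster UI.
sensitive.example-service.password=not-a-real-password
```

The value is still stored as plain text in the configuration file itself, so access to the file should be restricted accordingly.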