Writing to HDFS
samza-hdfs module implements a Samza Producer to write to HDFS. The current implementation includes a ready-to-use
HdfsSystemProducer, and three
HdfsWriters: One that writes messages of raw bytes to a
BytesWritable keys and values. Another writes UTF-8
Strings to a
LongWritable keys and
The last one writes out Avro data files including the schema automatically reflected from the POJO objects fed to it.
Configuring an HdfsSystemProducer
You can configure an HdfsSystemProducer like any other Samza system: using configuration keys and values set in a
You might configure the system producer for use by your
StreamTasks like this:
# set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs-clickstream' systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory # define a serializer/deserializer for the hdfs-clickstream system # DO NOT define (i.e. comment out) a SerDe when using the AvroDataFileHdfsWriter so it can reflect the schema systems.hdfs-clickstream.samza.msg.serde=some-serde-impl # consumer configs not needed for HDFS system, reader is not implemented yet # Assign a Metrics implementation via a label we defined earlier in the props file systems.hdfs-clickstream.streams.metrics.samza.msg.serde=some-metrics-impl # Assign the implementation class for this system's HdfsWriter systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.TextSequenceFileHdfsWriter #systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.AvroDataFileHdfsWriter # Set compression type supported by chosen Writer. Only BLOCK compression is supported currently # AvroDataFileHdfsWriter supports snappy, bzip2, deflate or none (null, anything other than the first three) systems.hdfs-clickstream.producer.hdfs.compression.type=snappy # The base dir for HDFS output. The default Bucketer for SequenceFile HdfsWriters # is currently /BASE/JOB_NAME/DATE_PATH/FILES, where BASE is set below systems.hdfs-clickstream.producer.hdfs.base.output.dir=/user/me/analytics/clickstream_data # Assign the implementation class for the HdfsWriter's Bucketer systems.hdfs-clickstream.producer.hdfs.bucketer.class=org.apache.samza.system.hdfs.writer.JobNameDateTimeBucketer # Configure the DATE_PATH the Bucketer will set to bucket output files by day for this job run. systems.hdfs-clickstream.producer.hdfs.bucketer.date.path.format=yyyy_MM_dd # Optionally set the max output bytes (records for AvroDataFileHdfsWriter) per file. # A new file will be cut and output continued on the next write call each time this many bytes # (records for AvroDataFileHdfsWriter) are written. systems.hdfs-clickstream.producer.hdfs.write.batch.size.bytes=134217728 #systems.hdfs-clickstream.producer.hdfs.write.batch.size.records=10000
The above configuration assumes a Metrics and Serde implemnetation has been properly configured against the
some-metrics-impl labels somewhere else in the same
job.properties file. Each of these properties has a reasonable default, so you can leave out the ones you don’t need to customize for your job run.