Run Hello Samza without Internet
This tutorial helps you run Hello Samza when you cannot connect to the internet.
Test Your Connection
Try to open a connection to irc.wikimedia.org, for example with telnet. Company firewalls sometimes block this service.
You should see something like this:
Trying 208.80.152.178...
Connected to ekrem.wikimedia.org.
Escape character is '^]'.
NOTICE AUTH :*** Processing connection to irc.pmtpa.wikimedia.org
NOTICE AUTH :*** Looking up your hostname...
NOTICE AUTH :*** Checking Ident
NOTICE AUTH :*** Found your hostname
If you do not see this output, you probably have a connection problem.
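If telnet is not installed, the same reachability check can be sketched in a few lines of Python. This is only an illustration: the helper name can_connect is made up for this example, and port 6667 (the conventional IRC port) is an assumption, since the page does not state which port the feed uses.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (port 6667 is an assumption, see the note above):
#   can_connect("irc.wikimedia.org", 6667)
```

A result of False suggests that the host is unreachable from your network, in which case the local-data approach in the next section applies.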
Use Local Data to Run Hello Samza
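We provide an alternative way to get the Wikipedia feed data. Instead of starting the Hello Samza job that reads the live Wikipedia feed, you will run a script that replays feed data stored locally.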
This script reads Wikipedia feed data from a local file and produces it to the Kafka broker. By default, it uses localhost:9092 as the Kafka broker and localhost:2181 as ZooKeeper. You can override both addresses.
Now you can go back to the Generate Wikipedia Statistics section in Hello Samza and follow the remaining steps.
A Little Explanation
The goal of the original feed step in Hello Samza is to deploy a Samza job that listens to the Wikipedia API, receives the feed in real time, and produces it to the Kafka topic wikipedia-raw. The alternative in this tutorial reads a local copy of the Wikipedia feed in an infinite loop and produces the data to the same topic. The follow-up job, wikipedia-parser, reads from wikipedia-raw, so as long as that topic contains correct data, the rest of the pipeline works. Samza jobs are connected only through Kafka and do not depend on each other directly.
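The replay idea can be illustrated with a minimal Python sketch. It is not the script shipped with Hello Samza. The function names replay_feed and run_against_kafka, the file path argument, and the use of the third-party kafka-python package are all assumptions made for this example. Only the topic name wikipedia-raw and the default broker address localhost:9092 come from the text above.

```python
import itertools
import time


def replay_feed(lines, send, topic="wikipedia-raw", delay_seconds=0.0, max_messages=None):
    """Cycle over `lines` and hand each one to send(topic, value_bytes).

    Runs forever when max_messages is None; otherwise stops after that many
    messages. Returns the number of messages sent.
    """
    sent = 0
    for line in itertools.cycle(lines):
        if max_messages is not None and sent >= max_messages:
            break
        send(topic, line.encode("utf-8"))
        sent += 1
        if delay_seconds:
            time.sleep(delay_seconds)  # optional throttle between messages
    return sent


def run_against_kafka(path, broker="localhost:9092"):
    """Hypothetical wiring to a real broker; assumes kafka-python is installed."""
    from kafka import KafkaProducer  # third-party package, an assumption here

    with open(path, encoding="utf-8") as f:
        lines = [ln.rstrip("\n") for ln in f if ln.strip()]  # skip blank lines
    producer = KafkaProducer(bootstrap_servers=broker)
    # producer.send(topic, value) matches the send(topic, value_bytes) callback.
    replay_feed(lines, producer.send, delay_seconds=0.1)
```

Because the producer is passed in as a plain callable, the loop can be exercised without a broker, for example by collecting messages in a list. Downstream jobs such as wikipedia-parser only see the contents of wikipedia-raw, so they cannot tell whether the data came from the live feed or from a replay like this one.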