Wednesday, 30 November 2016

Single View Demo Using Zeppelin


This demo helps a retailer get a single view of a product by combining data from different sources and running interactive queries against the combined data.
In this case we collect data from three different sources:
  • ERP/CRM data from the EDW/DB - ingested using Apache Sqoop
  • Web traffic data - ingested using HDF
  • Social media data - ingested using HDF
Steps:
  1. Pre-reqs: Setup and import retail data set into DB/EDW (in this case: Postgres)
  2. Use Sqoop to import CRM/ERP data from DB/EDW into Hive
  3. Use HDF to import related tweets into Hive
  4. Use HDF to import simulated web traffic logs into Hive
  5. Use Hive to analyze tables to populate statistics
  6. Use Hive to correlate the data from multiple data sources

1. Prerequisites:
  • First, give the zeppelin user sudo privileges:
echo "zeppelin  ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
  • Second, enable the zeppelin user to log in to Postgres over localhost:
echo "host all all 127.0.0.1/32 md5" >> /var/lib/pgsql/data/pg_hba.conf

  • Create zeppelin user in Postgres
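A minimal sketch of this step (the password below is only a placeholder; choose your own):
sudo -u postgres psql -c "CREATE USER zeppelin WITH PASSWORD 'zeppelin';"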
  • Setup Sqoop for Postgres
sudo wget https://jdbc.postgresql.org/download/postgresql-9.4.1207.jar -P /usr/hdp/current/sqoop-client/lib
ls -la /usr/hdp/current/sqoop-client/lib/postgres*
  • Pull latest data from github
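Assuming the same GitHub repo that the NiFi template later in this post comes from, the clone step would be:
cd ~
git clone https://github.com/abajwa-hw/single-view-demo.git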
ls single-view-demo
  • Download retail data set
cd /tmp
unzip data.zip
head "/tmp/data/Fact Sales.csv"
  • Import data into Postgres
head -n 16 ~/single-view-demo/contoso-psql.sql
export PGPASSWORD=<postgres password>
psql -U <postgres username> -d <database name> -h localhost -f ~/single-view-demo/contoso-psql.sql
psql -U <postgres username> -d <database name> -h localhost -c "\dt"

2. EDW Optimization

  • Check access to the Postgres tables from Sqoop:
sqoop list-tables --connect jdbc:postgresql://localhost:5432/contoso --username <postgres username> --password <postgres password> -- --schema postgres
  • Make sure Hive is started
  • Load data from Postgres into Hive:
sqoop import-all-tables --username <postgres username> --password <postgres password> --connect jdbc:postgresql://localhost:5432/<database name> --hive-import --direct
  • Create final table in ORC format and populate that final table from staging
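A minimal sketch of this step, assuming the import above created a staging table named factsales (the real table and column names come from the demo notebook):
create table factsales_orc stored as orc as select * from factsales;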
  • Incremental import of data into Hive from RDBMS

Now that we have done the one-time bulk import, next we will set up an incremental Sqoop job. First we create a password file in HDFS containing the zeppelin user's Postgres password. This is done so that invocations of the job can be automated/scheduled without having to manually pass the password.
echo -n "<zeppelin user's postgres password>" > .password

hadoop fs -put .password /user/zeppelin/

rm .password

  • Create an incremental import Sqoop job for the final table (a sketch of the job definition follows this list):
  1. --table: the table the job is for (i.e. the final table)
  2. --password-file: the HDFS location of the password file
  3. --incremental: lastmodified (we want to use last-modified logic to find the delta records)
  4. --check-column: the column used to determine which delta records get picked up
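Putting those options together, a sketch of the job definition (the job, table, column, and directory names below are placeholders; substitute your own):
sqoop job --create import_final_incremental -- import \
  --connect jdbc:postgresql://localhost:5432/<database name> \
  --username <postgres username> \
  --password-file /user/zeppelin/.password \
  --table <final table> \
  --incremental lastmodified \
  --check-column <last-updated timestamp column> \
  --target-dir /tmp/<staging dir>
Run it later (or on a schedule) with: sqoop job --exec import_final_incremental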
  • In Hive, move data from the staging table to the final table (see the sketch after this list):
  1. first remove the records from the final table that are also found in the staging table
  2. copy the data from the staging table to the final table
  3. truncate the staging table
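A sketch of those three steps in HiveQL, assuming a final table named factsales_orc, a staging table named factsales_staging, and a key column named saleskey (all placeholder names):

-- 1. keep only the final-table rows that are NOT also in staging
insert overwrite table factsales_orc
  select f.* from factsales_orc f
  left outer join factsales_staging s on f.saleskey = s.saleskey
  where s.saleskey is null;

-- 2. append the staging data to the final table
insert into table factsales_orc select * from factsales_staging;

-- 3. empty the staging table
truncate table factsales_staging;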

3. Import related tweets into Hive
  • To install Nifi, start the 'Install Wizard': Open Ambari (http://sandbox.hortonworks.com:8080) then:
  • On bottom left -> Actions -> Add service -> check NiFi server -> Next -> Next -> Change any config you like (e.g. install dir, port, setup_prebuilt or values in nifi.properties) -> Next -> Deploy. This will kick off the install which will run for 5-10min.
  • Now open Nifi webui (http://sandbox.hortonworks.com:9090)
  • Download pre built Single_view_demo.xml template to your laptop from https://raw.githubusercontent.com/abajwa-hw/single-view-demo/master/template/Single_view_demo.xml
  • Import the Single_view_demo.xml template into Nifi
  • Fill out env specific configs e.g. your Twitter credentials
  • Do not start the flows yet
  • Twitter requires you to have an account and obtain developer keys by registering an “app”. Follow the steps below to do this:
  1. Create a Twitter account and app to get your consumer key/secret and access token/secret:
  2. Open https://apps.twitter.com
  3. Sign in
  4. Create a new app
  5. Fill in the required fields
  6. Create access tokens
  • Create a script to generate web logs and copy it to a directory accessible by NiFi:

cp /home/zeppelin/single-view-demo/createlog-psql.sh /tmp
cp /home/zeppelin/single-view-demo/DimCustomer.csv /tmp
echo "/tmp/createlog-psql.sh /tmp/data/DimCustomer.csv 20 >> /tmp/webtraffic.log" >> /tmp/generate- weblog.sh
chmod 777 /tmp/createlog-psql.sh
chmod 777 /tmp/generate-weblog.sh
chmod 444 /tmp/DimCustomer.csv
  • Now start the flows in NiFi using the play button to:
  1. simulate and ingest the tweets under /tmp/tweets_staging in HDFS
  2. ingest the weblogs under /tmp/weblog_staging in HDFS
  • Once data has started to flow into the tweets and weblog HDFS locations, let's (a sketch of these steps follows this list):
  1. update permissions on the HDFS dirs
  2. create Hive tables for the tweets and weblogs
  3. check that the Hive tables contain data
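A sketch of those steps, shown here for the weblog data only (the column list and delimiter are assumptions for illustration; the real DDL ships with the demo):

sudo -u hdfs hadoop fs -chmod -R 777 /tmp/tweets_staging /tmp/weblog_staging

create external table if not exists weblog (
  session string,
  visited timestamp,
  product string
)
row format delimited fields terminated by ','
location '/tmp/weblog_staging';

select count(*) from weblog;
A similar external table is created for the tweets over /tmp/tweets_staging.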

4. Analyze tables/columns

  • Now let's compute table-level statistics on the tables we are interested in, to improve query times:

analyze table tweets compute statistics
analyze table weblog compute statistics
analyze table <final table> compute statistics

  • Next let's compute column statistics for the same tables:
analyze table tweets compute statistics for columns
analyze table weblog compute statistics for columns
analyze table <final table> compute statistics for columns

5. Use Hive to correlate the data from different sources

  • Query for the top most-visited products:

select product,count(1) popular from weblog group by product order by popular desc limit 10

  • Query for the top products with the most time spent, by year:

select product, year_visited, sum(time_spent) as total_time
from (
  select table1.product, year(visited) as year_visited, month(visited) as month,
         unix_timestamp(table1.lead_window_0) - unix_timestamp(table1.visited) as time_spent
  from (
    select product, visited, lead(visited) over (PARTITION BY session ORDER BY visited asc)
    from weblog
  ) table1
  where table1.lead_window_0 is not NULL
) table2
group by product, year_visited
order by total_time desc limit 100

  • Query for the most popular web paths:

select web_path, count(1) as path_count
from (
  select session, concat_ws("->", collect_list(product)) as web_path
  from weblog
  group by session
) table1
group by web_path
order by path_count desc


6. Errors faced

  • Resource Manager start error:
While starting the sandbox in VirtualBox, it can get stuck loading the Resource Manager. In that case, log in as root and run the command below:

/usr/lib/hue/tools/start_scripts/start_ambari.sh
  • Zeppelin server start error:
If Ambari is unable to start the Zeppelin service, run the command below (it moves aside any notebook directory containing an empty JSON file, which blocks startup):
BAD_FILE=`find /usr/hdp/current/zeppelin-server/lib/notebook -name "*.json" -size 0` ; mv `dirname $BAD_FILE` /tmp/

  • NullPointerException on Hive tables in Zeppelin:
If your query is %hive select count(*) from table, the query actually sent looks like "elect count(*) from table". This is because the parser looks for a "(prefix_name)" and here mistakes "(*)" for a prefix. In that case use %hive(default) instead of %hive.
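For example, the workaround applied to a table from this demo:
%hive(default)
select count(*) from weblog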


  • To add the Hive user and password in the Zeppelin interpreter settings:

  • First create a Unix user:
ssh root@<server ip address> (e.g. ssh root@127.0.0.1 -p 2222)
adduser <username>
passwd <username>

  • Then create an Ambari user with the same user name

  • Then in the %hive interpreter properties set:
hive.user=<username>
hive.password=<password>

  • Then click the Save button

  • Now the Hive user is configured in the Hive interpreter