Wednesday, 30 November 2016

Single View Demo Using Zeppelin


This demo helps a retailer get a single view of a product by combining data from different sources and running interactive queries against the combined data.
In this case we collect data from three different sources:
  • ERP/CRM data from the EDW/DB - ingested using Apache Sqoop
  • Web traffic data - ingested using HDF
  • Social media data - ingested using HDF
Steps:
  1. Pre-reqs: Setup and import retail data set into DB/EDW (in this case: Postgres)
  2. Use Sqoop to import CRM/ERP data from DB/EDW into Hive
  3. Use HDF to import related tweets into Hive
  4. Use HDF to import simulated web traffic logs into Hive
  5. Use Hive to analyze tables to populate statistics
  6. Use Hive to correlate the data from multiple data sources

1. Prerequisites:
  • First, give the zeppelin user sudo privileges:
echo "zeppelin  ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
  • Second, enable the zeppelin user to log in to Postgres over localhost:
echo "host all all 127.0.0.1/32 md5" >> /var/lib/pgsql/data/pg_hba.conf

  • Create zeppelin user in Postgres
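A minimal sketch of this step (the password below is only a placeholder; choose your own):
sudo -u postgres psql -c "CREATE USER zeppelin WITH PASSWORD 'zeppelin';"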
  • Setup Sqoop for Postgres
sudo wget https://jdbc.postgresql.org/download/postgresql-9.4.1207.jar -P /usr/hdp/current/sqoop-client/lib
ls -la /usr/hdp/current/sqoop-client/lib/postgres*
  • Pull latest data from github
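Assuming the same GitHub repo that the NiFi template later in this post comes from, the clone step would be:
cd ~
git clone https://github.com/abajwa-hw/single-view-demo.git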
ls single-view-demo
  • Download retail data set
cd /tmp
unzip data.zip
head "/tmp/data/Fact Sales.csv"
  • Import data into Postgres
head -n 16 ~/single-view-demo/contoso-psql.sql
export PGPASSWORD=<postgres password>
psql -U <postgres username> -d <database name> -h localhost -f ~/single-view-demo/contoso-psql.sql
psql -U <postgres username> -d <database name> -h localhost -c "\dt"

2. EDW Optimization

  • Check access to the Postgres tables from Sqoop:
sqoop list-tables --connect jdbc:postgresql://localhost:5432/contoso --username <postgres username> --password <postgres password> -- --schema postgres
  • Make sure Hive is started
  • Load data from Postgres into Hive:
sqoop import-all-tables --username <postgres username> --password <postgres password> --connect jdbc:postgresql://localhost:5432/<database name> --hive-import --direct
  • Create final table in ORC format and populate that final table from staging
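A minimal sketch of this step, assuming the import above created a staging table named factsales (the real table and column names come from the demo notebook):
create table factsales_orc stored as orc as select * from factsales;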
  • Incremental import of data into Hive from RDBMS

Now that we have done the one-time bulk import, next we will set up an incremental Sqoop job. First we create a password file in HDFS containing the zeppelin user's Postgres password. This is done so that invocations of the job can be automated/scheduled without having to manually pass the password.
echo -n "<zeppelin user's postgres password>" > .password

hadoop fs -put .password /user/zeppelin/

rm .password

  • Create an incremental import Sqoop job for the final table (a sketch of the job definition follows this list):
  1. --table: the table the job is for (i.e. the final table)
  2. --password-file: the HDFS location of the password file
  3. --incremental: lastmodified (we want to use last-modified logic to find the delta records)
  4. --check-column: the column used to determine which delta records get picked up
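Putting those options together, a sketch of the job definition (the job, table, column, and directory names below are placeholders; substitute your own):
sqoop job --create import_final_incremental -- import \
  --connect jdbc:postgresql://localhost:5432/<database name> \
  --username <postgres username> \
  --password-file /user/zeppelin/.password \
  --table <final table> \
  --incremental lastmodified \
  --check-column <last-updated timestamp column> \
  --target-dir /tmp/<staging dir>
Run it later (or on a schedule) with: sqoop job --exec import_final_incremental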
  • In Hive, move data from the staging table to the final table (see the sketch after this list):
  1. first remove the records from the final table that are also found in the staging table
  2. copy the data from the staging table to the final table
  3. truncate the staging table
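A sketch of those three steps in HiveQL, assuming a final table named factsales_orc, a staging table named factsales_staging, and a key column named saleskey (all placeholder names):

-- 1. keep only the final-table rows that are NOT also in staging
insert overwrite table factsales_orc
  select f.* from factsales_orc f
  left outer join factsales_staging s on f.saleskey = s.saleskey
  where s.saleskey is null;

-- 2. append the staging data to the final table
insert into table factsales_orc select * from factsales_staging;

-- 3. empty the staging table
truncate table factsales_staging;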

3. Import related tweets into Hive
  • To install Nifi, start the 'Install Wizard': Open Ambari (http://sandbox.hortonworks.com:8080) then:
  • On bottom left -> Actions -> Add service -> check NiFi server -> Next -> Next -> Change any config you like (e.g. install dir, port, setup_prebuilt or values in nifi.properties) -> Next -> Deploy. This will kick off the install which will run for 5-10min.
  • Now open Nifi webui (http://sandbox.hortonworks.com:9090)
  • Download pre built Single_view_demo.xml template to your laptop from https://raw.githubusercontent.com/abajwa-hw/single-view-demo/master/template/Single_view_demo.xml
  • Import the Single_view_demo.xml template into Nifi
  • Fill out env specific configs e.g. your Twitter credentials
  • Do not start the flows yet
  • Twitter requires you to have an account and obtain developer keys by registering an “app”. Follow the steps below to do this:
  1. Create a Twitter account and app to get your consumer key/secret and access token/secret:
  2. Open https://apps.twitter.com
  3. Sign in
  4. Create a new app
  5. Fill in the required fields
  6. Create access tokens
  • Create a script to generate web logs and copy it to a directory accessible by NiFi:

cp /home/zeppelin/single-view-demo/createlog-psql.sh /tmp
cp /home/zeppelin/single-view-demo/DimCustomer.csv /tmp
echo "/tmp/createlog-psql.sh /tmp/data/DimCustomer.csv 20 >> /tmp/webtraffic.log" >> /tmp/generate- weblog.sh
chmod 777 /tmp/createlog-psql.sh
chmod 777 /tmp/generate-weblog.sh
chmod 444 /tmp/DimCustomer.csv
  • Now start the flows in NiFi using the play button to:
  1. simulate and ingest the tweets under /tmp/tweets_staging in HDFS
  2. ingest the weblogs under /tmp/weblog_staging in HDFS
  • Once data has started to flow into the tweets and weblog HDFS locations, let's (a sketch of these steps follows this list):
  1. update permissions on the HDFS dirs
  2. create Hive tables for the tweets and weblogs
  3. check that the Hive tables contain data
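A sketch of those steps, shown here for the weblog data only (the column list and delimiter are assumptions for illustration; the real DDL ships with the demo):

sudo -u hdfs hadoop fs -chmod -R 777 /tmp/tweets_staging /tmp/weblog_staging

create external table if not exists weblog (
  session string,
  visited timestamp,
  product string
)
row format delimited fields terminated by ','
location '/tmp/weblog_staging';

select count(*) from weblog;
A similar external table is created for the tweets over /tmp/tweets_staging.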

4. Analyze tables/columns

  • Now let's compute table-level statistics on the tables we are interested in, to improve query times:

analyze table tweets compute statistics
analyze table weblog compute statistics
analyze table <final table> compute statistics

  • Next let's compute column statistics for the same tables:
analyze table tweets compute statistics for columns
analyze table weblog compute statistics for columns
analyze table <final table> compute statistics for columns

5. Use Hive to correlate the data from different sources

  • Query for the top most-visited products:

select product,count(1) popular from weblog group by product order by popular desc limit 10

  • Query for the top products with the most time spent, by year:

select product, year_visited, sum(time_spent) as total_time
from (
  select table1.product, year(visited) as year_visited, month(visited) as month,
         unix_timestamp(table1.lead_window_0) - unix_timestamp(table1.visited) as time_spent
  from (
    select product, visited, lead(visited) over (PARTITION BY session ORDER BY visited asc)
    from weblog
  ) table1
  where table1.lead_window_0 is not NULL
) table2
group by product, year_visited
order by total_time desc limit 100

  • Query for the most popular web paths:

select web_path, count(1) as path_count
from (
  select session, concat_ws("->", collect_list(product)) as web_path
  from weblog
  group by session
) table1
group by web_path
order by path_count desc


6. Errors faced

  • Resource Manager start error:
While starting the sandbox in VirtualBox, it can get stuck loading the Resource Manager. In that case, log in as root and run the command below:

/usr/lib/hue/tools/start_scripts/start_ambari.sh
  • Zeppelin server start error:
If Ambari is unable to start the Zeppelin service, run the command below (it moves aside any notebook directory containing an empty JSON file, which blocks startup):
BAD_FILE=`find /usr/hdp/current/zeppelin-server/lib/notebook -name "*.json" -size 0` ; mv `dirname $BAD_FILE` /tmp/

  • NullPointerException on Hive tables in Zeppelin:
If your query is %hive select count(*) from table, the query actually sent looks like "elect count(*) from table". This is because the parser looks for a "(prefix_name)" and here mistakes "(*)" for a prefix. In that case use %hive(default) instead of %hive.
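For example, the workaround applied to a table from this demo:
%hive(default)
select count(*) from weblog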


  • To add the Hive user and password in the Zeppelin interpreter settings:

  • First create a Unix user:
ssh root@<server ip address> (e.g. ssh root@127.0.0.1 -p 2222)
adduser <username>
passwd <username>

  • Then create an Ambari user with the same user name

  • Then in the %hive interpreter properties set:
hive.user=<username>
hive.password=<password>

  • Then click the Save button

  • Now the Hive user is configured in the Hive interpreter