1. Set up the Hortonworks Sandbox.
2. Install Mahout on the sandbox (yum install mahout).
3. This is an extension of the earlier chapter that shows how to download and set up the Omniture logs, products and user data.
· User table – 38455 rows
· Product table – 31 rows
· Omniture logs – 421266 rows
4. Process the data so that both Product and User have a sequence ID or primary key, then apply those keys to the Omniture logs.
Creating the Omniture data with both userId and product_id. In this example we hash the SWID (col_14) to create the userId and extract the product ID from the URL; both are exposed in the m_omniture view.
This hack is required because the Mahout input data must be a relation of user ID, product ID and score (relationship strength).
create view m_omniture as
select
  col_2 ts,
  col_8 ip,
  col_13 url,
  substr(split(col_13, '/')[4], 3) product_id,
  col_14 swid,
  positive(hash(col_14)) userId,
  col_50 city,
  col_51 country,
  col_53 state
from omniturelogs;
Creating the product view (m_product) with a product ID:
create view m_product as
select
  substr(split(url, '/')[4], 3) product_id,
  url url,
  category category
from products;
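Both views derive product_id from the URL with substr(split(url, '/')[4], 3). The following is a minimal Python sketch of what that expression does, assuming the product code is the fifth "/"-separated piece of the URL and carries a two-character prefix. The function name and the sample URLs are hypothetical and only illustrate the shape of the data; edge cases such as trailing slashes may behave differently in Hive.

```python
from typing import Optional


def extract_product_id(url: str) -> Optional[str]:
    """Approximate Hive's substr(split(url, '/')[4], 3) in Python."""
    parts = url.split("/")
    # Hive array indexes are 0-based; an out-of-range index yields NULL.
    if len(parts) <= 4:
        return None
    # Hive substr positions are 1-based, so starting at 3 skips two characters.
    return parts[4][2:]


# Hypothetical URLs, not taken from the data set.
print(extract_product_id("http://www.acme.com/SH55126545/VD55170364"))
print(extract_product_id("http://www.acme.com"))
```

The second call returns None, which corresponds to the NULL product_id that home page URLs produce in the m_omniture view.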
Creating the user table with a userId:
CREATE TABLE m_user as
SELECT
  positive(hash(swid)) userId,
  birth_dt bday,
  gender_cd gender,
  swid sessionId
FROM user;
5. Create the input data for the Mahout algorithm.
Here we group the URL accesses by user and product, and use the number of times a user has accessed a product URL as the score.
We also remove the home page URL, which has no product ID and would otherwise produce a row with a NULL entry. The final output is 3733 rows.
hive -e 'select userId, product_id, count(*) relation from m_omniture group by userId, product_id' > /tmp/temp.tsv
hadoop fs -put /tmp/temp.tsv /apps/mahout_input/
hadoop fs -rmr /user/**/temp/*
Note: remove the NULL rows from temp.tsv before uploading it (in vi, :g/NULL/d).
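The NULL cleanup can also be scripted instead of done by hand in vi. The sketch below is one possible way to do it, assuming the file is tab-separated as written by hive -e. The function name and the sample rows are hypothetical. Unlike :g/NULL/d, which deletes any line containing the text NULL, this version only drops rows where a whole field equals NULL.

```python
from typing import Iterable, List


def drop_null_rows(lines: Iterable[str]) -> List[str]:
    """Keep only tab-separated rows in which no field is the literal NULL."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if "NULL" in fields:
            continue  # e.g. a home page hit with no product ID
        kept.append(line)
    return kept


# Hypothetical rows: userId, product_id, count.
sample = [
    "-2144047953\t55173281\t4\n",
    "-2144047953\tNULL\t12\n",
]
print(drop_null_rows(sample))
```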
6. Run the Mahout algorithm to create the output file. We are using item-based recommendations.
mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /apps/mahout_input/temp.tsv -o /apps/mahout_output --numRecommendations 3
This creates the output file in the mahout_output folder. Each line holds a user ID followed by the list of recommended product IDs with their scores, for example:
-2144047953 [55173281:28.527079,55175948:28.522099,55156528:28.460249,55170364:27.141182,55149415:26.843468,55173061:26.8081,55165149:26.568054,55166807:26.551744]
-2142884193 [55177927:40.85718,55149415:37.643143,55179070:37.496075,55173281:37.040237,55175948:36.780933,55147564:36.429764,55169229:33.978317,55156528:33.863533]
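To use these lines outside Mahout, each one has to be split into a user ID and a list of (product ID, score) pairs. The following is a minimal parsing sketch, assuming the user ID and the bracketed list are separated by a tab in the actual file. The function name is hypothetical, and the sample line is shortened from the output shown above.

```python
from typing import List, Tuple


def parse_recommendation(line: str) -> Tuple[str, List[Tuple[str, float]]]:
    """Split 'userId<TAB>[item:score,item:score,...]' into its parts."""
    user_id, raw = line.rstrip("\n").split("\t", 1)
    items = []
    for pair in raw.strip().strip("[]").split(","):
        if not pair:
            continue  # an empty list "[]" yields no recommendations
        product_id, score = pair.split(":")
        items.append((product_id, float(score)))
    return user_id, items


line = "-2144047953\t[55173281:28.527079,55175948:28.522099]"
user_id, items = parse_recommendation(line)
print(user_id, items)
```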
7. Dump this output into HBase for faster access, and also store the m_product table in HBase.
On the Hive shell:
CREATE TABLE mahout_recommendations (id STRING, c1 STRING)
STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'
TBLPROPERTIES (
  'hbase.table.name' = 'mahout_recommendations',
  'hbase.columns.mapping' = 'd:c1',
  'hcat.hbase.output.bulkMode' = 'true'
);
Create pig.txt (vi pig.txt) with the following script:
inpt = LOAD 'hdfs://sandbox.hortonworks.com/apps/mahout_output/itemitemout/part-r-00000' USING PigStorage('\t') AS (id:chararray, c1:chararray);
STORE inpt INTO 'mahout_recommendations' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:c1');
Then execute it:
pig -x local pig.txt
Load the product data into HBase:
CREATE TABLE product_hbase(product_id string, category string, url string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,cf:category,cf:url");
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE product_hbase
SELECT product_id, category, url from m_product where product_id is not null;
8. Analyze the result either by logging in to the HBase shell, or by starting the HBase REST service and accessing it with a REST client.
./bin/hbase-daemon.sh start rest -p 8500
curl http://localhost:8500/mahout_recommendations/-2135953226/d
curl http://localhost:8500/product_hbase/-2135953226/cf
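A REST client usually has to decode what the service returns. The sketch below assumes the response was requested as JSON (for example with an "Accept: application/json" header) and that row keys, column names and values arrive base64-encoded under "Row", "Cell" and "$". Check that layout against your HBase version. The function name and the sample payload are hypothetical; the payload is built by hand so the example runs without a live server.

```python
import base64
import json
from typing import Dict


def decode_row(payload: str) -> Dict[str, Dict[str, str]]:
    """Decode a JSON row response into {row_key: {column: value}}."""
    doc = json.loads(payload)
    rows = {}
    for row in doc.get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            cells[column] = base64.b64decode(cell["$"]).decode("utf-8")
        rows[key] = cells
    return rows


def _b64(text: str) -> str:
    """Helper used only to build the hand-made sample below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hand-made sample in the assumed response layout, not real server output.
sample = json.dumps({
    "Row": [{
        "key": _b64("-2135953226"),
        "Cell": [{"column": _b64("d:c1"), "timestamp": 0,
                  "$": _b64("[55173281:28.527079]")}],
    }]
})
print(decode_row(sample))
```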