I, Me and Myself: August 2014

1. Set up the Hortonworks Sandbox.

2. Install Mahout on the sandbox (yum install mahout).

3. This is an extension of chapter . This chapter shows how to download and set up the Omniture logs, products and user data.

· User table – 38455 rows

· Product table – 31 rows

· Omniture logs – 421266 rows

4. Process the data so that both Product and User have a sequence id (primary key), then apply those keys to the Omniture logs.

Create the Omniture data with both userId and product_id. In this example we hash the swid to create the userId and take the product id from the URL, exposing both in the m_omniture view.

This hack is required because the Mahout input data must have the form UserId, ProductId, Score (relationship strength).

create view m_omniture as

Select

col_2 ts,

col_8 ip,

col_13 url,

substr(split(col_13, '/')[4],3) product_id,

col_14 swid,

positive(hash(col_14)) userId,

col_50 city,

col_51 country,

col_53 state

from omniturelogs;
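To see what the product_id expression in the view does, here is a small shell sketch of the same extraction. The URL is made up and only assumed to follow a site/section/two-letter-prefix-plus-id pattern, and extract_product_id is a hypothetical helper name. Hive arrays are 0-based while awk fields are 1-based, so split(...)[4] corresponds to $5.

```shell
#!/usr/bin/env bash
# Mirror substr(split(url, '/')[4], 3) from the Hive view.
# Index 4 of the 0-based Hive array is the 5th '/'-separated field ($5 in awk).
# substr(..., 3) keeps everything from the 3rd character on, dropping a
# two-letter prefix in front of the numeric id.
extract_product_id() {
  printf '%s\n' "$1" | awk -F'/' '{ print substr($5, 3) }'
}

# Hypothetical URL (an assumption, not taken from the data set).
extract_product_id "http://www.example.com/SH55126545/VD55177927"
```

For a home page URL with no product segment, this sketch prints an empty line, whereas the Hive expression yields NULL; those are the rows removed in step 5.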

Creating Product table with Product Id

create view m_product as

select

substr(split(url, '/')[4],3) product_id,

url url,

category category

from products;

Creating user table with UserId

CREATE TABLE m_user as

SELECT

positive(hash(swid)) userId,

birth_dt bday,

gender_cd gender,

swid sessionId

FROM user;

5. Create the input data for the Mahout algorithm.

Here we group the URLs accessed by each user and use the number of times the user accessed a URL as the score.

We also remove the rows for the home page URL (it has no product ID, so it produces a row with a NULL entry). The final output is 3733 rows.

hive -e 'select userId,product_id,count(*) relation from m_omniture group by userId,product_id' > /tmp/temp.tsv

Note: remove the NULL rows from the file before uploading it (in vi, :g/NULL/d)

hadoop fs -put /tmp/temp.tsv /apps/mahout_input/

hadoop fs -rmr /user/**/temp/*
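The NULL-row cleanup in this step can also be done without opening vi. A minimal sketch, assuming the export is tab-separated with the literal text NULL in place of missing values; drop_null_rows is a hypothetical helper name and the sample rows are invented for illustration.

```shell
#!/usr/bin/env bash
# Print every line of the file that does not contain the literal string NULL,
# the same effect as :g/NULL/d in vi.
drop_null_rows() {
  grep -v 'NULL' "$1"
}

# Made-up sample rows in the userId<TAB>product_id<TAB>relation layout.
sample=$(mktemp)
printf -- '-2144047953\t55173281\t3\n-2144047953\tNULL\t12\n-2142884193\t55177927\t1\n' > "$sample"
drop_null_rows "$sample"
rm -f "$sample"
```

Redirecting the function's output to a new file leaves the original export untouched, which makes it easy to compare row counts before and after.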

6. Run the Mahout algorithm to create the output file. We are using item-based recommendations.

mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /apps/mahout_input/temp.tsv -o /apps/mahout_output --numRecommendations 3

This creates the output file in the mahout_output folder. Each line holds a user id followed by the list of recommendations, for example:

-2144047953 [55173281:28.527079,55175948:28.522099,55156528:28.460249,55170364:27.141182,55149415:26.843468,55173061:26.8081,55165149:26.568054,55166807:26.551744]

-2142884193 [55177927:40.85718,55149415:37.643143,55179070:37.496075,55173281:37.040237,55175948:36.780933,55147564:36.429764,55169229:33.978317,55156528:33.863533]
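For ad hoc inspection it can help to flatten these records into one row per recommendation. A rough sketch, assuming each record sits on a single line with a tab between the user id and the bracketed list (the same layout the Pig script in step 7 reads with PigStorage('\t')); flatten_recommendations is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Turn "userId<TAB>[item:score,item:score,...]" into one
# "userId<TAB>item<TAB>score" row per recommendation.
flatten_recommendations() {
  awk -F'\t' '{
    # Drop the first and last character, i.e. the surrounding brackets.
    list = substr($2, 2, length($2) - 2)
    n = split(list, pairs, ",")
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, ":")
      print $1 "\t" kv[1] "\t" kv[2]
    }
  }'
}

# Shortened version of the first sample record above.
printf -- '-2144047953\t[55173281:28.527079,55175948:28.522099]\n' | flatten_recommendations
```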

7. Load this output into HBase for faster access, and also store the m_product data in HBase.

In the Hive shell:

CREATE TABLE mahout_recommendations (id STRING, c1 STRING)

STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'

TBLPROPERTIES (

'hbase.table.name' = 'mahout_recommendations',

'hbase.columns.mapping' = 'd:c1',

'hcat.hbase.output.bulkMode' = 'true'

);

Create the Pig script (vi pig.txt):

inpt = LOAD 'hdfs://sandbox.hortonworks.com/apps/mahout_output/itemitemout/part-r-00000' USING PigStorage('\t') AS (id:chararray, c1:chararray);

STORE inpt INTO 'mahout_recommendations' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('d:c1');

Execute the script:

pig -x local pig.txt

Load the product data into HBase:

CREATE TABLE product_hbase(product_id string, category string, url string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,cf:category,cf:url");

SET hive.hbase.bulk=true;

INSERT OVERWRITE TABLE product_hbase

SELECT product_id,category,url from m_product where product_id is not null;

8. Analyze the result either by logging in to the HBase shell, or by starting the HBase REST service and querying it with a REST client.

./bin/hbase-daemon.sh start rest -p 8500

curl http://localhost:8500/mahout_recommendations/-2135953226/d

curl http://localhost:8500/product_hbase/55173281/cf
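When JSON is requested (for example by adding -H "Accept: application/json" to the curl calls), the HBase REST service returns row keys, column names and cell values base64-encoded. A minimal sketch of decoding one such value, assuming GNU coreutils base64 is available; decode_cell is a hypothetical helper name, and the sample string is simply the encoded form of one product id from the recommendation output.

```shell
#!/usr/bin/env bash
# Decode a single base64 string as found in the "key", "column" or "$"
# fields of an HBase REST JSON response.
decode_cell() {
  printf '%s' "$1" | base64 -d
}

# "NTUxNzMyODE=" is the base64 form of the product id 55173281.
decode_cell "NTUxNzMyODE="
```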

Wednesday, August 27, 2014

Simple Item-Based Recommendation using Mahout on Hortonworks sandbox
