Wednesday, August 27, 2014

Simple Item-Based Recommendation using Mahout on Hortonworks sandbox

1.       Set up the Hortonworks sandbox
2.       Install Mahout on sandbox ( yum install mahout )
3.       This is an extension of chapter . This chapter shows how to download and set up the Omniture logs, products and user data.
·         User table – 38455 rows
·         Product table – 31 rows
·         Omniture logs – 421266 rows
4.       Process the data so that both Product and User have a sequence id (primary key), then apply those ids to the Omniture logs.
Creating Omniture data with both userId and product_id. In this example we use hash to create the userId and take the product id from the URL, then expose both in the m_omniture view.
This hack is required because the Mahout input data must have the form UserId, ProductId and Score (relationship strength).

create view m_omniture as
select
col_2 ts,
col_8 ip,
col_13 url,
substr(split(col_13, '/')[4],3) product_id,
col_14 swid,
positive(hash(col_14)) userId,
col_50 city,
col_51 country,
col_53 state
from omniturelogs
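
To make the product_id expression concrete, here is a minimal Python sketch of what substr(split(col_13, '/')[4], 3) computes. The sample URLs are an assumption about the shape of the Omniture URLs (two path segments, each with a two-letter prefix) and are not rows taken from the data. Hive's array index is zero-based while substr positions are one-based, and the sketch mirrors both.

```python
def product_id_from_url(url):
    """Mimic Hive's substr(split(url, '/')[4], 3).

    Returns None when the URL has no fifth '/'-separated part, which
    corresponds to the NULL product ids produced for home-page hits.
    """
    parts = url.split("/")      # Hive split(): zero-based array
    if len(parts) <= 4:
        return None             # Hive yields NULL for a missing array element
    return parts[4][2:]         # Hive substr(s, 3): drop the first two characters


# Hypothetical URLs, assumed for illustration only.
print(product_id_from_url("http://www.acme.com/SH55126545/VD55170364"))  # 55170364
print(product_id_from_url("http://www.acme.com"))                        # None
```

The second call shows why home-page rows end up with a NULL product id and have to be filtered out later.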

Creating Product table with Product Id
create view m_product as
select
    substr(split(url, '/')[4],3) product_id,
    url url,
    category category
from products

Creating user table with UserId
CREATE TABLE m_user as
SELECT
    positive(hash(swid)) userId,
    birth_dt bday,
    gender_cd gender,
    swid sessionId
FROM user

5.       Creating the input data for the Mahout algorithm.
Here we group the URLs accessed by each user and use the number of times the user has accessed a URL as the score.
Home page URLs have no product ID and produce rows with a NULL entry, so those rows have to be removed. The final output is 3733 rows.
hive -e 'select userId,product_id,count(*) relation from m_omniture group by userId,product_id' > /tmp/temp.tsv

 hadoop fs -put /tmp/temp.tsv /apps/mahout_input/

 hadoop fs -rmr /user/**/temp/*

Note: remove the NULL rows from the file before uploading it (in vi, :g/NULL/d)
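
The grouping and NULL removal in this step can also be sketched in Python, which may help when checking the exported file by hand. This is only an illustration under assumptions: the function name build_mahout_rows and the sample events are hypothetical, and the real work is done by the Hive query above.

```python
from collections import Counter


def build_mahout_rows(events):
    """Count visits per (user_id, product_id), skipping missing product ids.

    `events` is an iterable of (user_id, product_id) pairs; product_id may be
    None or the literal string "NULL" (as Hive prints it).
    """
    counts = Counter(
        (user_id, product_id)
        for user_id, product_id in events
        if product_id not in (None, "NULL")
    )
    # One tab-separated line per pair: userId, product_id, visit count.
    return [f"{u}\t{p}\t{n}" for (u, p), n in sorted(counts.items())]


# Hypothetical click events, for illustration only.
sample = [(101, "55170364"), (101, "55170364"), (101, None), (202, "55173281")]
for line in build_mahout_rows(sample):
    print(line)
```

Each printed line has the userId, product_id, score layout that the next step feeds to Mahout.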

6.       Run the Mahout algorithm to create the output file. We are using item-based recommendations.

mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /apps/mahout_input/temp.tsv -o /apps/mahout_output --numRecommendations 3
This creates the output file in the mahout_output folder. Each line holds a user id and the list of recommendations, for example:

-2144047953         [55173281:28.527079,55175948:28.522099,55156528:28.460249,55170364:27.141182,
-2142884193       [55177927:40.85718,55149415:37.643143,55179070:37.496075,55173281:37.040237,
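
For working with these lines outside Hadoop, a small parser is enough. The sketch below assumes each complete line has the form userId, whitespace, then a bracketed list of item:score pairs, as in the (truncated) sample above; the function name is hypothetical, and the test line is a closed version of the first sample line.

```python
def parse_recommendation_line(line):
    """Split 'userId [item:score,item:score,...]' into (user_id, pairs)."""
    user_id, items = line.strip().split(None, 1)   # split on the first run of whitespace
    pairs = []
    for token in items.strip("[]").split(","):
        if not token:
            continue                               # tolerate a trailing comma
        item_id, _, score = token.partition(":")
        pairs.append((int(item_id), float(score)))
    return int(user_id), pairs


user, recs = parse_recommendation_line(
    "-2144047953\t[55173281:28.527079,55175948:28.522099]"
)
print(user, recs)
```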

7.       Load this output into HBase for faster access, and also store the m_product table in HBase.
In the Hive shell:
CREATE TABLE mahout_recommendations (id STRING, c1 STRING)
STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'
TBLPROPERTIES (
  '' = 'mahout_recommendations',
  'hbase.columns.mapping' = 'd:c1',
  'hcat.hbase.output.bulkMode' = 'true'
);

vi pig.txt
inpt = LOAD 'hdfs://' USING PigStorage('\t') AS (id:chararray, c1:chararray);
STORE inpt INTO 'mahout_recommendations' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('d:c1');

pig -x local pig.txt

Load the product data into HBase
CREATE TABLE product_hbase(product_id string, category string, url string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,cf:category,cf:url");

SET hive.hbase.bulk=true;

INSERT OVERWRITE TABLE product_hbase
SELECT product_id,category,url from m_product where product_id is not null;

8.       Analyze the result either by logging in to the HBase shell or by starting the HBase REST service and accessing it with a REST client.

./bin/ start rest -p 8500

curl http://localhost:8500/mahout_recommendations/-2135953226/d
curl http://localhost:8500/product_hbase/-2135953226/cf
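
If the REST service is asked for JSON (for example by sending an Accept: application/json header), the row key, column names and cell values come back base64-encoded. The following is a minimal Python sketch for decoding such a reply. It assumes the usual HBase REST JSON layout of a "Row" list whose entries carry a "Cell" list with "column" and "$" fields; the sample reply is constructed by hand and is not output captured from the sandbox.

```python
import base64
import json


def decode_hbase_row(json_text):
    """Return {column: value} for the first row of an HBase REST JSON reply."""
    row = json.loads(json_text)["Row"][0]

    def decode(text):
        return base64.b64decode(text).decode("utf-8")

    return {decode(cell["column"]): decode(cell["$"]) for cell in row["Cell"]}


def b64(text):
    """Helper used only to build the hypothetical sample reply below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hand-built reply, assumed for illustration only.
sample_reply = json.dumps({
    "Row": [{
        "key": b64("-2135953226"),
        "Cell": [{"column": b64("d:c1"), "timestamp": 0,
                  "$": b64("[55173281:28.527079]")}],
    }]
})
print(decode_hbase_row(sample_reply))
```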


Unknown said...

hadoop jar mahout-core- --input /apps/mahout_input/temp.tsv --output /apps/mahout_output --similarityClassname SIMILARITY_LOGLIKELIHOOD --maxSimilaritiesPerItem 5

For item-item based recommendations
