2. Install Mahout on the sandbox (yum install mahout).
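A quick check that the install worked (this assumes yum put the mahout driver script on the PATH):

rpm -q mahout
which mahout
# running the driver with no arguments lists the available programs,
# including the recommenditembased job used later in this chapter
mahout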
3. This is an extension of an earlier chapter, which shows how to download and set up the Omniture logs, product, and user data. The row counts below can be re-checked with the quick queries after the list:
· User table – 38455 rows
· Product table – 31 rows
· Omniture logs – 421266 rows
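Once the tables are loaded, the counts above can be verified from the shell (table names user, products, and omniturelogs are the ones used throughout this chapter):

hive -e 'select count(*) from user'
hive -e 'select count(*) from products'
hive -e 'select count(*) from omniturelogs'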
4. Process the data so that both Product and User carry a sequence id (primary key), and then apply those ids to the Omniture logs, producing Omniture data with both a userId and a product_id. In this example we hash the visitor id (swid) to create the userId, and parse the product id out of the URL, to build the m_omniture view. This hack is required because Mahout's input data must express a relation of userId, productId, and score (relationship strength).
create view m_omniture as
select
  col_2 ts,
  col_8 ip,
  col_13 url,
  substr(split(col_13, '/')[4], 3) product_id,
  col_14 swid,
  positive(hash(col_14)) userId,
  col_50 city,
  col_51 country,
  col_53 state
from omniturelogs;
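To see what the product_id expression extracts, here is a quick check against a made-up URL (only the shape is assumed: the product id sits in the fifth path segment behind a two-character type prefix):

# split by '/' yields ["http:", "", "www.acme.com", "SH55126545", "VD55173281"];
# element [4] is "VD55173281", and substr(..., 3) drops the two-letter prefix,
# so this prints 55173281
hive -e "select substr(split('http://www.acme.com/SH55126545/VD55173281', '/')[4], 3) from omniturelogs limit 1"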
Create the product view (m_product) with the extracted product id:
create view m_product as
select
  substr(split(url, '/')[4], 3) product_id,
  url url,
  category category
from products;
Create the user table (m_user) with the hashed userId:
CREATE TABLE m_user as
SELECT
  positive(hash(swid)) userId,
  birth_dt bday,
  gender_cd gender,
  swid sessionId
FROM user;
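Since the same positive(hash(swid)) expression drives both m_omniture and m_user, a quick join count is a useful sanity check that the keys line up:

# count the Omniture rows whose hashed userId resolves to a row in m_user
hive -e 'select count(*) from m_omniture o join m_user u on (o.userId = u.userId)'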
5. Create the input data for the Mahout algorithm. Here we group the Omniture records by user and product, and use the number of times the user accessed each product URL as the score. We also drop hits on the home page (they have no product id and would create rows with NULL entries). The final output is 3733 rows.
hive -e 'select userId, product_id, count(*) relation from m_omniture group by userId, product_id' > /tmp/temp.tsv
hadoop fs -put /tmp/temp.tsv /apps/mahout_input/
hadoop fs -rmr /user/**/temp/*

Note: remove the NULL rows from /tmp/temp.tsv before the put (in vi: :g/NULL/d).
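Alternatively, the NULL rows can be filtered out at export time instead of hand-editing the file (the same step, with the filter pushed into the query):

# home-page hits have no product id; drop them before the export
hive -e 'select userId, product_id, count(*) relation from m_omniture where product_id is not null group by userId, product_id' > /tmp/temp.tsv
hadoop fs -put /tmp/temp.tsv /apps/mahout_input/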
6. Run the Mahout algorithm to create the output file. We are using item-based recommendations.
mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /apps/mahout_input/temp.tsv -o /apps/mahout_output --numRecommendations 3
This will create the output file in the mahout_output folder, for example a user id followed by its list of recommended product ids with scores:
-2144047953  [55173281:28.527079,55175948:28.522099,55156528:28.460249,55170364:27.141182,55149415:26.843468,55173061:26.8081,55165149:26.568054,55166807:26.551744]
-2142884193  [55177927:40.85718,55149415:37.643143,55179070:37.496075,55173281:37.040237,55175948:36.780933,55147564:36.429764,55169229:33.978317,55156528:33.863533]
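The raw recommendations can be inspected straight from HDFS before loading them anywhere (the part-r-00000 file name assumes a single reducer wrote the output):

hadoop fs -ls /apps/mahout_output
hadoop fs -cat /apps/mahout_output/part-r-00000 | head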
7. Dump this output into HBase for faster access, and also store the m_product table in HBase.
On the hive shell:

CREATE TABLE mahout_recommendations (id STRING, c1 STRING)
STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'
TBLPROPERTIES (
  'hbase.table.name' = 'mahout_recommendations',
  'hbase.columns.mapping' = 'd:c1',
  'hcat.hbase.output.bulkMode' = 'true'
);

Create the Pig load script (vi pig.txt):

inpt = LOAD 'hdfs://sandbox.hortonworks.com/apps/mahout_output/itemitemout/part-r-00000' USING PigStorage('\t') AS (id:chararray, c1:chararray);
STORE inpt INTO 'mahout_recommendations' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:c1');

Execute:

pig -x local pig.txt
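After the Pig job completes, a quick scan from the HBase shell confirms the rows landed (the LIMIT keeps the output short):

echo "scan 'mahout_recommendations', {LIMIT => 2}" | hbase shell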
Load the product data into HBase:
CREATE TABLE product_hbase (product_id string, category string, url string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:category,cf:url");

SET hive.hbase.bulk=true;

INSERT OVERWRITE TABLE product_hbase
SELECT product_id, category, url FROM m_product WHERE product_id IS NOT NULL;
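As with the recommendations, a single row can be fetched back to verify the column mapping (55173281 is one of the product ids seen in the Mahout output above):

echo "get 'product_hbase', '55173281'" | hbase shell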
8. Analyze the result either by logging into the HBase shell, or by starting the HBase REST service and querying it with a REST client.
./bin/hbase-daemon.sh start rest -p 8500

curl http://localhost:8500/mahout_recommendations/-2135953226/d
curl http://localhost:8500/product_hbase/55173281/cf
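By default the REST gateway answers in XML with base64-encoded cell values; requesting JSON can make the response easier to post-process (a sketch using the same user id as above):

curl -H "Accept: application/json" http://localhost:8500/mahout_recommendations/-2135953226/d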
 
 
Note: for item-item based recommendations, the underlying similarity job can also be run directly:

hadoop jar mahout-core-0.9.0.2.1.1.0-385-job.jar org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input /apps/mahout_input/temp.tsv --output /apps/mahout_output --similarityClassname SIMILARITY_LOGLIKELIHOOD --maxSimilaritiesPerItem 5

Reference: http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/