Given a bunch of reviews from Amazon, can we infer what topics these reviews might be about? There are a number of ways to go about this, but in this post we look at how topic models can be applied to the task. Why should anyone care? Personally, I believe large companies with a reasonable social media presence could replace their annoying surveys and questionnaires with the kind of analysis I show you in this post.
Topic Modelling is a process by which abstract topics/themes are extracted from a collection of documents. This process is usually carried out with the aid of topic models, a suite of algorithms used for topic modelling.
Latent Dirichlet Allocation (LDA) is a generative, probabilistic model that can be used to automatically group words into topics and represent documents as mixtures of topics. It is built on the assumption that each document contains one or more topics, as documents in natural language often do. For a nice, intuitive and more detailed explanation of LDA (and topic modelling), please see this amazing blog post by Edwin Chen.
Topic models have been applied to a wide variety of problems. One example is disaster-related data from Twitter, analysed in an effort to determine what topics were discussed within the time span of a natural disaster. Another is in computer vision, where topic models were used to analyse videos of complex and crowded scenes in order to discover regularities in them. Such models can answer questions like: what were the most interesting things that happened in the last 24 hours of saved footage?
Dataset: Amazon Reviews
For this experiment, we make use of the Amazon Reviews Dataset from the Stanford SNAP database. The dataset is separated into categories; for this study, we use the Electronics category, which contains 1,241,778 reviews of electronics products on Amazon.
Before we start mining topics from our data, we have to tokenise it and transform it into a bag of words. First, we define a token as a unigram; then we discard tokens that are stop words. Stop words must be removed because they occur frequently across reviews, biasing the model towards them. Examples of such words include the, and, about, for and she. To take it a step further, we use the tf-idf weighting scheme to weigh down unimportant words and boost important ones. Essentially, tf-idf measures how important each word/token is to a document in a corpus.
In summary, we:
- Tokenise the data into unigrams
- Remove stop words from the dataset
- Transform the unigrams to vector form by using the bag of words model
- Apply tf-idf weighting transformation.
Running LDA on the Data
There are a number of parameters that require tuning when training our model(s), but I will skip that and focus on the results. Please leave a comment below or drop me an email if you want to know what my settings are.
| Topic # | Topic-Words Distribution |
| --- | --- |
| 0 | tom, pink, subs, theatre, oil, shower, wildlife, microphones, vocals, humidifier |
| 1 | speakers, bass, speaker, subwoofer, sub, sound, grill, surround, bose, woofer |
| 2 | timely, cybershot, manner, 770sw, audrey, refunded, yr, arrived, abuse, yrs |
| 3 | radio, ipod, player, music, mp3, car, stereo, fm, sound, volume |
| 4 | casio, telescope, zi8, region, celestron, 20d, eyepiece, pilot, excellant, mountains |
| 5 | dvd, tv, hd, hdmi, vcr, remote, hdtv, tapes, player, dvds |
| 6 | surge, etrex, u3, protector, hub, 5mm, smc, blade, outlets, coasters |
| 7 | kingston, radar, detector, compass, microsd, wouldnt, teleconverter, modulator |
| 8 | pentax, kindle, shredder, knife, ps2, labeler, leica, mics, roomy, merchandise |
| 9 | mouse, keyboard, keys, keyboards, logitech, scroll, wheel, wrist, cordless, key |
| 10 | lens, camera, zoom, shots, pictures, nikon, canon, focus, cameras, tripod |
| 11 | printer, ink, cartridge, cartridges, print, paper, printing, epson, hp, scanner |
| 12 | great, good, product, use, camera, just, works, price, quality, easy |
| 13 | garmin, gps, maps, map, projector, nuvi, windshield, visor, route, palm |
| 14 | ipad, spindle, dragon, maxtor, topo, griffin, clocks, backing, laminating, cat6 |
| 15 | headphones, ear, ears, headset, sound, comfortable, bass, pair, noise, buds |
| 16 | filter, filters, targus, lens, tomtom, xt, uv, polarizer, hoya, xpad |
| 17 | sirius, mount, keypad, gifts, golf, bolts, wall, tighten, d50, mounting |
| 18 | radio, antenna, stations, reception, suction, radios, fm, channels, xm, weather |
| 19 | bag, strap, case, tripod, backpack, shoulder, zipper, pockets, carry, sigma |
Analysing the generated topics
We now take an in-depth look at some of the topics generated by the model. First, we look at the word distribution for a topic and make assumptions about what we expect the reviews to look like. Next, we look at some reviews belonging to that topic and check whether our initial assumptions were accurate. Generally, we do not expect all topics generated by topic models to be comprehensible, so we only analyse a few of them.
speakers, bass, speaker, subwoofer, sub, sound, grill, surround, bose, woofer
From the above distribution, we can infer that the topic described by these words most likely relates to sound. Words like speaker, subwoofer and bass all refer to attributes of sound systems. The word grill seems odd, but we can ignore it since the words are ordered by importance: words at the end of the list are less important than words at the beginning. Extracts from reviews with a sufficient proportion of this topic include:
- I use these for my side and rear surround speakers...
- The description on this speaker is WRONG! I contacted Amazon and they said they were going to fix it but they never did...
- This is the best sound for a tower speaker,i pair them with...
- I purchased the L830 speakers because they match my JBL system. Great sound from JBL...
- These speakers sound great. Good highs and lows. Rich, warm sound that goes right through you...
From the extracts above, we can see that our initial assumption about the reviews under this topic was correct. Most of the reviews with a proportion of this topic have a sound-system theme. This means our model is able to extract reviews related to sound systems from the Electronics category. Yaay!
radio, ipod, player, music, mp3, car, stereo, fm, sound, volume
We can deduce from the distribution above that these words refer to music players: each word either names a music player or describes a feature of one. We would expect reviews belonging to this topic to follow the same logic. While perhaps not obvious, note that a specific brand or type of music player can be regarded as a feature when thinking in the context of all music players. Extracts from reviews with a sufficient proportion of this topic include:
- I am very, very pleased with this product. Everyone I know has had a hard time finding a FM transmitter that works and sounds good.
- This thing will work on one station decently until you take a turn on the highway (or go under power lines) and it goes to static
- For what I paid for this MP3 player, it does exactly what I need it to. Small, cute, and easy to use. Radio tuner is not so hot, but playing tunes works great
- This FM transmitter works. Its signal is weak, so volume on MP3 player AND on car radio both have to be pretty high
- I was looking for an inexpensive MP3 player that accepted flash memory expansion cards to do the following: Listen to audiobooks
We can see from the extracts above that our assumptions were, to an extent, accurate. Interestingly, the reviews either talk about a music player itself or some attribute of the music player like the memory and radio facilities provided.
mouse, keyboard, keys, keyboards, logitech, scroll, wheel, wrist, cordless, key
Here we have words that describe a mouse and keyboard as well as some of their features. For instance, words like cordless and scroll aren't physical products like the keyboard and mouse but features of them, while logitech refers to a company that produces and sells these products. Extracts from reviews with a sufficient proportion of this topic include:
- This keyboard has the double-curve and light touch that reduce wrist pain. I felt the pain quickly on a normal keyboard...
- I'm using an Iogear GSC102U KVM/USB switch to use this keyboard/mouse combo with both my PC and Mac. It works perfectly with both...
- EasyCall Desktop Keyboard/Mouse Combo. I was more than a little surprised when I saw this here on Amazon, and read that it was first available in 2006...
- A new optical version of this mouse is available...
As expected, the extracts from reviews belonging to this topic validate our assumptions. We have reviews that talk about keyboards or mice, and reviews that talk about attributes of those products.
printer, ink, cartridge, cartridges, print, paper, printing, epson, hp, scanner
The word distribution above has a clear printing theme: most of the words are features or parts of a printer. Reviews belonging to this topic should follow the same direction and be mostly about printing. Extracts from reviews with a sufficient proportion of this topic include:
- Oh my, where do I start, have had this awful all in one for 4 years and I HATE IT.Paper jams galore, sucks in final sheet and then jams it...
- I would not recommend Epson printers. I have an r200 and it frequently runs out of ink when the ink cartridges are not empty...
- My Brother HL-1440 started printing garbage pages after six months of use. I decided to get a new one. This fit the bill.
- My used HP 4550 Color Laserjet was purchased from a local computer dealer and I rate the deal as four-star. I'm rating the product on Amazon as three-stars since...
- I bought a Canon MG5320 printer in September. I have several other brands of photo glossy paper on hand. I printed photos with...
As postulated, the extracts above show that reviews belonging to this topic do in fact talk about printing. One unexpected oddity is that nearly all the Spanish reviews are also clustered under this topic. I translated a number of them hoping they would all turn out to be about printers but, unfortunately, this was not the case.
bag, strap, case, tripod, backpack, shoulder, zipper, pockets, carry, sigma
Let us consider one final example. The distribution of words above seems to refer to accessories. Words such as "bag" and "case" refer to the accessories themselves, "tripod" refers to one of the products they are designed for, "zipper" refers to a feature of the carrying accessory, and so on. Extracts from reviews with a sufficient proportion of this topic include:
- HAKUBA USA INC PSTC100 Tripod Case. The price made me skeptical, in spite of the great product reviews but this is a GREAT tripod bag...
- Reviewers ZTT Fan and N. Barilka have complained that this bag is too small to accommodate their tripod/head combinations
- Bags are tools... so select the right bag for the job. This bag was designed to do three things and I think it does them very well
- Great bag. Fits my Nikon D40 with 18-200 lens attached, Tokina 24-12 lens, SS400 flash, cleaning kit pouch...
- The McKleinUSA DAMEN 80715 Black 17 Detachable-Wheeled Laptop Case is a great buy
Once again, our guess was very much correct: the reviews above talk about the cases and bags used to carry and store electronic products.
The entire system was written in Python. For topic modelling, I used gensim, which describes itself as "topic modelling for humans". It provides modules for Latent Dirichlet Allocation (LDA) and other topic models, as well as support for transforming documents to vectors, so I also used it for the pre-processing tasks (bag-of-words and tf-idf weighting). The code includes a README on how to train new models. I have a few trained models; if you are interested in trying them out, please let me know, perhaps by leaving a comment below.
We were able to take over a million reviews and detect the topics discussed in that category. We can of course take this a step further and run topic modelling on sub-categories. For instance, running a topic model on the printing-related reviews above might yield even more fine-grained topics about that product. To take it yet another step further, we could run sentiment analysis (angry, happy or neutral) on all reviews and compute topic-based sentiments. That would let us answer questions like "What are buyers saying about our printers, and what is the general sentiment?". Useful, right? I will leave these extensions to the reader.
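One way to combine the two signals, sketched here purely as an illustration: weight each review's sentiment score by its proportion of a topic and average per topic. The topic labels, proportions and scores below are made up, and a real sentiment classifier would replace the hard-coded scores:

```python
from collections import defaultdict

# Hypothetical (topic-proportions, sentiment score) pairs per review;
# scores range from -1 (angry) to 1 (happy)
reviews = [
    ({"printers": 0.9, "accessories": 0.1}, -0.8),  # an angry printer review
    ({"printers": 0.7, "accessories": 0.3},  0.4),
    ({"accessories": 1.0},                   0.9),
]

# Proportion-weighted average sentiment per topic
weighted = defaultdict(float)
weight_sum = defaultdict(float)
for topics, score in reviews:
    for topic, proportion in topics.items():
        weighted[topic] += proportion * score
        weight_sum[topic] += proportion

topic_sentiment = {t: weighted[t] / weight_sum[t] for t in weighted}
print(topic_sentiment)  # printers lean negative, accessories positive
```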
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
Kireyev, Kirill, Leysia Palen, and Kenneth Anderson. "Applications of Topics Models to Analysis of Disaster-Related Twitter Data." NIPS Workshop on Applications for Topic Models: Text and Beyond. Vol. 1. 2009.
Hospedales, Timothy, Shaogang Gong, and Tao Xiang. "A Markov Clustering Topic Model for Mining Behaviour in Video." IEEE 12th International Conference on Computer Vision. IEEE, 2009.
McAuley, Julian, and Jure Leskovec. "Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text." Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 2013.
Rehurek, Radim, and Petr Sojka. "Software Framework for Topic Modelling with Large Corpora." 2010.