Reddit comment sentiment analysis

Introduction

This page is the result of a small 10%-time project I did at Connexity.com.
Connexity is an e-commerce firm that connects buyers and sellers across the internet and supports marketing efforts on a global scale.
This page is not an official statement of Connexity, but rather the result of me tinkering with interesting technologies and data with the goal of learning new stuff and doing something that could be useful for work further down the road. Time allotted for this project was 5 days.

Background

A current trend in marketing analytics is looking at brands in the context of social media. What are internet users thinking (and writing) about a brand? In the context of reddit: do they express positive or negative sentiments in their comments?
Since I was asked "What is reddit?", here is what Wikipedia says:

Reddit is a social media, social news aggregation, web content rating, and discussion website. Reddit's registered community members can submit content, such as text posts or direct links. Registered users can then vote submissions up or down to organize the posts and determine their position on the site's pages. The submissions with the most positive votes appear on the front page or the top of a category. Content entries are organized by areas of interest called "subreddits". The subreddit topics include news, science, gaming, movies, music, books, fitness, food, and image-sharing, among many others.
As of 2016, Reddit had 542 million monthly visitors (234 million unique users), ranking #11 most visited web-site in US and #25 in the world.
Across 2015, Reddit saw 82.54 billion pageviews, 73.15 million submissions, 725.85 million comments, and 6.89 billion upvotes from its users.

The analysis uses one month (2015-10) of the reddit comment dataset, which is 5.5 GByte compressed and contains 56'026'955 valid comments.

Sentiment is calculated via a set of positive and negative word combinations selected specifically for social media. This provides a rough estimate of the comment author's opinion, although it will not detect sarcasm.
Example of negative sentiment:

Have you seen the VW scandal how they were cheating on the diesel emission tests and the government banned them from selling diesels until they fix the problem.
(source: http://www.reddit.com/r/cars/comments/3mnw60/cvk887t , score: -0.9042)
Example of positive sentiment:
Not OP but I like my 33x12.50 Duratracs pretty well, they're definitely on par with most other ATs, plus they're great in snow and ice
(source: http://www.reddit.com/r/Jeep/comments/3p07hx/cw2ilkn , score: 0.957)
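
To give a feel for how a word-list approach produces scores like the ones above, here is a minimal sketch. It is not the scorer used for this page: the tiny `LEXICON`, the valence numbers and the `alpha` constant are all made up for illustration, and the squashing formula is just one common way to map a sum into the range -1 to 1.

```python
import math
import re

# Hypothetical mini-lexicon (word -> valence). The entries and numbers are
# invented for this sketch; a real social-media lexicon has thousands of them.
LEXICON = {
    "like": 1.5,
    "great": 3.1,
    "well": 1.1,
    "scandal": -1.9,
    "cheating": -2.6,
    "banned": -2.0,
    "problem": -1.7,
}


def sentiment_score(text, alpha=15.0):
    """Sum the valences of known words and squash the sum into (-1, 1).

    alpha is an arbitrary damping constant (an assumption of this sketch).
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = sum(LEXICON.get(word, 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)


negative = sentiment_score("they were cheating, what a scandal")
positive = sentiment_score("they're great in snow and ice")
sarcastic = sentiment_score("oh great, another recall")
```

With this toy lexicon, `negative` comes out below zero and `positive` above zero. The `sarcastic` example also scores positive, which is exactly the blind spot mentioned above: the scorer only sees the word "great".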

Car makers

This section examines subreddits dedicated to brands, using a selection of car manufacturers as examples.

Caveats

Some car makers have names that also appear in other contexts. Sometimes a ford is just "a shallow crossing on a river", other times it's someone talking about a famous fictional character from The Hitchhiker's Guide to the Galaxy (which may contribute to a higher ranking for ford).

The data is from a single month in 2015, and some brands only have a couple of thousand mentions, which may not be enough to prevent irregular uses of a brand name from dominating the results. If, for example, 2015-10 were the month when a funny reddit meme about honda was making the rounds, you could get a hundred humorous mentions that do not reflect the users' true sentiment. (Although using a brand name in a funny meme would probably reflect positively on the brand - for a short time at least).

Comment count per subreddit

Some brands have much more active communities than others. This may be influenced by reddit's demographic (predominantly men under 40) and by the total market share of the given brand in the US.

    
SELECT
    subreddit,
    count(*) AS value
FROM comments
WHERE subreddit IN (SELECT name
                    FROM subreddits
                    WHERE name_condensed IN
                          ('volkswagen', 'subaru', 'jeep', 'bmw', 'mazda', 'teslamotors', 'honda', 'audi', 'volvo', 'ford', 'toyota'))
GROUP BY subreddit
ORDER BY subreddit ASC;
    

Average score

Score is, generally speaking, the number of upvotes a comment has received, but at the high end (> 5000 points) the value is capped or fuzzed to discourage bots and cheating.

The average score over all comments from that month is 5.9590, but you have to consider that useful, funny or ironic comments receive many more upvotes than some guy posting "I have a car and it looks nice, look at this pic". Your car will almost never gain as many internet karma points as a grumpy cat riding a roomba.

If you compare the total number of comments vs the average score, you can see that although the Subaru subreddit has 27% more comments than the Tesla forum, the average score is about 30% lower. Perhaps Tesla fans are more emotionally invested in their brand and feel the need to share their enthusiasm through upvotes. But that's just speculation. Looking below at the average sentiment scores, it looks like the Tesla commenters are not a happy group.

    
SELECT
    subreddit,
    avg(score) :: NUMERIC(7, 4) AS value
FROM comments
WHERE subreddit IN (SELECT name
                    FROM subreddits
                    WHERE name_condensed IN
                          ('volkswagen', 'subaru', 'jeep', 'bmw', 'mazda', 'teslamotors', 'honda', 'audi', 'volvo', 'ford', 'toyota'))
GROUP BY subreddit
ORDER BY subreddit ASC;
    

Average sentiment score

This chart compares the sentiment in each car manufacturer's subreddit; for example, the Toyota bar in the graphic corresponds to the sentiments expressed in the subreddit /r/Toyota.

The --average-- value was added to show the total average regardless of brand (average over *all* available comments from that month in *all* subreddits).

While you would not post into automotive subreddits for upvotes, it's interesting to see that they all have a higher than average sentiment score. BMW and Toyota fans are happy and positive, it seems, while people on the Tesla and Volkswagen subreddits express much less positive sentiment on average.

    
SELECT
    subreddit,
    avg(sentiment_score) :: NUMERIC(7, 4) AS value
FROM comments
WHERE subreddit IN (SELECT name
                    FROM subreddits
                    WHERE name_condensed IN
                          ('volkswagen', 'subaru', 'jeep', 'bmw', 'mazda', 'teslamotors', 'honda', 'audi', 'volvo', 'ford', 'toyota'))
GROUP BY subreddit
ORDER BY subreddit ASC;
    

Average sentiment over all comments that mention a brand in any way

The --average-- value was added to show the total average regardless of brand.

Comments that mention "VW" instead of "Volkswagen" have an average sentiment score of 0.0408, almost the same as the 0.0412 for "Volkswagen".

    
       SELECT
           brand_name,
           avg(sentiment_score) :: NUMERIC(7, 4) AS value
       FROM comments,
           (VALUES ('volkswagen'), ('subaru'), ('jeep'), ('bmw'), ('mazda'), ('tesla'), ('honda'), ('audi'), ('volvo'), ('ford'),
               ('toyota')) AS brands(brand_name)

       WHERE body ~* concat('.*\W', brand_name, '\W.*')
       GROUP BY brand_name
       ORDER BY brand_name;
    

Combined: all-subreddits vs brand-subreddits

It is not surprising that people talk with a more positive sentiment about a brand when they are commenting inside the brand's subreddit rather than in a general comment. What is strange is the big difference between individual brands as seen from the outside (all over reddit) vs the inside (in its own subreddit). Volkswagen sure does have a negative image (in October 2015): a comment mentioning VW has on average a sentiment score that is only half as high as a random other comment. Tesla fans are on average only slightly more positive in the company's subreddit compared to reddit in general, while BMW and Toyota enthusiasts are almost twice as positive when talking about their cars as the commenter next door.

Count of comments that mention a brand in any way

Note that both Ford and Tesla are also names of (real and fictional) people and will appear more often than other brand names.

Volkswagen may be abbreviated to VW (count: 9068).

    
      SELECT
          brand_name,
          count(*)
      FROM comments,
          (VALUES ('volkswagen'), ('subaru'), ('jeep'), ('bmw'), ('mazda'), ('tesla'), ('honda'), ('audi'), ('volvo'), ('ford'),
              ('toyota')) AS brands(brand_name)

      WHERE body ~* concat('.*\W', brand_name, '\W.*')
      GROUP BY brand_name
      ORDER BY brand_name;
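
One detail of the pattern in the query above is worth spelling out: it asks for a non-word character on both sides of the brand name, so a comment that starts or ends with the brand name is not counted. The sketch below reproduces that behaviour in Python and compares it with a word-boundary match. The function names are mine, and I am assuming that Python's `re` module treats `\W` closely enough to PostgreSQL's regex engine for the comparison to hold.

```python
import re


def mentions_like_query(body, brand):
    """Mirror the query's pattern: a non-word character on both sides."""
    pattern = r"\W" + re.escape(brand) + r"\W"
    return re.search(pattern, body, re.IGNORECASE) is not None


def mentions_word_boundary(body, brand):
    """Alternative: word boundaries also match at the start and end of the text."""
    pattern = r"\b" + re.escape(brand) + r"\b"
    return re.search(pattern, body, re.IGNORECASE) is not None


samples = [
    "I drive a Ford.",         # brand in the middle of the comment
    "Ford is fine",            # brand is the first word
    "I love my ford",          # brand is the last word
    "That is not affordable",  # brand only appears inside another word
]

for body in samples:
    print(body, mentions_like_query(body, "ford"), mentions_word_boundary(body, "ford"))
```

Both variants skip "affordable", but only the word-boundary version counts the comments that begin or end with the brand name. Switching patterns would change the counts reported on this page, so the numbers here should be read with the stricter pattern in mind.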
    

Technology stack and data sources

Perspectives

The current statistics only scratch the surface of what's possible with sentiment and brand analysis. It would be interesting:

Bonus

Extended brand subreddit scores

If you are looking for internet karma points, talk about Tesla and cars in general :)

Extended brand subreddit sentiment scores

It would be nice to compare average brand sentiment vs average price of a car of this brand. Is there a correlation between the price of a car and the sentiment people express about it? But the scores for Tesla and Mercedes do not seem to support this thesis.
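
If someone wants to follow up on the price question, a Pearson correlation coefficient would be a reasonable first check. The sketch below is only a starting point: the `pearson` helper is written out by hand, and the price and sentiment lists are placeholder numbers that I made up, not values measured for this page.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data, NOT real measurements: one entry per brand.
avg_price = [20000, 35000, 50000, 80000]
avg_sentiment = [0.10, 0.12, 0.09, 0.11]

r = pearson(avg_price, avg_sentiment)
print(f"correlation between price and sentiment: {r:.3f}")
```

A value near +1 or -1 would hint at a linear relationship, a value near 0 at none. With only around a dozen brands the result would be a rough indication at best.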