Objective:

Assign a sentiment score to each review, from 0.0 to 1.0, where 0.0 is fully negative and 1.0 is fully positive.

For this task, we use a reviews file containing a snapshot of reviews from recent months for a travel company (the first line is a header line).

We would like to do the following:

  1. Explain the different stages we took.
  2. Produce scores that are ‘close enough’ to real-world performance.
  3. Identify some key findings/insights we extracted from the data.

Brief Steps to explain overall process

We will take the following steps to create unsupervised sentiment labels for the review data.

  1. Create a sentiment lexicon based on sentiment words, adjectives, adverbs and negation words. The files are located in the data/sentiment folder.
  2. Load the review data into a dataframe and perform the following steps:
    1. Split paragraphs into sentences and tokenize them into words (we use nltk for this).
    2. Analyse the reviews dataframe using the sentiment lexicon (see the code for more information).
  3. Write code to assign sentiment scores between 0.0 and 1.0.
In [1]:
# Load libraries and some settings 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
import nltk
import math

# print graphs inline
%matplotlib inline
plt.rcParams["figure.figsize"] = (20,3)


pd.options.mode.chained_assignment = None  # default='warn'
In [2]:
# download Punkt sentence tokenizer and pos tagger. 
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sanjaymeena/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sanjaymeena/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Out[2]:
True

Create Sentiment Lexicon Dataframe

We will load the following data:

1. List of sentiment words and their valence
2. List of negation words
3. List of degree words (adjectives, adverbs), labelled by whether they increase or decrease the degree

References for this data:

  1. https://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-analysis.pdf
  2. https://positivewordsresearch.com/list-of-positive-words/
  3. https://github.com/bohana/sentlex
In [3]:
# load sentiment dictionary
sentiment_dict=pd.read_table('data/sentiment/AFINN-111.txt', header=None)
sentiment_dict.columns=['word','valence']
sentiment_dict['word_type']='sentiment'
#sentiment_dict.head()

# load negation words
negative_words=pd.read_table('data/sentiment/negate.txt', header=None)
negative_words.columns=['word']
# negative_words['valence']=0
negative_words['word_type']='negation'
# load degree words
degree_words=pd.read_table('data/sentiment/degree_words.txt', header=None)
degree_words.columns=['word','degree']
degree_words['word_type']='degree'
# degree_words['valence']=0
degree_words.head()


# also load one more source of positive and negative sentiment words
ptrckprry_negative_words=pd.read_table('data/sentiment/negative-words.txt', header=None,comment=';')
ptrckprry_negative_words.columns=['word']
ptrckprry_negative_words['word_type']='sentiment'
ptrckprry_negative_words['valence']=-1



ptrckprry_positive_words=pd.read_table('data/sentiment/positive-words.txt', header=None,comment=';')
ptrckprry_positive_words.columns=['word']
ptrckprry_positive_words['word_type']='sentiment'
ptrckprry_positive_words['valence']=1



# concat to create final sentiment lexicon
sentiment_lexicon=pd.concat([sentiment_dict,ptrckprry_positive_words,ptrckprry_negative_words,negative_words,degree_words], ignore_index=True)


sentiment_lexicon['valence']=sentiment_lexicon['valence'].fillna(0)
sentiment_lexicon['degree']=sentiment_lexicon['degree'].fillna('none')
In [21]:
sentiment_lexicon.sample(n=10)
Out[21]:
      degree  valence  word           word_type
3606  none    1.0      magic          sentiment
4494  none    -1.0     abrade         sentiment
3674  none    1.0      neatly         sentiment
7426  none    -1.0     mocks          sentiment
9342  decr    0.0      sort of        degree
4550  none    -1.0     adulterier     sentiment
7742  none    -1.0     perplexed      sentiment
4955  none    -1.0     bully          sentiment
8676  none    -1.0     tank           sentiment
5506  none    -1.0     deterioration  sentiment

Some information about the data in the sentiment lexicon. Because we rely on lexicon information, a better and richer lexicon will lead to better results.

In [22]:
print 'total word types in sentiment frame : ' , set(sentiment_lexicon['word_type'])
print 'total sentiment words : ' , len(sentiment_lexicon[sentiment_lexicon['word_type']=='sentiment'])
print 'total negation words : ' , len(sentiment_lexicon[sentiment_lexicon['word_type']=='negation'])
print 'total positive sentiment words : ' , len(sentiment_lexicon[sentiment_lexicon['valence'] > 0])
print 'total negative sentiment words : ' , len(sentiment_lexicon[sentiment_lexicon['valence']  < 0])
print 'total degree words : ' , len(sentiment_lexicon[sentiment_lexicon['degree']!='none'])
total word types in sentiment frame :  set(['degree', 'negation', 'sentiment'])
total sentiment words :  9266
total negation words :  60
total positive sentiment words :  2884
total negative sentiment words :  6381
total degree words :  66

Load the Reviews Data

In [6]:
df = pd.read_csv('data/reviews.csv')
In [7]:
print 'Reviews dataframe shape ' , df.shape
print 'Total unique review titles: ' , len(set(df['review_title']))

# print df head
df.head()
Reviews dataframe shape  (274339, 3)
Total unique review titles:  166759
Out[7]:
   hotel_review_id  review_title                         review_comments
0  103237986.0      Friendly staff and comfortable stay  Continental breakfast is 7/10
1  103237985.0      Budget hotel                         Hotel is OK but the breakfast taste is so so.
2  103237979.0      Don't set high expectations          The hotel has a beautiful lobby, delicately de...
3  103237975.0      Good location                        Good location and good view. But check-out tim...
4  103237974.0      Need Renovations                     The rooms and facilities are old and needs tot...

We will define several helper functions for the data analysis and data wrangling:

  • Sentence Splitter and tokenizer
  • Tag sentence with part of speech (POS)
  • Create a sentiment word dictionary for a word token
In [132]:
# sentence splitter and word tokenizer

nltk_splitter=nltk.data.load('data/nltk/tokenizers/punkt/english.pickle')

nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()

# helper to detect missing values (pandas reads empty cells as NaN)

def isnan(value):
    try:
        return math.isnan(float(value))
    except (TypeError, ValueError):
        return False


# function to split paragraph and then tokenize a sentence to words

def split_text(data):
    'split a paragraph into sentences, then return all words as one flat list'
    
    tokenized_sentences=[]
    #print data
    if not isnan(data):
        sentences = nltk_splitter.sentences_from_text(data)
        for sentence in sentences:
            for word in nltk_tokenizer.tokenize(sentence) :
                tokenized_sentences.append(word)
        #tokenized_sentences  = [words in nltk_tokenizer.tokenize(sentence) for sentence in sentences]
    
    return tokenized_sentences

# function to tag pos and create data for word in format : [word, pos]
def tag_pos(sentence):
    'tag sentences with pos information and output words in format : [word,pos]'
    
    
    pos = nltk.pos_tag(sentence) 
    
    #print pos
    return pos


# adds sentiment info for the word using the sentiment lexicon
def getSentimentWordDictionary(word,pos):
    "adds sentiment info for the word using the sentiment lexicon"
    
    #word=word.lower()
    dictionary={}
    dictionary['pos']=pos
    dictionary['word']=word
    
    if word in sentiment_lexicon['word'].values:
        #print sentiment_lexicon['word']
        dictword=sentiment_lexicon[sentiment_lexicon['word']==word]
        dictionary['type']=dictword['word_type'].values[0]
        dictionary['valence']=float(dictword['valence'].values[0])
        dictionary['degree']=dictword['degree'].values[0]
    else:
        dictionary['type']='none'
        dictionary['valence']=0
   
        
    return dictionary
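
Because the lexicon is a concatenation of several sources, the same word can appear in more than one row, and the lookup above keeps the first match (`.values[0]`), so sources listed earlier in the concat take precedence. The lookup also scans the whole `word` column for every token. A minimal sketch of the same first-match behaviour with a plain dict index is shown here; the rows are made-up toy entries, and `build_index` and `lookup` are hypothetical names, not part of the notebook.

```python
# Toy rows in concat order: (word, word_type, valence, degree).
# These entries are invented for illustration only.
toy_rows = [
    ("good", "sentiment", 3.0, "none"),   # e.g. from a graded source
    ("good", "sentiment", 1.0, "none"),   # same word from a +1/-1 source
    ("not", "negation", 0.0, "none"),
    ("very", "degree", 0.0, "incr"),
]

def build_index(rows):
    """Map each word to its first entry, mirroring `.values[0]` in the lookup."""
    index = {}
    for word, word_type, valence, degree in rows:
        # setdefault keeps the first entry seen, so earlier sources win
        index.setdefault(word, {"type": word_type, "valence": valence, "degree": degree})
    return index

def lookup(index, word, pos=""):
    """Return a dictionary shaped like the one from getSentimentWordDictionary.

    Unlike the original, unknown words also get a 'degree' key here.
    """
    entry = index.get(word, {"type": "none", "valence": 0, "degree": "none"})
    result = {"pos": pos, "word": word}
    result.update(entry)
    return result

toy_index = build_index(toy_rows)
print(lookup(toy_index, "good")["valence"])   # 3.0, the first source wins
print(lookup(toy_index, "lobby")["type"])     # none
```

A dict lookup is constant time per token, which may matter with roughly 100,000 distinct tokens, but whether it changes any score depends on the real lexicon files.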

We will also create functions to build the sentiment dictionary from the reviews data. It is also possible to extend the sentiment lexicon using the reviews data, but we won't do that for now.

In [133]:
def review_sentiment_dict(dataframe):
    """create sentiment dictionary for the reviews data using the sentiment lexicon.
    The first token of each review is lowercased before lookup. Every distinct token
    gets an entry; tokens that are not in the sentiment lexicon get type 'none'."""
    review_sent_dict={}
    
#     seen_set=set()
    pos=''
    
    data=[]
    
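    # collect tokens from both reviews and review_title
    # (note: the loop below iterates over the review tokens only, so title
    # words that never occur in a review get no entry)
    data=dataframe['tokenized'].tolist()
    data.extend(dataframe['review_title_tokenized'].tolist())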
    
    for review in dataframe.tokenized:
        # lowercase the words
        #review=[word.lower() for word in review]
        review=[word for word in review]
        sentiment_words=[]
        for index,word in enumerate(review):
            if index==0:
                word=word.lower()
            if word not in review_sent_dict : 
#                 seen_set.add(word)
                word_dict=getSentimentWordDictionary(word,pos)
#                 if len(word_dict.keys())> 2 and word_dict['type'] !='none' :
                review_sent_dict[word]=word_dict
                    
                    
    return review_sent_dict      

We will create a copy of the data frame and add more useful information to it.

In [10]:
df_copy=df.copy()

sub=df_copy

Let's add two columns for the tokenized versions of review_comments and review_title.

In [11]:
sub['tokenized']=sub['review_comments'].apply(split_text)
sub['review_title_tokenized']=sub['review_title'].apply(split_text)

Let's also create the review sentiment dictionary at this stage, based on the tokenized version of the reviews.

In [134]:
review_sent_dict=review_sentiment_dict(sub)

print 'total entries in the review sentiment dict : ' , len(review_sent_dict.keys())
total entries in the review sentiment dict :  109686
In [136]:
#review_sent_dict

We will now add the following columns to the data:

1. Sentiment words present in review title
2. Sentiment words present in reviews
3. Positive sentiment words
4. Negative sentiment words
5. Degree words
6. Columns for total word count and total sentiment word count

The following code adds these columns to the dataframe.

In [137]:
# Function to create more information columns 
def collect(sentence,word_type):
    
    filtered=[]
    for words in sentence:
        
        if words in review_sent_dict:
            wdict=review_sent_dict.get(words)
            #print wdict
            if wdict['type'] == word_type:
                filtered.append(words)
        
    return filtered


def sentiment_polarity(sentence,polarity):
    # (word,word_type,valence,degree)
    # e.g ('clean', 'sentiment', 2.0, 'none')
    filtered=[]
    for words in sentence:
        if words in review_sent_dict:
            wdict=review_sent_dict.get(words)
            word_polarity=wdict['valence']
            if polarity=='pos':
                if word_polarity>=1:
                    filtered.append(words)
            else:
                if word_polarity<0:
                    filtered.append(words)
        
    return filtered
In [138]:
sub['title_sentiment_words']=sub['review_title_tokenized'].apply(collect,args=['sentiment'])

sub['sentiment_words']=sub['tokenized'].apply(collect,args=['sentiment'])
sub['pos_sentiment']=sub['tokenized'].apply(sentiment_polarity,args=['pos'])
sub['neg_sentiment']=sub['tokenized'].apply(sentiment_polarity,args=['neg'])

sub['negation_words']=sub['tokenized'].apply(collect,args=['negation'])
sub['degree_words']=sub['tokenized'].apply(collect,args=['degree'])

sub['total_words']=sub['tokenized'].apply(len)
sub['total_sentiment_words']=sub['sentiment_words'].apply(len)
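
To make the filters concrete, here is a small sketch of what the two helper functions select, using a hypothetical stand-in for `review_sent_dict` with invented entries. The function names `words_of_type` and `words_of_polarity` are illustrative only. The lookup is case-sensitive, so a capitalised token such as `Good` is not collected unless it was lowercased beforehand.

```python
# Hypothetical stand-in for review_sent_dict, with made-up entries.
toy_sent_dict = {
    "good":  {"type": "sentiment", "valence": 3.0},
    "dirty": {"type": "sentiment", "valence": -2.0},
    "not":   {"type": "negation",  "valence": 0.0},
    "very":  {"type": "degree",    "valence": 0.0},
    "room":  {"type": "none",      "valence": 0},
}

def words_of_type(tokens, word_type, sent_dict):
    """Keep tokens whose dictionary entry has the requested type (like collect)."""
    return [t for t in tokens if t in sent_dict and sent_dict[t]["type"] == word_type]

def words_of_polarity(tokens, polarity, sent_dict):
    """Keep positive (valence >= 1) or negative (valence < 0) tokens."""
    if polarity == "pos":
        return [t for t in tokens if t in sent_dict and sent_dict[t]["valence"] >= 1]
    return [t for t in tokens if t in sent_dict and sent_dict[t]["valence"] < 0]

tokens = ["Good", "room", "but", "not", "very", "good", "and", "dirty"]

print(words_of_type(tokens, "sentiment", toy_sent_dict))   # ['good', 'dirty']
print(words_of_type(tokens, "negation", toy_sent_dict))    # ['not']
print(words_of_polarity(tokens, "pos", toy_sent_dict))     # ['good']
print(words_of_polarity(tokens, "neg", toy_sent_dict))     # ['dirty']
```

Degree and negation words carry a valence of 0, so they never show up in either polarity list.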

Let's look at how the dataframe looks now:

In [139]:
sub.head()
Out[139]:
hotel_review_id review_title review_comments tokenized review_title_tokenized title_sentiment_words sentiment_words pos_sentiment neg_sentiment negation_words degree_words total_words total_sentiment_words review_score
0 103237986.0 Friendly staff and comfortable stay Continental breakfast is 7/10 [Continental, breakfast, is, 7/10] [Friendly, staff, and, comfortable, stay] [comfortable] [] [] [] [] [] 4 0 0.5000
1 103237985.0 Budget hotel Hotel is OK but the breakfast taste is so so. [Hotel, is, OK, but, the, breakfast, taste, is... [Budget, hotel] [] [] [] [] [] [] 11 0 0.5000
2 103237979.0 Don't set high expectations The hotel has a beautiful lobby, delicately de... [The, hotel, has, a, beautiful, lobby, ,, deli... [Do, n't, set, high, expectations] [] [beautiful, nice, clean, smooth, like, nice, a... [beautiful, nice, clean, smooth, like, nice, a... [stalls, cheap, rip] [wasnt, n't, n't] [] 198 18 0.9834
3 103237975.0 Good location Good location and good view. But check-out tim... [Good, location, and, good, view, ., But, chec... [Good, location] [] [good] [good] [] [] [] 13 1 0.8402
4 103237974.0 Need Renovations The rooms and facilities are old and needs tot... [The, rooms, and, facilities, are, old, and, n... [Need, Renovations] [] [] [] [] [] [] 10 0 0.5000

Some plots of the dataframe information

In [140]:
# histogram of total words per review

plt.figure(1)


x=sub.total_words

plt.subplot(1, 2, 1)
plt.hist(x, bins=50, facecolor='blue')
plt.xlabel('total words in a review')
plt.ylabel('count')
plt.title('total words in a review')


plt.figure(2)
arr=[len(row) for row in sub.sentiment_words]
plt.subplot(1, 2, 1)
plt.hist(arr, bins=20, facecolor='green')
plt.xlabel('sentiment words in a review')
plt.ylabel('count')
plt.title('sentiment words in a review')

plt.figure(3)
arr=[len(row) for row in sub.pos_sentiment]
plt.subplot(1, 2, 1)
plt.hist(arr, bins=20, facecolor='green')
plt.xlabel('positive sentiment words in a review')
plt.ylabel('count')
plt.title('+ve sentiment words in a review')


arr=[len(row) for row in sub.neg_sentiment]
plt.subplot(1, 2, 2)
plt.hist(arr, bins=20, facecolor='red')
plt.xlabel('-ve sentiment words in a review')
plt.ylabel('count')
plt.title('-ve sentiment words in a review')


plt.figure(4)
plt.subplot(1, 2, 1)
arr=[len(row) for row in sub.negation_words]
plt.hist(arr, bins=30, facecolor='orange')
plt.xlabel('negation words in a review')
plt.ylabel('count')
plt.title('negation words in a review')

plt.tight_layout()
plt.show()

We will now add code to assign sentiment scores to reviews. We use the following ideas:

  1. Positive and negative degree words boost the sentiment of the words they modify.
  2. Negation in the sentence is handled.
  3. The positive and negative valence of the sentiment words is used to calculate scores.

There is a lot more to cover in the scoring system that is not done in this report for now:

  • Dealing with but clauses, sentence discourse, etc.
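
The negation idea from the list above can be sketched in isolation. This is a simplified illustration, not the notebook's code: `flip_negated` is a hypothetical name, and the valences are assumed toy values rather than values read from the lexicon files. It uses the same window as `check_for_negation_case`, which runs from one token before the negation word to two tokens after it.

```python
def flip_negated(valences, negation_positions):
    """Flip the sign of nonzero valences near each negation word."""
    flipped = list(valences)
    last = len(flipped) - 1
    for idx in negation_positions:
        # window used by check_for_negation_case: from one token before the
        # negation to two tokens after it; the upper bound min(idx + 3, last)
        # also leaves out the final token of the review
        for i in range(max(idx - 1, 0), min(idx + 3, last)):
            if flipped[i] != 0:
                flipped[i] = -1.0 * flipped[i]
    return flipped

tokens = ["sauna", "did", "not", "work", "and", "hot", "tub", "was", "too", "hot", "."]
# assumed toy valences: "work" and "hot" count as +1, everything else as 0
valences = [0, 0, 0, 1.0, 0, 1.0, 0, 0, 0, 1.0, 0]
negations = [i for i, tok in enumerate(tokens) if tok == "not"]

# "work" becomes -1.0; both occurrences of "hot" lie outside the window
print(flip_negated(valences, negations))
```

With these assumed valences the review still sums to +1 after the flip, because the two `hot` tokens outweigh the negated `work`. This is the kind of case a richer rule set would need to handle.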
In [141]:
pos_degree=0.3
neg_degree=0.3

def normalize(sentiments,alpha=15):
    """
    Normalize the summed valence to a score between 0 and 1, using the
    normalization function from "VADER: A Parsimonious Rule-based Model for
    Sentiment Analysis of Social Media Text". Negative results are clipped to
    0.0 and a review with only neutral words scores 0.5.
    """
    
    
    minV=min(sentiments)
    maxV=max(sentiments)
     
    score=float(sum(sentiments))    
    norm_score = score/math.sqrt((score*score) + alpha)
    
    #print norm_score
    if minV==maxV==0: #Neutral
        norm_score=0.5
    elif norm_score < 0: 
        norm_score= 0.0
    elif norm_score > 1.0:
        norm_score=1.0
    
    return norm_score

def calculate_polarity_scores(sentence):
    #print sentence
    sentiments = []
    #print sentence
    sent= ' '.join(elem[0] for elem in sentence)
    
    def createSentimentList(sentence):
        for index, elem in enumerate(sentence):
            #print(index, elem)
            valence = 0
            word=elem[0]
            lexicon=elem[1]
            #print lexicon

            if lexicon['type']=='sentiment':
                valence=lexicon['valence']
            elif lexicon['type']=='degree':
                if lexicon['degree']=='incr':
                    valence=pos_degree
                else:
                    valence=neg_degree
            sentiments.append(valence)
        #print sentiments
        return sentiments
    
    
    def organize_sentiment_scores(sentiments):
        # want separate positive versus negative sentiment scores
        pos_sum = 0.0
        neg_sum = 0.0
        neu_count = 0
        for sentiment_score in sentiments:
            if sentiment_score > 0:
                pos_sum += (float(sentiment_score) +1) # compensates for neutral words that are counted as 1
            if sentiment_score < 0:
                neg_sum += (float(sentiment_score) -1) # when used with math.fabs(), compensates for neutrals
            if sentiment_score == 0:
                neu_count += 1
        return pos_sum, neg_sum, neu_count
    
    def score_valence(sentence,sentiments):
        sentiment_dict={}
        #print 'sentiments ' , sentiments
        if sentiments:
            sum_s = float(sum(sentiments))
           
            # discriminate between positive, negative and neutral sentiment scores
            pos_sum, neg_sum, neu_count = organize_sentiment_scores(sentiments)    

            #print  pos_sum, neg_sum, neu_count
            total = pos_sum + math.fabs(neg_sum) + neu_count
            pos = math.fabs(pos_sum / total)
            neg = math.fabs(neg_sum / total)
            neu = math.fabs(neu_count / total)

            
#             compound= normalize(float(pos+(-1.0 *neg) +neu))
            
            compound = normalize(sentiments)
            sentiment_dict = \
            {"neg" : round(neg, 3),
             "neu" : round(neu, 3),
             "pos" : round(pos, 3),
             "compound" : round(compound, 4)}

        return sentiment_dict
    
    
    
    def updateSentimentList(sentence,sentiments):
        "update the sentiment list based on rules"
        
        sentiments=check_for_negation_case(sentence,sentiments)
        
        return sentiments
    

    def check_for_negation_case(sentence,sentiments):
        "check for negation case in a sentence"
        
        
        for index, elem in enumerate(sentence):
            #print(index, elem)
            
            word=elem[0]
            lexicon=elem[1]
            
            valence=lexicon['valence']
            word_type=lexicon['type']
            found_neg_object=False
            if word_type == 'negation':
#                 negation_list[index]=-1
                found_neg_object=True
                lookup_range=range(max(index-1,0), min(index+3,len(sentence)-1))
                #print lookup_range , ' for negation at index : ', index
                
                for i in lookup_range:
                    #print sentiments[i]
                    if sentiments[i] > 0 or sentiments[i] < 0:
                        found_neg_object=True
                        sentiments[i]= -1. * sentiments[i]
                        #print 'negation found and updated ', sentiments[i]
                if not  found_neg_object:
                    sentiments[index]=-1.0
              
        #print negation_list , 'look up range : ', lookup_range, 'updated sentiment list ', sentiments
        
        return sentiments
    
    
    # create the sentiment list on word tokens in sentence
    sentiments=createSentimentList(sentence)   
    
    # update the sentiment list based on rules 
    sentiments=updateSentimentList(sentence,sentiments)
    
    # calculate the final sentiment scores
    scores=score_valence(sentence,sentiments)
    
    #print sent  
    #print sentence
    #print '  '
    #print '-> score: '   , scores
   # print ' '
    
    return scores
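
The arithmetic of the final scoring step can be checked by hand. The sketch below restates the normalization and the pos/neg/neu proportions as two small functions (`compound_score` and `proportions` are hypothetical names) and applies them to the 13 tokens of "Good location and good view . But check-out time is too soon .", assuming that both occurrences of "good" carry a valence of 3 and every other token is neutral.

```python
import math

def compound_score(valences, alpha=15):
    """Squash the summed valence the way normalize() does."""
    if not any(valences):              # every token is neutral
        return 0.5
    total = float(sum(valences))
    score = total / math.sqrt(total * total + alpha)
    return max(score, 0.0)             # negative results are clipped to 0.0

def proportions(valences):
    """Return (pos, neg, neu) shares as in score_valence()."""
    pos_sum = sum(v + 1.0 for v in valences if v > 0)
    neg_sum = sum(v - 1.0 for v in valences if v < 0)
    neu_count = sum(1 for v in valences if v == 0)
    total = float(pos_sum + abs(neg_sum) + neu_count)
    return (round(pos_sum / total, 3),
            round(abs(neg_sum) / total, 3),
            round(neu_count / total, 3))

# assumed valences: "good" = 3 twice, the other 11 tokens neutral
valences = [3.0, 0, 0, 3.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(round(compound_score(valences), 4))   # 6 / sqrt(36 + 15) = 0.8402
print(proportions(valences))                # (0.421, 0.0, 0.579)
print(compound_score([0, 0, 0]))            # 0.5
print(compound_score([-1.0, 0, 0]))         # 0.0
```

Under that assumption the numbers agree with the scores reported for this review. Because of the clipping, every review whose valences sum to zero or less receives exactly 0.0, so the compound score does not distinguish between mildly and strongly negative reviews.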
In [170]:
# function to attach word-level sentiment info to each (word, pos) pair
def addWordSentiment(sentence,lower_first_word=True):
    
    sentimentwordsInSentence=[]
    #print 'addWordSentiment ',  sentence
    for index,(word,postag) in enumerate(sentence):
        if lower_first_word and index==0:
            word=word.lower()
        sentimentwordsInSentence.append((word,getSentimentInfo(word,postag))) 
        
   # sentimentwordsInSentence = [(word,getSentimentInfo(word,postag)) for (word, postag) in sentence]
    
    #print sentimentwordsInSentence
    return sentimentwordsInSentence


# adds sentiment info for the word using the sentiment lexicon
def getSentimentInfo(word,pos):
    
    #word=word.lower()
    dictionary={}
    dictionary['pos']=pos
    dictionary['word']=word
    
    if word in review_sent_dict:
        
        dictionary=review_sent_dict[word]
        #print 'word , ' ,dictionary
    else:
#         print 'word not found ', word
        dictionary['type']='none'
        dictionary['valence']=0
        
    return dictionary
In [171]:
def calculate_sentiment(review,review_sentiment_dict):
    "calculate sentiment scores in format : score:  {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.5}"
    
    sentences_with_pos= tag_pos(review) 
    #print sentences_with_pos

    
    #print sentence
    sentence=addWordSentiment(sentences_with_pos,True)
    score=calculate_polarity_scores(sentence)
    return score
    
def calculate_one_sentiment(review,review_sentiment_dict):
    "Calculate one compound sentiment score"
    score={}
    score['compound']=0
    
    if len(review)>=1:
        score=calculate_sentiment(review,review_sentiment_dict)
#     else:
#         print 'no words in review', review
    return score['compound']

Let's predict sentiment scores for reviews 1 to 9 (the slice skips the first review, at index 0).

In [172]:
reviews=sub['tokenized'][1:10].tolist()

#string='My stay at hotel was fantastic !'
#reviews=[string.split()]
#print reviews
for review in reviews:
    score =calculate_sentiment(review,review_sent_dict)
    sent= ' '.join(review)

    print 'review: ' , sent
    print 'score: ', score
    print ''
    
review:  Hotel is OK but the breakfast taste is so so .
score:  {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.5}

review:  The hotel has a beautiful lobby , delicately designed and tasteful . Everything was nice and clean , even the check in process was smooth . I booked for a junior suite and it wasnt what i was expecting . The room is smaller than it looks like in the pictures , but that is how it always is . Furnitures are abit aged . But the toilet and the bathtub is such a nice place to relax in . The hotel is nearby a stretch of authentic local food stalls and it is a must to visit especially in the evening once the malls and shops are closed ! Trust me its a great experience and who does n't love cheap delicious street food ? Ask the concierge and they will lead you the way , however the person that assisted us did n't really explain the directions properly so we ended up walking a whole big round . Hotel is accessible nearby to mall and even batam ferry centre ( walking distance ) Do opt for their taxi transfer if you need to , because the taxis even from the mall straight are rip offs !
score:  {'neg': 0.035, 'neu': 0.789, 'pos': 0.175, 'compound': 0.9818}

review:  Good location and good view . But check-out time is too soon .
score:  {'neg': 0.0, 'neu': 0.579, 'pos': 0.421, 'compound': 0.8402}

review:  The rooms and facilities are old and needs total renovation
score:  {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.5}

review:  Good all round hotel though a little difficult to find the front entrance .
score:  {'neg': 0.111, 'neu': 0.667, 'pos': 0.222, 'compound': 0.4588}

review:  only complaint was the number of weddings conducted during our stay which
score:  {'neg': 0.214, 'neu': 0.786, 'pos': 0.0, 'compound': 0.0}

review:  sauna did not work and hot tub was too hot .
score:  {'neg': 0.143, 'neu': 0.571, 'pos': 0.286, 'compound': 0.25}

review:  Decent hotel at good location . Could improve on cleanliness . Facilities are adequate .
score:  {'neg': 0.0, 'neu': 0.435, 'pos': 0.565, 'compound': 0.9001}

review:  Value for money if you 're looking for a comfortable place . Breakfast spread is simple and sufficient if you 're fuss-free . Suitable for family or friends .
score:  {'neg': 0.0, 'neu': 0.844, 'pos': 0.156, 'compound': 0.6124}

Let's tag the reviews with sentiment scores now. We will also tag review_title to see if we can find some insight.

In [162]:
%%time

reviews_with_scores=sub

reviews_with_scores['review_score']= reviews_with_scores['tokenized'].apply(calculate_one_sentiment,args=[review_sent_dict])

Let's also tag the review titles with sentiment scores.

In [174]:
%%time
reviews_with_scores['review_title_score']= reviews_with_scores['review_title_tokenized'].apply(calculate_one_sentiment,args=[review_sent_dict])
CPU times: user 2min 11s, sys: 3.94 s, total: 2min 15s
Wall time: 2min 16s
In [183]:
save_reviews_with_scores='./data/reviews_with_scores.csv'
save_df=reviews_with_scores[['hotel_review_id','review_title','review_title_score','review_comments','review_score']]
save_df.to_csv(save_reviews_with_scores,index=False)
In [185]:
result=pd.read_csv(save_reviews_with_scores)
result.head()
Out[185]:
   hotel_review_id  review_title                         review_title_score  review_comments                                    review_score
0  103237986.0      Friendly staff and comfortable stay  0.7184              Continental breakfast is 7/10                      0.5000
1  103237985.0      Budget hotel                         0.5000              Hotel is OK but the breakfast taste is so so.      0.5000
2  103237979.0      Don't set high expectations          0.5000              The hotel has a beautiful lobby, delicately de...  0.9818
3  103237975.0      Good location                        0.6124              Good location and good view. But check-out tim...  0.8402
4  103237974.0      Need Renovations                     0.5000              The rooms and facilities are old and needs tot...  0.5000
In [197]:
A=result['review_score'].tolist()
B=result['review_title_score'].tolist()


A=np.asarray(A)
B=np.asarray(B)
In [198]:
print "difference:", A - B
print "SAD:", np.sum(np.abs(A - B))
print "SSD:", np.sum(np.square(A - B))
print "correlation:", np.corrcoef(np.array((A, B)))[0, 1]
difference: [-0.2184  0.      0.4818 ...,  0.2088  0.1566  0.2333]
SAD: 65635.3117
SSD: 27204.9578292
correlation: 0.350834261606
In [204]:
from scipy import stats
stats.pearsonr(A, B)
Out[204]:
(0.35083426160561737, 0.0)
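
For readers unfamiliar with the three agreement measures used above, here is a plain-Python sketch of the sum of absolute differences (SAD), the sum of squared differences (SSD) and the Pearson correlation, applied to the five rows shown in the result table. The `pearson` helper is a hypothetical hand-written version for illustration; the notebook itself relies on numpy and scipy.

```python
import math

# review and title scores of the first five rows shown in the result table
review_scores = [0.5, 0.5, 0.9818, 0.8402, 0.5]
title_scores = [0.7184, 0.5, 0.5, 0.6124, 0.5]

diffs = [a - b for a, b in zip(review_scores, title_scores)]
sad = sum(abs(d) for d in diffs)    # sum of absolute differences, about 0.928
ssd = sum(d * d for d in diffs)     # sum of squared differences, about 0.3317

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(sad)
print(ssd)
print(pearson(review_scores, title_scores))
```

Five rows are far too few to say anything about the correlation of 0.35 measured on the full data; the sketch only shows how each number is computed.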
In [206]:
plt.plot(A-B)
#plt.plot(x, y)
plt.show()