The detailed results and steps are given in the Jupyter notebook (.ipynb).
A project that conducts sentiment analysis on WhatsApp messages, exploring patterns and intensity of communication throughout the day.
To get started, clone the project and navigate to the project directory. Install the required dependencies:
git clone https://github.com/ecetsn/CS210_Term_Project.git
cd CS210_Term_Project/
pip install -r requirements.txt
- Python
- Jupyter Notebook
- Python (>=3.6)
- Jupyter Notebook
- ZEMBEREK Turkish NLP library
- BERT-based Turkish sentiment analysis model
- Install the necessary libraries mentioned in the notebook.
!pip install pandas
!pip install transformers
!pip install zemberek-nlp
!pip install plotly
!pip install seaborn
!pip install matplotlib
- Ensure the availability of WhatsApp data for analysis.
- Run the notebook in a Jupyter environment, following each step.
No additional building steps are required for this project.
No specific deployment steps are needed as this project primarily focuses on analysis and exploration.
No user-configurable parameters. The configuration involves installing the required dependencies and setting up the environment.
No external API is used. The project primarily utilizes Python libraries for sentiment analysis and spell-checking.
- Data Collection
- Collecting conversational data from WhatsApp, ensuring the inclusion of timestamps for each interaction.
- Data Cleaning
- Handle missing values
- Convert to lowercase
- Remove special characters
- Remove links
- Tokenization
- Remove stopwords
- Spell checking
- Set the environment for Spell Checking using ZEMBEREK Turkish NLP
- Sentiment Analysis
- Set the environment for Sentiment Analysis using BERT-based Turkish Model
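The cleaning steps listed above can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function name `clean_message`, the regular expressions, and the tiny `STOPWORDS` set are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword set; a real run would use a fuller Turkish list.
STOPWORDS = {"ve", "bir", "bu", "da", "de"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^\w\s]")


def clean_message(text):
    """Return the cleaned tokens of one message (empty list for missing values)."""
    if text is None:                      # handle missing values
        return []
    text = text.lower()                   # convert to lowercase
    text = URL_RE.sub(" ", text)          # remove links before punctuation is stripped
    text = NON_WORD_RE.sub(" ", text)     # remove special characters
    tokens = text.split()                 # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords


print(clean_message("Bu bir TEST! https://example.com Merhaba :)"))
```

Links are removed before special characters so that a URL is dropped as a whole rather than broken into word fragments. Note that `str.lower()` is not Turkish-aware (it maps `I` to `i`, not `ı`), so a Turkish-specific lowercase step may be needed for real data.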
After data collection, the intensity of the conversation is observed by time and date.
- Get the intensity of messages
- Observe them by the periods of the day
- Combine this knowledge with the sentiment score of the messages
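As a rough sketch of the intensity step, the messages can be bucketed into periods of the day and counted. The hour boundaries below and the names `time_slot` and `message_intensity` are assumptions for illustration; the notebook may define the periods differently.

```python
from collections import Counter
from datetime import datetime


def time_slot(ts):
    """Map a timestamp to a period of the day (assumed boundaries)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "noon"
    if 18 <= hour < 24:
        return "evening"
    return "night"  # 00:00 to 05:59


def message_intensity(timestamps):
    """Count how many messages fall into each period of the day."""
    return Counter(time_slot(ts) for ts in timestamps)


# Hypothetical timestamps standing in for parsed WhatsApp messages.
sample = [
    datetime(2023, 1, 1, 9, 0),
    datetime(2023, 1, 1, 13, 30),
    datetime(2023, 1, 1, 14, 0),
    datetime(2023, 1, 1, 23, 15),
    datetime(2023, 1, 2, 2, 0),
]
print(message_intensity(sample))
```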
In this analysis, I examine the trends in weighted sentiment over time by grouping the data based on date and time slots. The weighted sentiment is calculated and averaged for each group, resulting in a pivot table that provides insights into sentiment patterns during different times of the day.
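A minimal sketch of the grouping and pivot step with pandas is shown below. The column names (`date`, `time_slot`, `weighted_sentiment`) and the sample values are hypothetical, and the weighted sentiment is assumed to be already computed for each message.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook these would come from the cleaned chat data.
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
    "time_slot": ["morning", "morning", "noon", "morning", "evening"],
    "weighted_sentiment": [0.5, -0.1, 0.8, 0.3, -0.6],
})

# One row per date, one column per time slot, each cell the mean weighted sentiment.
pivot = df.pivot_table(
    index="date",
    columns="time_slot",
    values="weighted_sentiment",
    aggfunc="mean",
)
print(pivot)
```

Cells for date and time-slot combinations with no messages are left as `NaN`, which should be kept in mind when the columns are compared later.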
In this analysis, a table containing sentiment scores for every message is presented. The sentiment scores were calculated using the BERT-Turkish model. This table serves as the foundation for hypothesis testing to explore patterns and trends in sentiment across different time slots.
- The table provides a comprehensive view of sentiment scores for each message. These scores include both positive and negative values, allowing for a detailed examination of the sentiment distribution.
- The table provides a comprehensive view of positive sentiment scores for each message
- The table provides a comprehensive view of negative sentiment scores for each message
These findings on sentiment correlation are used to test my hypotheses:
- Null hypothesis: There is no significant correlation between the selected periods of the day.
- Alternative hypothesis: There is a significant correlation between the selected periods of the day.
- In this section, I employ the Pearson correlation test to assess the correlation between different variables. The `pearsonr` function is used to calculate both correlation coefficients and associated p-values for hypothesis testing.
- The Pearson correlation test is employed to understand the strength and direction of the linear relationship between two variables. This analysis provides insight into whether changes in one variable are associated with systematic changes in another.
- The code snippet uses the `pearsonr` function from the `scipy.stats` library to conduct hypothesis testing based on the Pearson correlation coefficient. This coefficient measures the linear relationship between two variables, providing insights into the strength and direction of their association. The accompanying p-value assists in evaluating the statistical significance of the observed correlation.
- In this context, the null hypothesis posits that there is no correlation between the specified pairs of variables. The code then calculates the p-value associated with the correlation coefficients for morning vs. night average sentiment and evening vs. noon average sentiment. The significance level (alpha) is set to 0.05, a commonly used threshold in hypothesis testing.
- The subsequent evaluation of results involves comparing the computed p-values with the chosen significance level. If a p-value is less than the significance level, it suggests that there is a statistically significant correlation between the respective pairs of variables. Conversely, if the p-value exceeds the significance level, the conclusion is that there is no significant correlation.
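The test described above can be sketched as follows. The helper name `correlation_test` and the sample values are hypothetical and serve only to illustrate the comparison of the p-value with alpha; they are not the project's data.

```python
from scipy.stats import pearsonr

ALPHA = 0.05  # significance level stated in this README


def correlation_test(x, y, alpha=ALPHA):
    """Run a Pearson correlation test and compare the p-value with alpha."""
    r, p_value = pearsonr(x, y)
    return {
        "r": float(r),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),  # True means the null hypothesis is rejected
    }


# Hypothetical daily average sentiment per time slot (one value per day).
evening_avg = [0.10, 0.25, 0.30, 0.45, 0.50, 0.65]
noon_avg = [0.12, 0.20, 0.35, 0.40, 0.55, 0.60]

result = correlation_test(evening_avg, noon_avg)
print(result)
```

Both inputs must have the same length and be aligned by day, so days with a missing value in either time slot would need to be dropped first.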
- There is no significant correlation between morning average sentiment and night average sentiment.
- There is no significant correlation between morning average sentiment and evening average sentiment.
- There is no significant correlation between morning average sentiment and noon average sentiment.
- There is no significant correlation between night average sentiment and evening average sentiment.
- There is no significant correlation between night average sentiment and noon average sentiment.
- There is a significant correlation between evening average sentiment and noon average sentiment.
The lack of significant correlations in most comparisons suggests that the sentiment during one time slot is generally independent of the sentiment during other time slots. This indicates that factors influencing sentiment may vary throughout the day. However, the significant correlation between evening and noon sentiment implies a potential pattern or similarity in sentiment during these specific time slots. Further investigation into the nature of this correlation may provide valuable insights into factors influencing sentiment during these times. This analysis lays the groundwork for understanding the temporal dynamics of sentiment and can guide future explorations or targeted interventions during specific time slots.
By total number of messages, the time slots rank as follows:
- Noon has the highest total number of messages, indicating that this time slot is the most active in terms of messaging intensity.
- The evening comes next in terms of total messages, suggesting a considerable level of communication during this period.
- Nighttime exhibits a lower but significant total number of messages, signifying a notable level of activity during nighttime hours.
- Morning shows the lowest total number of messages among the time slots, indicating relatively lower messaging activity during the morning hours.
The observed ranks in messaging intensity provide valuable insights into my messaging habits. The highest activity during noon and evening may be influenced by various factors such as work schedules, social interactions, or personal preferences. The lower messaging intensity in the morning could be attributed to factors like work commitments or the start of the day.
Understanding these patterns can help me manage my communication effectively and adapt to the natural rhythm of my messaging behaviour throughout the day.
This project is licensed under the MIT License. For details, refer to the LICENSE file.