chat_analyzer package
Submodules
chat_analyzer.analyzer module
- chat_analyzer.analyzer.check_chatlog_downloader_supported(chatlog: Chat, url: str)
Ensures that we are able to properly analyze the downloaded chatlog by enforcing the metadata matches expected values. (Our own logic depends on certain things being true)
If there is a breach of compliance, we exit (we consider this to be a fatal error)
- Parameters
chatlog (chat_downloader.sites.common.Chat) – The chatlog we have downloaded
url (str) – The URL of the video we have downloaded the log from
- Currently we ensure:
The video/stream happened in the past (not currently live or scheduled)
The chat is downloaded from a supported platform that we have a proper way of parsing.
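The two checks listed above can be sketched roughly as follows. This is an illustration only, not the package's actual logic: the function name `ensure_supported`, the `SUPPORTED_PLATFORMS` set, and the `"past"` status label are assumptions made for the example, and the real function reads its information from the `Chat` object's metadata.

```python
import sys
from urllib.parse import urlparse

# Assumed for illustration: hostnames treated as supported platforms.
SUPPORTED_PLATFORMS = {"www.youtube.com", "youtu.be", "www.twitch.tv"}


def ensure_supported(url: str, status: str) -> str:
    """Return the platform hostname, or exit if either check fails.

    `status` stands in for whatever the chatlog metadata reports; the
    value "past" is a hypothetical label for a finished stream.
    """
    platform = urlparse(url).netloc
    if platform not in SUPPORTED_PLATFORMS:
        sys.exit(f"Unsupported platform: {platform!r}")
    if status != "past":
        sys.exit(f"Stream is not a past broadcast (status={status!r})")
    return platform
```

Exiting through `sys.exit` mirrors the "fatal error" behaviour described above; a library-style variant might raise an exception instead.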
- chat_analyzer.analyzer.get_ChatAnalytics_from_file(filepath: str)
Given a path to a previous output file of this program containing analytical data, extract the data into a ChatAnalytics object and return it.
- Parameters
filepath (str) – a path to a previous output file of this program containing analytical data
- Returns
a ChatAnalytics object
- Return type
ChatAnalytics
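A rough sketch of reading such a file back into a dataclass is shown below. `MiniAnalytics` is a hypothetical stand-in carrying only a few of the fields listed for `ChatAnalytics`, and the assumption that the output file is a flat JSON object with matching keys is made purely for illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class MiniAnalytics:
    """Hypothetical subset of the ChatAnalytics fields."""
    duration: float
    interval: int
    platform: str
    totalActivity: int = 0


def load_analytics(filepath: str) -> MiniAnalytics:
    """Read a JSON file and build a MiniAnalytics from the keys it knows."""
    with open(filepath, encoding="utf-8") as handle:
        raw = json.load(handle)
    # Keep only keys the dataclass declares, ignoring anything else.
    known = {field.name for field in fields(MiniAnalytics)}
    return MiniAnalytics(**{k: v for k, v in raw.items() if k in known})
```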
- chat_analyzer.analyzer.get_chatlog_downloader(url: str)
Gets a chat-downloader generator using xenova’s chat-downloader.
- Parameters
url (str) – The URL of the past stream/VOD to download the chat from
- Returns
The chatlog we have downloaded
- Return type
chat_downloader.sites.common.Chat
- chat_analyzer.analyzer.get_chatmsgs_from_chatfile(filepath: str)
Given a path to a chatfile (either directly from xenova’s downloader or produced by the --save-chatfile-output flag), extract the chat messages from the file into an array of chat messages.
- Parameters
filepath (str) – a path to a chatfile (either directly from xenova’s downloader or produced by the --save-chatfile-output flag)
- Returns
an array of chat messages
- Return type
list[chat_downloader.sites.common.ChatMessage]
- chat_analyzer.analyzer.output_json_to_file(json_obj, filepath)
- chat_analyzer.analyzer.run(**kwargs)
Runs the chat-analyzer
- Returns
The chat analytics data as a dataclass
- Return type
dataformat.ChatAnalytics (dataformat.YoutubeChatAnalytics or dataformat.TwitchChatAnalytics)
chat_analyzer.cli module
- class chat_analyzer.cli.SmartFormatter(prog, indent_increment=2, max_help_position=24, width=None)
Bases:
ArgumentDefaultsHelpFormatter
Any help string starting with ‘R|’ has its newlines (\n) preserved, in addition to keeping the functionality from the HelpFormatter (displaying defaults next to descriptions).
Adapted from and thanks to: https://stackoverflow.com/questions/3853722/how-to-insert-newlines-on-argparse-help-text
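The general technique from the linked answer can be sketched as below. This is not the package's own class: `NewlineFormatter` is a hypothetical name, and the sketch relies on overriding `_split_lines`, an undocumented argparse hook, so it may need adjusting across Python versions.

```python
import argparse


class NewlineFormatter(argparse.ArgumentDefaultsHelpFormatter):
    """Keeps explicit newlines in help strings that start with 'R|'."""

    def _split_lines(self, text, width):
        if text.startswith("R|"):
            # Drop the marker and split on the author's own line breaks.
            return text[2:].splitlines()
        # Everything else is wrapped the usual way.
        return super()._split_lines(text, width)


parser = argparse.ArgumentParser(prog="demo", formatter_class=NewlineFormatter)
parser.add_argument("--mode", default="fast", help="R|first line\nsecond line")
help_text = parser.format_help()
```

Because the class still derives from `ArgumentDefaultsHelpFormatter`, the default value is appended to the help string before it is split, so it ends up on the last preserved line.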
- chat_analyzer.cli.check_interval(interval)
Based on the MAX_INTERVAL and MIN_INTERVAL from chat_analyzer.py, ensure that the entered interval respects that boundary
- chat_analyzer.cli.check_percentile_float(value)
Check that the value is between 0 and 100 exclusive
- chat_analyzer.cli.check_positive_int(value)
Check that the value is a positive integer
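Validators like the ones above are commonly written as argparse `type=` callables. The sketch below shows that pattern under stated assumptions: the names `percentile_float` and `positive_int` are hypothetical, and raising `argparse.ArgumentTypeError` is a conventional choice rather than something confirmed for this package.

```python
import argparse


def percentile_float(value: str) -> float:
    """Accept a float strictly between 0 and 100."""
    try:
        number = float(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not a number")
    if not 0 < number < 100:
        raise argparse.ArgumentTypeError(
            f"{value!r} must be between 0 and 100 (exclusive)"
        )
    return number


def positive_int(value: str) -> int:
    """Accept an integer greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} must be positive")
    return number
```

Passing such a function as `type=percentile_float` to `add_argument` lets argparse report the error message and exit cleanly on bad input.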
- chat_analyzer.cli.main()
chat_analyzer.dataformat module
- class chat_analyzer.dataformat.ChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None)
Bases:
ABC
Class that contains the results of the chat data analysis/processing.
An instance of a subclass is created and then modified throughout the analysis process. After the processing of the data is complete, the object will contain all relevant results we are looking for.
This class cannot be directly instantiated; see the subclasses YoutubeChatAnalytics & TwitchChatAnalytics. YT and Twitch chats report/record data differently and contain site-specific events, so we centralize common data/functionality and separate specifics into subclasses.
The object can then be converted to JSON/printed/manipulated as desired to format/output the results as necessary.
—
[Defined when class Initialized]:
- duration: float
The total duration (in seconds) of the associated video/media. Message times correspond to the video times.
- interval: int
The time interval (in seconds) at which to compress datapoints into samples, i.e. the duration of each sample. The smaller the interval, the more granular the analytics are. At interval=5, each sample contains 5 seconds of cumulative data, with the exception of the last sample, which may be shorter than the interval because the media duration is not necessarily divisible by the interval. The number of samples is about (video duration/interval), plus 1 if necessary to encompass the remaining non-divisible data at the end.
- description: str
A description included to help distinguish it from other analytical data.
- program_version: str
The version of the chat analytics program that was used to generate the data. Helps identify outdated/version-specific data formats.
- platform: str
Used to store the platform the data came from: ‘www.youtube.com’, ‘www.twitch.tv’, ‘youtu.be’… While it technically can be determined by the type of subclass, this makes for easier conversion to JSON/output
[Automatically re-defined on post-init]:
- duration_text: str
String representation of the media duration time.
- interval_text: str
String representation of the interval time.
[Defined w/ default and modified DURING analysis]:
- mediaTitle: str
The title of the media associated with the chatlog.
- mediaSource: str
The link to the media associated with the chatlog (URL that it was originally downloaded from or filepath of a chatfile).
- samples: List[Sample]
An array of sequential samples, each corresponding to data about a section of chat ‘interval’ seconds long. Each sample has specific data corresponding to a time interval of the video. See the ‘Sample’ class.
- totalActivity: int
The total number of messages/things (of any type!) that appeared in chat (sum of the activity from all samples). Includes messages, notifications, subscriptions, superchats, … anything that appeared in chat.
- totalChatMessages: int
The total number of chats sent by human (non-system) users (what is traditionally thought of as a chat) NOTE: Difficult to discern bots from humans other than just creating a known list of popular bots and blacklisting, because not all sites (YT/Twitch) provide information on whether chat was sent by a registered bot or not.
- highlight_percentile: float
The cutoff percentile that samples must meet to be considered a highlight
- highlight_metric: str
The metric to use for engagement analysis to build highlights. NOTE: must be converted into actual Sample field name before use.
[Defined w/ default and modified AFTER analysis]:
- totalUniqueUsers: int
The total number of unique users that sent a chat message (human users that sent at least one traditional chat)
- overallAvgActivityPerSecond: float
The average activity per second across the whole chatlog. (totalActivity/duration)
- overallAvgChatMessagesPerSecond: float
The average number of chat messages per second across the whole chatlog. (totalChatMessages/duration)
- overallAvgUniqueUsersPerSecond: float
The average number of unique users chatting per second.
- highlights: List[Highlight]
A list of the high engagement sections of the chatlog.
- highlights_duration: float
The cumulative duration of the highlights (in seconds)
- highlights_duration_text: str
The cumulative duration of the highlights represented in text format (i.e. hh:mm:ss)
- spikes: List[Spike]
Not yet implemented (TODO). A list of the calculated spikes in the chatlog. May contain spikes of different types, identifiable by the spike’s type field.
- chatlog_post_process(settings: ProcessSettings)
After we have finished iterating through the chatlog and constructing all of the samples, we call chatlog_post_process() to process the cumulative data points (so we don’t have to do this every time we add a sample).
This step is sometimes referred to as “analysis”.
Also removes the internal fields that don’t need to be output in the JSON object.
- Parameters
settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor
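One way to keep internal fields out of the JSON output is sketched below. It assumes, for illustration, that the internal fields are the underscore-prefixed ones (such as `_userChats` in the signatures); `MiniSample` and `to_public_json` are hypothetical names, and the package itself may remove the fields differently.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MiniSample:
    """Hypothetical stand-in for a sample with one internal field."""
    startTime: float
    endTime: float
    chatMessages: int = 0
    _userChats: dict = field(default_factory=dict)


def to_public_json(obj) -> str:
    """Serialize a dataclass, skipping top-level underscore-prefixed fields."""
    public = {k: v for k, v in asdict(obj).items() if not k.startswith("_")}
    return json.dumps(public)
```

Note that this filter only looks at top-level keys; nested dataclasses would need the same treatment applied recursively.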
- create_new_sample()
Post-processes the previous sample, then appends & creates a new sample following the previous sample sequentially. If a previous sample doesn’t exist, creates the first sample.
NOTE: If there are only 2 chats, one at time 0:03 and the other at 5:09:12, there are still a lot of empty samples in between (because we still want to graph/track the silence times with temporal stability).
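The effect described in the note can be illustrated with a small batch sketch. The real class builds samples incrementally while iterating the chatlog; the function `bin_message_times` below is a hypothetical simplification that only counts messages per window.

```python
import math
from typing import List


def bin_message_times(times: List[float], duration: float, interval: int) -> List[int]:
    """Count messages per sequential sample window [start, start + interval)."""
    n_samples = max(1, math.ceil(duration / interval))
    counts = [0] * n_samples
    for t in times:
        # Clamp so a message exactly at the end lands in the last sample.
        index = min(int(t // interval), n_samples - 1)
        counts[index] += 1
    return counts


# Two chats, at 0:03 (3 s) and 5:09:12 (18552 s), in a 5:10:00 video.
counts = bin_message_times([3, 18552], duration=18600, interval=5)
```

Here `counts` has one entry per 5-second window, and all but two of them are zero, which is what keeps the time axis evenly spaced when graphing.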
- description: str
- duration: float
- duration_text: str = ''
- get_highlights(highlight_metric: str, highlight_percentile: float)
Highlights reference a contiguous period of time where the provided metric remains above the percentile threshold. Finds and returns a list of highlights, each referencing the start and end times of a contiguous run of samples whose highlight_metric falls within the highlight_percentile.
A highlight may reference more than one sample if contiguous samples meet the percentile cutoff.
Samples in the top ‘percentile’% of the selected engagement metric will be considered high-engagement samples and included in the highlights output list. The larger the percentile, the greater the metric requirement before being reported. If ‘engagement-percentile’=93.0, any sample in the 93rd percentile (top 7.0%) of the selected metric will be considered an engagement highlight.
These high-engagement portions of the chatlog are stored as highlights, and may last for multiple samples.
This method should only be called after the averages have been calculated, ensuring accurate results when determining periods of high engagement.
- Parameters
highlight_metric – The metric samples are compared to determine if they are high-engagement samples. NOTE: Internally converted to the actual field name of a sample field.
highlight_percentile – The cutoff percentile that the samples must meet to be included in a highlight
- Returns
a list of highlights referencing samples that met the percentile cutoff requirements for the provided metric
- Return type
List[Highlight]
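The idea of merging contiguous above-threshold samples into highlights can be sketched as follows. This is a minimal illustration, not the package's implementation: `MetricSample`, `MiniHighlight`, `percentile_cutoff`, and `find_highlights` are hypothetical names, the percentile is computed with simple linear interpolation, and treating a sample that exactly equals the cutoff as qualifying is an assumption.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MetricSample:
    startTime: float
    endTime: float
    metric: float  # value of the chosen engagement metric


@dataclass
class MiniHighlight:
    startTime: float
    endTime: float
    peak: float
    avg: float


def percentile_cutoff(values: List[float], percentile: float) -> float:
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (percentile / 100) * (len(ordered) - 1)
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def find_highlights(samples: List[MetricSample], percentile: float) -> List[MiniHighlight]:
    cutoff = percentile_cutoff([s.metric for s in samples], percentile)
    highlights: List[MiniHighlight] = []
    run: List[MetricSample] = []
    for sample in samples + [None]:  # trailing None flushes the final run
        if sample is not None and sample.metric >= cutoff:
            run.append(sample)
        elif run:
            values = [s.metric for s in run]
            highlights.append(
                MiniHighlight(run[0].startTime, run[-1].endTime,
                              max(values), sum(values) / len(values))
            )
            run = []
    return highlights
```

Each highlight spans from the start of the first qualifying sample to the end of the last one in its run, and carries the peak and average of the metric over that run.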
- get_spikes(spike_sensitivity, spike_metric)
A spike is a point in the chatlog where from one sample to the next, there is a sharp increase in the provided metric.
Open question: are spikes sustained? Alternative definition: a spike is a point in the chatlog where the activity is significantly different from the average activity. Activity is significantly different if it is > avg*SPIKE_MULT_THRESHOLD. We detect a spike if the high activity level is maintained for at least SPIKE_SUSTAIN_REQUIREMENT samples.
- highlight_metric: str = ''
- highlight_percentile: float = 0
- highlights_duration: float = 0
- highlights_duration_text: str = ''
- interval: int
- interval_text: str = ''
- mediaSource: str = 'No Media Source'
- mediaTitle: str = 'No Media Title'
- overallAvgActivityPerSecond: float = 0
- overallAvgChatMessagesPerSecond: float = 0
- overallAvgUniqueUsersPerSecond: float = 0
- platform: str
- print_process_progress(msg, idx, finished=False)
Prints progress of the chat download/process to the console.
If finished is true, normal printing is skipped and the last bar of progress is printed. This is important because we print progress every UPDATE_PROGRESS_INTERVAL messages, and the total number of messages is not usually divisible by this. We therefore have to slightly change the approach to printing progress for this special case.
- process_chatlog(chatlog: Chat, source: str, settings: ProcessSettings)
Iterates through the whole chatlog and calculates the analytical data (Modifies and stores in a ChatAnalytics object).
- Parameters
chatlog (chat_downloader.sites.common.Chat) – The chatlog we have downloaded
source (str) – The source of the media associated w the chatlog. URL of the media we have downloaded the log from, or a filepath
settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor
- process_message(msg)
Given a msg object from chat, update appropriate statistics based on the chat
- program_version: str
- to_JSON()
- totalActivity: int = 0
- totalChatMessages: int = 0
- totalUniqueUsers: int = 0
- class chat_analyzer.dataformat.Highlight(startTime: float, endTime: float, description: str, type: str, peak: float, avg: float)
Bases:
Section
Highlights reference a contiguous period of time where the provided metric remains above the percentile threshold.
—
- type: str
The engagement metric. i.e. “avgActivityPerSecond”, “avgChatMessagesPerSecond”, “avgUniqueUsersPerSecond”, etc. NOTE: It is stored as its converted value (the name of the actual field), NOT the metric str the user provided in the CLI.
- peak: float
The maximum value of the engagement metric throughout the whole Highlight (among the samples in the Highlight).
- avg: float
The average value of the engagement metric throughout the whole Highlight (among the samples in the Highlight).
- avg: float
- peak: float
- type: str
- class chat_analyzer.dataformat.ProcessSettings(print_interval: int, msg_break: int, highlight_percentile: float, highlight_metric: str, spike_sensitivity: float)
Bases:
object
Utility class for passing information from the analyzer to the chatlog processor and post-processor
- print_interval: int
After every ‘print_interval’ messages, print a progress message. If <=0, progress printing is disabled.
- msg_break: int
(Mainly for debugging) Stop processing messages after msg_break messages have been processed.
- highlight_percentile: float
The cutoff percentile that samples must meet to be considered a highlight
- highlight_metric: str
The metric to use for engagement analysis to build highlights. NOTE: must be converted into actual Sample field name before use.
- spike_sensitivity: float
How sensitive the spike detector is at picking up spikes. Higher sensitivity means more spikes are detected.
- highlight_metric: str
- highlight_percentile: float
- msg_break: int
- print_interval: int
- spike_sensitivity: float
- class chat_analyzer.dataformat.Sample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>)
Bases:
object
Class that contains data of a specific time interval of the chat. Messages will be included in a sample if they are contained within [startTime, endTime)
—
[Defined when class Initialized]:
- startTime: float
The start time (inclusive) (in seconds) corresponding to a sample.
- endTime: float
The end time (exclusive) (in seconds) corresponding to a sample.
[Automatically Defined on init]:
- startTime_text: str
The start time represented in text format (i.e. hh:mm:ss)
- endTime_text: str
The end time represented in text format (i.e. hh:mm:ss)
- sampleDuration: float
The duration (in seconds) of the sample (end-start). NOTE: Equal to the selected interval for every sample except possibly the last, which is shorter when the total duration of the chat is not divisible by the interval.
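The relationship between the media duration, the interval, and the sample boundaries can be worked through with a short sketch. The helper `sample_windows` is hypothetical and only illustrates the arithmetic of the [startTime, endTime) windows.

```python
from typing import List, Tuple


def sample_windows(duration: float, interval: int) -> List[Tuple[float, float]]:
    """Return [start, end) windows covering the whole media duration."""
    windows = []
    start = 0.0
    while start < duration:
        # The final window is cut short at the end of the media.
        end = min(start + interval, duration)
        windows.append((start, end))
        start = end
    return windows


windows = sample_windows(23, 5)
```

For a 23-second video at interval=5 this gives four full 5-second windows and a final window from 20 to 23 seconds, whose duration is 3 rather than 5.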
[Defined w/ default and modified DURING analysis of sample]:
- activity: int
The total number of messages/things (of any type!) that appeared in chat within the start/endTime of this sample. Includes messages, notifications, subscriptions, superchats, … anything that appeared in chat.
- chatMessages: int
The total number of chats sent by human (non-system) users (what is traditionally thought of as a chat) NOTE: Difficult to discern bots from humans other than just creating a known list of popular bots and blacklisting, because not all sites (YT/Twitch) provide information on whether chat was sent by a registered bot or not.
- firstTimeChatters: int
The total number of users who sent their first message of the whole stream during this sample interval
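Counting first-time chatters alongside per-sample unique users can be sketched as below. The function `count_chatters` and its input shape, a time-ordered list of (time_in_seconds, author) pairs, are assumptions for illustration; the package tracks this state on its own objects.

```python
from typing import Dict, List, Set, Tuple


def count_chatters(messages: List[Tuple[float, str]],
                   interval: int) -> Dict[int, Tuple[int, int]]:
    """Map sample index -> (unique users, first-time chatters)."""
    seen_overall: Set[str] = set()
    per_sample: Dict[int, Set[str]] = {}
    first_timers: Dict[int, int] = {}
    for time, author in messages:
        index = int(time // interval)
        per_sample.setdefault(index, set()).add(author)
        if author not in seen_overall:
            # First message from this author in the whole chatlog.
            seen_overall.add(author)
            first_timers[index] = first_timers.get(index, 0) + 1
    return {i: (len(users), first_timers.get(i, 0))
            for i, users in per_sample.items()}
```

A user who chats in several samples counts as unique in each of them, but as a first-time chatter only in the sample containing their first message.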
[Defined w/ default and modified AFTER analysis of sample]:
- uniqueUsers: int
The total number of unique users that sent a chat message across this sample interval (len(self._userChats))
- avgActivityPerSecond: float
The average activity per second across this sample interval. (activity/sampleDuration)
- avgChatMessagesPerSecond: float
The average number of chat messages per second across this sample interval. (chatMessages/sampleDuration)
- avgUniqueUsersPerSecond: float
The average number of unique users that sent a chat across this sample interval. (uniqueUsers/sampleDuration)
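The per-second averages above are plain divisions by the sample's own duration, which matters for a shortened final sample. A small worked sketch, with the helper name `per_second` and the example counts chosen purely for illustration:

```python
def per_second(count: int, sample_duration: float) -> float:
    """Average per second over a sample; 0.0 if the sample has no length."""
    return count / sample_duration if sample_duration > 0 else 0.0


# Hypothetical counts for a final sample that lasts 3 s instead of 5 s.
activity, chat_messages, unique_users = 12, 9, 4
sample_duration = 3.0

averages = {
    "avgActivityPerSecond": per_second(activity, sample_duration),
    "avgChatMessagesPerSecond": per_second(chat_messages, sample_duration),
    "avgUniqueUsersPerSecond": per_second(unique_users, sample_duration),
}
```

Dividing by the sample's own duration rather than the configured interval keeps a short final sample from appearing less active than it really was.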
- activity: int = 0
- avgActivityPerSecond: float = 0
- avgChatMessagesPerSecond: float = 0
- avgUniqueUsersPerSecond: float = 0
- chatMessages: int = 0
- endTime: float
- endTime_text: str = ''
- firstTimeChatters: int = 0
- sampleDuration: float = -1
- sample_post_process()
After we have finished adding messages to a particular sample (moving on to the next sample), we call sample_post_process() to process the cumulative data points (so we don’t have to do this every time we add a message)
Also removes the internal fields that don’t need to be output in the JSON object.
- startTime: float
- startTime_text: str = ''
- uniqueUsers: int = 0
- class chat_analyzer.dataformat.Section(startTime: float, endTime: float, description: str)
Bases:
object
Contains generic information about a notable section of the chatlog
—
[Defined when class Initialized]:
- startTime: float
The start time (inclusive) (in seconds) corresponding to a section.
- endTime: float
The end time (exclusive) (in seconds) corresponding to a section.
- description: str (optional)
A description of the section (if any).
[Automatically re-defined on post-init]:
- duration: float
The duration (in seconds) of the section (end-start)
- duration_text: str
The duration represented in text format (i.e. hh:mm:ss)
- startTime_text: str
The start time represented in text format (i.e. hh:mm:ss)
- endTime_text: str
The end time represented in text format (i.e. hh:mm:ss)
- description: str
- duration: float = 0.0
- duration_text: str = ''
- endTime: float
- endTime_text: str = ''
- startTime: float
- startTime_text: str = ''
- class chat_analyzer.dataformat.Spike(startTime: float, endTime: float, description: str)
Bases:
Section
Contains information about an activity spike in the chatlog
TODO: Implement
- description: str
- endTime: float
- startTime: float
- class chat_analyzer.dataformat.TwitchChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None, totalSubscriptions: int = 0, totalGiftSubscriptions: int = 0, totalUpgradeSubscriptions: int = 0)
Bases:
ChatAnalytics
Extension of the ChatAnalytics class, meant to contain data that all chats have and data specific to Twitch chats.
NOTE: For most Twitch-specific attributes, continuously reporting a per-second value doesn’t make much sense, so we don’t!
—
(See ChatAnalytics class for common fields)
[Defined w/ default and modified DURING analysis]:
- totalSubscriptions: int
The total number of subscriptions that appeared in the chat (which people purchased themselves).
- totalGiftSubscriptions: int
The total number of gift subscriptions that appeared in the chat.
- totalUpgradeSubscriptions: int
The total number of upgraded subscriptions that appeared in the chat.
- chatlog_post_process(settings)
After we have finished iterating through the chatlog and constructing all of the samples, we call chatlog_post_process() to process the cumulative data points (so we don’t have to do this every time we add a sample).
This step is sometimes referred to as “analysis”.
Also removes the internal fields that don’t need to be output in the JSON object.
- Parameters
settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor
- process_message(msg)
Given a msg object from chat, update common fields and twitch-specific fields
- to_JSON()
- totalGiftSubscriptions: int = 0
- totalSubscriptions: int = 0
- totalUpgradeSubscriptions: int = 0
- class chat_analyzer.dataformat.TwitchSample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>, subscriptions: int = 0, giftSubscriptions: int = 0, upgradeSubscriptions: int = 0)
Bases:
Sample
Class that contains Twitch-specific data for a specific time interval of the chat.
—
[Defined w/ default and modified DURING analysis of sample]:
- subscriptions: int
The total number of subscriptions (that people purchased themselves) that appeared in chat within the start/endTime of this sample.
- giftSubscriptions: int
The total number of gift subscriptions that appeared in chat within the start/endTime of this sample.
- upgradeSubscriptions: int
The total number of upgraded subscriptions that appeared in chat within the start/endTime of this sample.
- giftSubscriptions: int = 0
- subscriptions: int = 0
- upgradeSubscriptions: int = 0
- class chat_analyzer.dataformat.YoutubeChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None, totalSuperchats: int = 0, totalMemberships: int = 0)
Bases:
ChatAnalytics
Extension of the ChatAnalytics class, meant to contain data that all chats have and data specific to YouTube chats.
NOTE: For most YouTube-specific attributes, continuously reporting a per-second value doesn’t make much sense, so we don’t!
—
(See ChatAnalytics class for common fields and descriptions)
[Defined w/ default and modified DURING analysis]:
- totalSuperchats: int
The total number of superchats (regular/ticker) that appeared in the chat. NOTE: A creator doesn’t necessarily care what form a superchat takes, so we just combine regular and ticker superchats
- totalMemberships: int
The total number of memberships that appeared in the chat.
- process_message(msg)
Given a msg object from chat, update common fields and youtube-specific fields
- to_JSON()
- totalMemberships: int = 0
- totalSuperchats: int = 0
- class chat_analyzer.dataformat.YoutubeSample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>, superchats: int = 0, memberships: int = 0)
Bases:
Sample
Class that contains YouTube-specific data for a specific time interval of the chat.
—
[Defined w/ default and modified DURING analysis of sample]:
- superchats: int
The total number of superchats (regular/ticker) that appeared in chat within the start/endTime of this sample. NOTE: A creator doesn’t necessarily care what form a superchat takes, so we just combine regular and ticker superchats
- memberships: int
The total number of memberships that appeared in chat within the start/endTime of this sample.
- memberships: int = 0
- superchats: int = 0
chat_analyzer.metadata module
Set metadata for chat-analyzer
chat_analyzer.util module
- chat_analyzer.util.dprint(should_print: bool, s: str)
Simple styled debug printer
- chat_analyzer.util.remove_non_alpha_numeric(s: str) → str
Removes non-alphanumeric characters from a string and replaces all spaces with underscores. (Useful for normalizing the title of a video before turning it into a filename)
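A minimal sketch of this kind of normalization is shown below. The function name `normalize_title` is hypothetical, and the order of operations (spaces become underscores first, then everything that is not a letter, digit, or underscore is dropped) is an assumption; the package's function may differ in detail.

```python
import re


def normalize_title(title: str) -> str:
    """Turn a media title into a filename-friendly string."""
    underscored = title.replace(" ", "_")
    # Drop anything that is not an ASCII letter, digit, or underscore.
    return re.sub(r"[^A-Za-z0-9_]", "", underscored)


filename_stem = normalize_title("My Stream: Part 2!")
```

With this ordering, the example title becomes `My_Stream_Part_2`.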
Module contents
Top-level package for chat-analyzer.