chat_analyzer package

Submodules

chat_analyzer.analyzer module

chat_analyzer.analyzer.check_chatlog_downloader_supported(chatlog: Chat, url: str)

Ensures that the downloaded chatlog can be analyzed properly by checking that its metadata matches the expected values. (Our own logic depends on these assumptions holding.)

If any check fails, we exit (a failed check is considered a fatal error).

Parameters

chatlog (chat_downloader.sites.common.Chat) – The chatlog we have downloaded
url (str) – The URL of the video we have downloaded the log from

Currently we ensure:

The video/stream happened in the past (not currently live or scheduled)
The chat is downloaded from a supported platform that we have a proper way of parsing.
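
The two checks listed above can be sketched as follows. This is only an illustration under stated assumptions: ChatMeta, its status and domain fields, SUPPORTED_PLATFORMS, find_compliance_problems and check_supported are hypothetical stand-ins, not the actual attributes of chat_downloader's Chat object or the package's real logic.

```python
import sys
from dataclasses import dataclass

# Hypothetical set of platforms; the real list is defined by the package.
SUPPORTED_PLATFORMS = {"www.youtube.com", "youtu.be", "www.twitch.tv"}


@dataclass
class ChatMeta:
    """Hypothetical stand-in for the metadata of a downloaded chatlog."""
    status: str  # assumed values: "past", "live", "upcoming"
    domain: str  # assumed to hold the site the chat came from


def find_compliance_problems(meta: ChatMeta) -> list:
    """Return a list of human-readable problems (empty if the chatlog is usable)."""
    problems = []
    if meta.status != "past":
        problems.append(f"stream status is '{meta.status}', expected 'past'")
    if meta.domain not in SUPPORTED_PLATFORMS:
        problems.append(f"platform '{meta.domain}' is not supported")
    return problems


def check_supported(meta: ChatMeta, url: str) -> None:
    """Exit the program if any check fails, treating it as a fatal error."""
    problems = find_compliance_problems(meta)
    if problems:
        sys.exit(f"Cannot analyze {url}: " + "; ".join(problems))
```

Separating the checks from the exit call keeps the validation testable without terminating the interpreter.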

chat_analyzer.analyzer.get_ChatAnalytics_from_file(filepath: str)

Given a path to a previous output file of this program containing analytical data, extract the data into a ChatAnalytics object and return it.

Parameters: filepath (str) – a path to a previous output file of this program containing analytical data
Returns: a ChatAnalytics object
Return type: ChatAnalytics

chat_analyzer.analyzer.get_chatlog_downloader(url: str)

Gets a chat-downloader generator using Xenova’s chat-downloader

Parameters

url (str) – The URL of the past stream/VOD to download the chat from

Returns

The chatlog we have downloaded

Return type

chat_downloader.sites.common.Chat

chat_analyzer.analyzer.get_chatmsgs_from_chatfile(filepath: str)

Given a path to a chatfile (either directly from Xenova’s downloader or produced by the --save-chatfile-output flag), extract the chat messages from the file into an array of chat messages.

Parameters: filepath (str) – a path to a chatfile (either directly from Xenova’s downloader or produced by the --save-chatfile-output flag)
Returns: an array of chat messages
Return type: list[chat_downloader.sites.common.ChatMessage]

chat_analyzer.analyzer.output_json_to_file(json_obj, filepath)

chat_analyzer.analyzer.run(**kwargs)

Runs the chat-analyzer

Returns: The chat analytics data as a dataclass
Return type: dataformat.ChatAnalytics (dataformat.YoutubeChatAnalytics or dataformat.TwitchChatAnalytics)

chat_analyzer.cli module

class chat_analyzer.cli.SmartFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: ArgumentDefaultsHelpFormatter

Any help string starting with ‘R|’ has its newlines (\n) preserved, in addition to keeping the functionality of the HelpFormatter (displaying defaults next to descriptions).

Adapted from and thanks to: https://stackoverflow.com/questions/3853722/how-to-insert-newlines-on-argparse-help-text
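
The behaviour described above can be sketched with the recipe from the linked Stack Overflow thread. This is a minimal illustration, not the package's actual class: NewlineFormatter, the demo parser and the --mode option are hypothetical, and the sketch relies on argparse's private _split_lines hook.

```python
import argparse


class NewlineFormatter(argparse.ArgumentDefaultsHelpFormatter):
    """Illustrative formatter: help strings starting with 'R|' keep their newlines."""

    def _split_lines(self, text, width):
        if text.startswith("R|"):
            # Drop the marker and honour the author's own line breaks.
            return text[2:].splitlines()
        # Otherwise fall back to the default wrapping behaviour.
        return super()._split_lines(text, width)


parser = argparse.ArgumentParser(prog="demo", formatter_class=NewlineFormatter)
parser.add_argument(
    "--mode",
    default="fast",
    help="R|fast: quick pass\nfull: complete pass",
)
help_text = parser.format_help()
```

Because the class derives from ArgumentDefaultsHelpFormatter, the default value is still appended to the help string before it is split into lines.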

chat_analyzer.cli.check_interval(interval): Based on the MAX_INTERVAL and MIN_INTERVAL from chat_analyzer.py, ensure that the entered interval respects that boundary

chat_analyzer.cli.check_percentile_float(value): Check that the value is between 0 and 100 exclusive

chat_analyzer.cli.check_positive_int(value): Check that the value is a positive integer
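
Validators like these are typically passed to argparse through the type= argument. The sketch below shows one way the positive-integer and percentile checks could look; positive_int, percentile_float and the demo options are hypothetical names, and the error messages are assumptions rather than the package's own.

```python
import argparse


def positive_int(value: str) -> int:
    """Accept only integers greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not a positive integer")
    return number


def percentile_float(value: str) -> float:
    """Accept only floats strictly between 0 and 100."""
    try:
        number = float(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not a number")
    if not 0 < number < 100:
        raise argparse.ArgumentTypeError(f"{value!r} is not between 0 and 100 (exclusive)")
    return number


# Hypothetical options, only to show how the validators plug into argparse.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--count", type=positive_int, default=1)
parser.add_argument("--percentile", type=percentile_float, default=93.0)
args = parser.parse_args(["--count", "5", "--percentile", "90"])
```

Raising ArgumentTypeError lets argparse report the problem as a normal usage error instead of a traceback.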

chat_analyzer.cli.main()

chat_analyzer.dataformat module

class chat_analyzer.dataformat.ChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None)

Bases: ABC

Class that contains the results of the chat data analysis/processing.

An instance of a subclass is created and then modified throughout the analysis process. After the processing of the data is complete, the object will contain all relevant results we are looking for.

This class cannot be directly instantiated; see the subclasses YoutubeChatAnalytics & TwitchChatAnalytics. YT and Twitch chats report/record data differently and contain site-specific events, so we centralize common data/functionality and separate specifics into subclasses.

The object can then be converted to JSON/printed/manipulated as desired to format/output the results as necessary.

—

[Defined when class Initialized]:

duration: float: The total duration (in seconds) of the associated video/media. Message times correspond to the video times.
interval: int: The time interval (in seconds) at which to compress datapoints into samples, i.e. the duration of each sample. The smaller the interval, the more granular the analytics are. At interval=5, each sample contains 5 seconds of cumulative data, with the exception of the last sample, which may be shorter than the interval because the media duration is not necessarily divisible by the interval. The number of samples in raw_data is about (video duration/interval), +1 if necessary to encompass the remaining non-divisible data at the end.
description: str: A description included to help distinguish it from other analytical data.
program_version: str: The version of the chat analytics program that was used to generate the data. Helps identify outdated/version-specific data formats.
platform: str: Used to store the platform the data came from: ‘www.youtube.com’, ‘www.twitch.tv’, ‘youtu.be’… While it technically can be determined by the type of subclass, this makes for easier conversion to JSON/output
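
The relationship between duration, interval and the number of samples described above can be worked through with a small sketch. sample_boundaries is a hypothetical helper, not part of the package; it assumes samples start at time 0 and that the last sample is simply cut off at the media duration.

```python
import math


def sample_boundaries(duration: float, interval: int):
    """Return (startTime, endTime) pairs covering [0, duration) in steps of interval."""
    count = math.ceil(duration / interval)
    return [
        (i * interval, min((i + 1) * interval, duration))
        for i in range(count)
    ]


# A 23-second video at interval=5 needs 5 samples; the last one is only 3 seconds long.
boundaries = sample_boundaries(23, 5)
```

When the duration is an exact multiple of the interval, no extra sample is needed: a 20-second video at interval=5 yields exactly 4 samples.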

[Automatically re-defined on post-init]:

duration_text: str: String representation of the media duration time.
interval_text: str: String representation of the interval time.

[Defined w/ default and modified DURING analysis]:

mediaTitle: str: The title of the media associated with the chatlog.
mediaSource: str: The link to the media associated with the chatlog (url that it was originally downloaded from or filepath of a chatfile).
samples: List[Sample]: An array of sequential samples, each corresponding to data about a section of chat ‘interval’ seconds long. Each sample has specific data corresponding to a time interval of the video. See the ‘Sample’ class.
totalActivity: int: The total number of messages/things (of any type!) that appeared in chat. (Sum of intervalActivity from all samples.) Includes messages, notifications, subscriptions, superchats, … anything that appeared in chat.
totalChatMessages: int: The total number of chats sent by human (non-system) users (what is traditionally thought of as a chat). NOTE: It is difficult to discern bots from humans other than by keeping a blacklist of known popular bots, because not all sites (YT/Twitch) provide information on whether a chat was sent by a registered bot or not.
highlight_percentile: float: The cutoff percentile that samples must meet to be considered a highlight
highlight_metric: str: The metric to use for engagement analysis to build highlights. NOTE: must be converted into actual Sample field name before use.

[Defined w/ default and modified AFTER analysis]:

totalUniqueUsers: int: The total number of unique users that sent a chat message (human users that sent at least one traditional chat)
overallAvgActivityPerSecond: float: The average activity per second across the whole chatlog. (totalActivity/totalDuration)
overallAvgChatMessagesPerSecond: float: The average number of chat messages per second across the whole chatlog. (totalChatMessages/totalDuration)
overallAvgUniqueUsersPerSecond: float: The average number of unique users chatting per second.
highlights: List[Highlight]: A list of the high engagement sections of the chatlog.
highlights_duration: float: The cumulative duration of the highlights (in seconds)
highlights_duration_text: str: The cumulative duration of the highlights represented in text format (i.e. hh:mm:ss)
spikes: List[Spike]: (Not yet implemented, TODO) A list of the calculated spikes in the chatlog. May contain spikes of different types, identifiable by the spike’s type field.

chatlog_post_process(settings: ProcessSettings)

After we have finished iterating through the chatlog and constructing all of the samples, we call chatlog_post_process() to process the cumulative data points (so we don’t have to do this every time we add a sample).

This step is sometimes referred to as “analysis”.

Also removes the internal fields that don’t need to be output in the JSON object.

Parameters: settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor

create_new_sample()

Post-processes the previous sample, then appends & creates a new sample following the previous sample sequentially. If a previous sample doesn’t exist, creates the first sample.

NOTE: If there are only 2 chats, one at time 0:03 and the other at 5:09:12, there are still a lot of empty samples in between (because we still want to graph/track the silence times with temporal stability).
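
The effect of the note above can be illustrated with a simplified sketch that buckets message timestamps into fixed-width samples. activity_per_sample is a hypothetical function; the real class builds samples one after another while streaming through the chatlog, whereas this sketch allocates them all up front.

```python
import math


def activity_per_sample(timestamps, duration, interval):
    """Count messages per sample; samples with no messages stay at 0."""
    counts = [0] * math.ceil(duration / interval)
    for t in timestamps:
        if 0 <= t < duration:
            counts[int(t // interval)] += 1
    return counts


# Two messages far apart still produce every (empty) sample in between.
counts = activity_per_sample([3, 62], duration=70, interval=10)
```

Keeping the empty samples means the index of a sample always maps directly to a position on the video timeline.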

description: str

duration: float

duration_text: str = ''

get_highlights(highlight_metric: str, highlight_percentile: float)

Highlights reference a contiguous period of time where the provided metric remains above the percentile threshold. Finds and returns a list of highlights, each referencing the start and end times of a contiguous run of samples whose highlight_metric meets the highlight_percentile cutoff.

A highlight may reference more than one sample if contiguous samples meet the percentile cutoff.

Samples in the top ‘percentile’% of the selected engagement metric will be considered high-engagement samples and included in the highlights output list. The larger the percentile, the greater the metric requirement before being reported. If ‘engagement-percentile’=93.0, any sample in the 93rd percentile (top 7.0%) of the selected metric will be considered an engagement highlight.

These high-engagement portions of the chatlog are stored as highlights, and may last for multiple samples.

This method should only be called after the averages have been calculated, ensuring accurate results when determining periods of high engagement.

Parameters

highlight_metric – The metric by which samples are compared to determine whether they are high-engagement samples. NOTE: Internally converted to the actual field name of a sample field.
highlight_percentile – The cutoff percentile that the samples must meet to be included in a highlight

Returns

a list of highlights referencing samples that met the percentile cutoff requirements for the provided metric

Return type

List[Highlight]
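
The selection and merging described above can be sketched as follows. This is an illustration only: MetricSample, percentile_cutoff and find_highlights are hypothetical, the nearest-rank percentile and the "greater than or equal to the cutoff" comparison are assumptions, and the package may compute both differently.

```python
import math
from dataclasses import dataclass


@dataclass
class MetricSample:
    """Hypothetical reduced sample: a time range plus the selected metric value."""
    startTime: float
    endTime: float
    value: float


def percentile_cutoff(values, percentile):
    """Nearest-rank percentile (assumed method)."""
    ordered = sorted(values)
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def find_highlights(samples, percentile):
    """Return (start, end) pairs for contiguous runs of samples meeting the cutoff."""
    if not samples:
        return []
    cutoff = percentile_cutoff([s.value for s in samples], percentile)
    highlights = []
    current = None  # [start, end] of the run being built, if any
    for s in samples:
        if s.value >= cutoff:
            if current is None:
                current = [s.startTime, s.endTime]
            else:
                current[1] = s.endTime  # extend the run over this sample
        elif current is not None:
            highlights.append(tuple(current))
            current = None
    if current is not None:
        highlights.append(tuple(current))
    return highlights


values = [1, 1, 9, 8, 1, 1, 1, 1, 1, 7]
samples = [MetricSample(i * 5, (i + 1) * 5, v) for i, v in enumerate(values)]
highlights = find_highlights(samples, percentile=75)
```

In this example the samples with values 9 and 8 are adjacent, so they merge into one highlight, while the final sample forms a highlight of its own.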

get_spikes(spike_sensitivity, spike_metric)

A spike is a point in the chatlog where from one sample to the next, there is a sharp increase in the provided metric.

Open question (unresolved): are spikes sustained or not? Alternative definition: a spike is a point in the chatlog where the activity is significantly different from the average activity. Activity is significantly different if it is > avg*SPIKE_MULT_THRESHOLD. We detect a spike if the high activity level is maintained for at least SPIKE_SUSTAIN_REQUIREMENT samples.

highlight_metric: str = ''

highlight_percentile: float = 0

highlights: List[Highlight]

highlights_duration: float = 0

highlights_duration_text: str = ''

interval: int

interval_text: str = ''

mediaSource: str = 'No Media Source'

mediaTitle: str = 'No Media Title'

overallAvgActivityPerSecond: float = 0

overallAvgChatMessagesPerSecond: float = 0

overallAvgUniqueUsersPerSecond: float = 0

platform: str

print_process_progress(msg, idx, finished=False)

Prints progress of the chat download/process to the console.

If finished is true, normal printing is skipped and the last bar of progress is printed. This is important because we print progress every UPDATE_PROGRESS_INTERVAL messages, and the total number of messages is not usually divisible by this. We therefore have to slightly change the approach to printing progress for this special case.
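
The special case for the final progress line can be sketched as below. progress_message is a hypothetical helper that returns text instead of printing, the value of UPDATE_PROGRESS_INTERVAL is assumed, and the message wording is invented for illustration.

```python
UPDATE_PROGRESS_INTERVAL = 1000  # assumed value; the real constant is defined by the package


def progress_message(idx: int, finished: bool = False):
    """Return the progress text for message number idx, or None if nothing should be printed."""
    if finished:
        # The total is rarely a multiple of the interval, so the last line is forced.
        return f"Processed {idx} messages in total"
    if idx > 0 and idx % UPDATE_PROGRESS_INTERVAL == 0:
        return f"Processed {idx} messages..."
    return None
```

Without the finished branch, a chatlog of 2537 messages would stop reporting at 2000 and never show the true total.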

process_chatlog(chatlog: Chat, source: str, settings: ProcessSettings)

Iterates through the whole chatlog and calculates the analytical data (Modifies and stores in a ChatAnalytics object).

Parameters

chatlog (chat_downloader.sites.common.Chat) – The chatlog we have downloaded
source (str) – The source of the media associated with the chatlog: the URL of the media we have downloaded the log from, or a filepath
settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor

process_message(msg): Given a msg object from chat, update appropriate statistics based on the chat

program_version: str

samples: List[Sample]

spikes: List[Spike]

to_JSON()

totalActivity: int = 0

totalChatMessages: int = 0

totalUniqueUsers: int = 0

class chat_analyzer.dataformat.Highlight(startTime: float, endTime: float, description: str, type: str, peak: float, avg: float)

Bases: Section

Highlights reference a contiguous period of time where the provided metric remains above the percentile threshold.

—

type: str: The engagement metric. i.e. “avgActivityPerSecond”, “avgChatMessagesPerSecond”, “avgUniqueUsersPerSecond”, etc. NOTE: It is stored as its converted value (the name of the actual field), NOT the metric str the user provided in the CLI.
peak: float: The maximum value of the engagement metric throughout the whole Highlight (among the samples in the Highlight).
avg: float: The average value of the engagement metric throughout the whole Highlight (among the samples in the Highlight).

avg: float

peak: float

type: str

class chat_analyzer.dataformat.ProcessSettings(print_interval: int, msg_break: int, highlight_percentile: float, highlight_metric: str, spike_sensitivity: float)

Bases: object

Utility class for passing information from the analyzer to the chatlog processor and post-processor

print_interval: int: After every ‘print_interval’ messages, print a progress message. If <=0, progress printing is disabled
msg_break: int: (Mainly for Debug) Stop processing messages after BREAK number of messages have been processed.
highlight_percentile: float: The cutoff percentile that samples must meet to be considered a highlight
highlight_metric: str: The metric to use for engagement analysis to build highlights. NOTE: must be converted into actual Sample field name before use.
spike_sensitivity: float: How sensitive the spike detector is at picking up spikes. Higher sensitivity means more spikes are detected.

highlight_metric: str

highlight_percentile: float

msg_break: int

print_interval: int

spike_sensitivity: float

class chat_analyzer.dataformat.Sample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>)

Bases: object

Class that contains data of a specific time interval of the chat. Messages will be included in a sample if they are contained within [startTime, endTime)

—

[Defined when class Initialized]:

startTime: float: The start time (inclusive) (in seconds) corresponding to a sample.
endTime: float: The end time (exclusive) (in seconds) corresponding to a sample.

[Automatically Defined on init]:

startTime_text: str: The start time represented in text format (i.e. hh:mm:ss)
endTime_text: str: The end time represented in text format (i.e. hh:mm:ss)
sampleDuration: float: The duration (in seconds) of the sample (end-start). NOTE: Should equal the selected interval for every sample except possibly the last, when the total duration of the chat is not divisible by the interval.

[Defined w/ default and modified DURING analysis of sample]:

activity: int: The total number of messages/things (of any type!) that appeared in chat within the start/endTime of this sample. Includes messages, notifications, subscriptions, superchats, … anything that appeared in chat.
chatMessages: int: The total number of chats sent by human (non-system) users (what is traditionally thought of as a chat) NOTE: Difficult to discern bots from humans other than just creating a known list of popular bots and blacklisting, because not all sites (YT/Twitch) provide information on whether chat was sent by a registered bot or not.
firstTimeChatters: int: The total number of users who sent their first message of the whole stream during this sample interval

[Defined w/ default and modified AFTER analysis of sample]:

uniqueUsers: int: The total number of unique users that sent a chat message across this sample interval (len(self._userChats))
avgActivityPerSecond: float: The average activity per second across this sample interval. (activity/sampleDuration)
avgChatMessagesPerSecond: float: The average number of chat messages per second across this sample interval. (chatMessages/sampleDuration)
avgUniqueUsersPerSecond: float: The average number of unique users that sent a chat across this sample interval. (uniqueUsers/sampleDuration)

activity: int = 0

avgActivityPerSecond: float = 0

avgChatMessagesPerSecond: float = 0

avgUniqueUsersPerSecond: float = 0

chatMessages: int = 0

endTime: float

endTime_text: str = ''

firstTimeChatters: int = 0

sampleDuration: float = -1

sample_post_process()

After we have finished adding messages to a particular sample (moving on to the next sample), we call sample_post_process() to process the cumulative data points (so we don’t have to do this every time we add a message)

Also removes the internal fields that don’t need to be output in the JSON object.
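
The per-sample post-processing described above can be sketched as follows. SampleSketch is a hypothetical, reduced version of the Sample class; in particular, filtering out underscore-prefixed fields in public_fields is an assumed way of dropping the internal data, not necessarily what the package does.

```python
from dataclasses import dataclass, field


@dataclass
class SampleSketch:
    """Hypothetical reduced Sample: counters are filled during analysis."""
    startTime: float
    endTime: float
    activity: int = 0
    chatMessages: int = 0
    uniqueUsers: int = 0
    avgActivityPerSecond: float = 0
    avgChatMessagesPerSecond: float = 0
    avgUniqueUsersPerSecond: float = 0
    _userChats: dict = field(default_factory=dict)  # internal: user -> chat count

    def post_process(self) -> None:
        """Compute the derived values once, after the last message was added."""
        duration = self.endTime - self.startTime
        self.uniqueUsers = len(self._userChats)
        if duration > 0:
            self.avgActivityPerSecond = self.activity / duration
            self.avgChatMessagesPerSecond = self.chatMessages / duration
            self.avgUniqueUsersPerSecond = self.uniqueUsers / duration

    def public_fields(self) -> dict:
        """Return only the fields meant for output (assumed: no leading underscore)."""
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}


sample = SampleSketch(0, 5, activity=10, chatMessages=5, _userChats={"a": 3, "b": 2})
sample.post_process()
```

Deferring the averages to a single post-processing step avoids recomputing them for every incoming message.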

startTime: float

startTime_text: str = ''

uniqueUsers: int = 0

class chat_analyzer.dataformat.Section(startTime: float, endTime: float, description: str)

Bases: object

Contains generic information about a notable section of the chatlog

—

[Defined when class Initialized]:

startTime: float: The start time (inclusive) (in seconds) corresponding to a section.
endTime: float: The end time (exclusive) (in seconds) corresponding to a section.
description: str (optional): A description of the section (if any).

[Automatically re-defined on post-init]:

duration: float: The duration (in seconds) of the section (end-start)
duration_text: str: The duration represented in text format (i.e. hh:mm:ss)
startTime_text: str: The start time represented in text format (i.e. hh:mm:ss)
endTime_text: str: The end time represented in text format (i.e. hh:mm:ss)
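
Fields that are "re-defined on post-init" map naturally onto a dataclass __post_init__ hook. The sketch below is an assumption-laden illustration: SectionSketch and seconds_to_text are hypothetical, and the zero-padded hh:mm:ss format with truncated fractional seconds is an assumed formatting choice.

```python
from dataclasses import dataclass


def seconds_to_text(seconds: float) -> str:
    """Format a number of seconds as hh:mm:ss (fractions of a second are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


@dataclass
class SectionSketch:
    """Hypothetical reduced Section: derived fields are overwritten after init."""
    startTime: float
    endTime: float
    description: str = ""
    duration: float = 0.0
    duration_text: str = ""
    startTime_text: str = ""
    endTime_text: str = ""

    def __post_init__(self):
        # Whatever the caller passed for these fields is replaced with computed values.
        self.duration = self.endTime - self.startTime
        self.duration_text = seconds_to_text(self.duration)
        self.startTime_text = seconds_to_text(self.startTime)
        self.endTime_text = seconds_to_text(self.endTime)


section = SectionSketch(3600, 3725.5, "example section")
```

Computing the text fields in __post_init__ keeps them consistent with the numeric fields no matter how the object was constructed.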

description: str

duration: float = 0.0

duration_text: str = ''

endTime: float

endTime_text: str = ''

startTime: float

startTime_text: str = ''

class chat_analyzer.dataformat.Spike(startTime: float, endTime: float, description: str)

Bases: Section

Contains information about an activity spike in the chatlog

TODO: Implement

description: str

endTime: float

startTime: float

class chat_analyzer.dataformat.TwitchChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None, totalSubscriptions: int = 0, totalGiftSubscriptions: int = 0, totalUpgradeSubscriptions: int = 0)

Bases: ChatAnalytics

Extension of the ChatAnalytics class, meant to contain data that all chats have and data specific to Twitch chats.

NOTE: For most Twitch-specific attributes it doesn’t make a lot of sense to continuously report a per-second value, so we don’t!

—

(See ChatAnalytics class for common fields)

[Defined w/ default and modified DURING analysis]:

totalSubscriptions: int: The total number of subscriptions that appeared in the chat (which people purchased themselves).
totalGiftSubscriptions: int: The total number of gift subscriptions that appeared in the chat.
totalUpgradeSubscriptions: int: The total number of upgraded subscriptions that appeared in the chat.

chatlog_post_process(settings)

After we have finished iterating through the chatlog and constructing all of the samples, we call chatlog_post_process() to process the cumulative data points (so we don’t have to do this every time we add a sample).

This step is sometimes referred to as “analysis”.

Also removes the internal fields that don’t need to be output in the JSON object.

Parameters: settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor

process_message(msg): Given a msg object from chat, update common fields and twitch-specific fields

to_JSON()

totalGiftSubscriptions: int = 0

totalSubscriptions: int = 0

totalUpgradeSubscriptions: int = 0

class chat_analyzer.dataformat.TwitchSample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>, subscriptions: int = 0, giftSubscriptions: int = 0, upgradeSubscriptions: int = 0)

Bases: Sample

Class that contains data specific to Twitch of a specific time interval of the chat.

—

[Defined w/ default and modified DURING analysis of sample]:

subscriptions: int: The total number of subscriptions (that people purchased themselves) that appeared in chat within the start/endTime of this sample.
giftSubscriptions: int: The total number of gift subscriptions that appeared in chat within the start/endTime of this sample.
upgradeSubscriptions: int: The total number of upgraded subscriptions that appeared in chat within the start/endTime of this sample.

giftSubscriptions: int = 0

subscriptions: int = 0

upgradeSubscriptions: int = 0

class chat_analyzer.dataformat.YoutubeChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None, totalSuperchats: int = 0, totalMemberships: int = 0)

Bases: ChatAnalytics

Extension of the ChatAnalytics class, meant to contain data that all chats have and data specific to YouTube chats.

NOTE: For most YouTube-specific attributes it doesn’t make a lot of sense to continuously report a per-second value, so we don’t!

—

(See ChatAnalytics class for common fields and descriptions)

[Defined w/ default and modified DURING analysis]:

totalSuperchats: int: The total number of superchats (regular/ticker) that appeared in the chat. NOTE: A creator doesn’t necessarily care what form a superchat takes, so we just combine regular and ticker superchats
totalMemberships: int: The total number of memberships that appeared in the chat.

process_message(msg): Given a msg object from chat, update common fields and youtube-specific fields

to_JSON()

totalMemberships: int = 0

totalSuperchats: int = 0

class chat_analyzer.dataformat.YoutubeSample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>, superchats: int = 0, memberships: int = 0)

Bases: Sample

Class that contains data specific to Youtube of a specific time interval of the chat.

—

[Defined w/ default and modified DURING analysis of sample]:

superchats: int: The total number of superchats (regular/ticker) that appeared in chat within the start/endTime of this sample. NOTE: A creator doesn’t necessarily care what form a superchat takes, so we just combine regular and ticker superchats
memberships: int: The total number of memberships that appeared in chat within the start/endTime of this sample.

memberships: int = 0

superchats: int = 0

chat_analyzer.metadata module

Set metadata for chat-analyzer

chat_analyzer.util module

chat_analyzer.util.dprint(should_print: bool, s: str): Simple styled debug printer

chat_analyzer.util.remove_non_alpha_numeric(s: str) → str: Removes non-alphanumeric characters from a string and replaces all spaces with underscores (useful for normalizing the title of a video before turning it into a filename)
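
A normalization of this kind can be sketched with a regular expression. normalize_title is a hypothetical function; it assumes that spaces are converted before other characters are stripped and that only ASCII letters and digits count as alphanumeric, which may differ from the package's behaviour.

```python
import re


def normalize_title(s: str) -> str:
    """Replace spaces with underscores, then drop everything except letters, digits and underscores."""
    with_underscores = s.replace(" ", "_")
    return re.sub(r"[^A-Za-z0-9_]", "", with_underscores)


filename_stem = normalize_title("My Stream: Part 2!")
```

Converting spaces first matters: stripping non-alphanumeric characters first would delete the spaces and leave nothing to turn into underscores.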

Module contents

Top-level package for chat-analyzer.