chat_analyzer package

Submodules

chat_analyzer.analyzer module

chat_analyzer.analyzer.check_chatlog_downloader_supported(chatlog: Chat, url: str)

Ensures that we are able to properly analyze the downloaded chatlog by enforcing that the metadata matches expected values. (Our own logic depends on certain things being true.)

If there is a breach of compliance, we exit (we consider this to be a fatal error)

Parameters
  • chatlog (chat_downloader.sites.common.Chat) – The chatlog we have downloaded

  • url (str) – The URL of the video we have downloaded the log from

Currently we ensure:
  • The video/stream happened in the past (not currently live or scheduled)

  • The chat is downloaded from a supported platform that we have a proper way of parsing.
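A minimal sketch of the two checks listed above, assuming the relevant metadata has already been copied into a plain dict. The keys `status` and `platform`, the `SUPPORTED_PLATFORMS` set, and the function name `check_supported` are hypothetical and chosen only for illustration; the real function inspects a chat_downloader Chat object.

```python
import sys

# Hypothetical set of accepted platforms, for illustration only.
SUPPORTED_PLATFORMS = {"www.youtube.com", "www.twitch.tv", "youtu.be"}


def check_supported(metadata: dict, url: str) -> None:
    """Exit (fatal error) if the chatlog metadata breaks one of the checks."""
    # Check 1: the video/stream must have happened in the past.
    if metadata.get("status") != "past":
        sys.exit(f"Unsupported: {url} is live or scheduled, not a past stream/VOD")
    # Check 2: the platform must be one we know how to parse.
    if metadata.get("platform") not in SUPPORTED_PLATFORMS:
        sys.exit(f"Unsupported platform for {url}")
```

Calling `sys.exit` with a message raises `SystemExit`, which matches the "fatal error" behaviour described above while still letting a caller or test intercept it.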

chat_analyzer.analyzer.get_ChatAnalytics_from_file(filepath: str)

Given a path to a previous output file of this program containing analytical data, extract the data into a ChatAnalytics object and return it.

Parameters

filepath (str) – a path to a previous output file of this program containing analytical data

Returns

a ChatAnalytics object

Return type

ChatAnalytics

chat_analyzer.analyzer.get_chatlog_downloader(url: str)

Gets a chat-downloader generator using Xenova’s chat-downloader

Parameters
  • url (str) – The URL of the past stream/VOD to download the chat from

Returns

The chatlog we have downloaded

Return type

chat_downloader.sites.common.Chat

chat_analyzer.analyzer.get_chatmsgs_from_chatfile(filepath: str)

Given a path to a chatfile (either directly from Xenova’s downloader or produced by the --save-chatfile-output flag), extract the chat messages from the file into an array of chat messages.

Parameters

filepath (str) – a path to a chatfile (either directly from Xenova’s downloader or produced by the --save-chatfile-output flag)

Returns

an array of chat messages

Return type

list[chat_downloader.sites.common.ChatMessage]
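A minimal sketch of reading a chatfile, assuming the file is a JSON array of message objects. The function name `load_chat_messages` is hypothetical, and the real function may validate or convert the messages differently.

```python
import json
from typing import Any, Dict, List


def load_chat_messages(filepath: str) -> List[Dict[str, Any]]:
    """Read a chatfile and return its messages as a list of dicts."""
    with open(filepath, "r", encoding="utf-8") as f:
        messages = json.load(f)
    # Assumption: a valid chatfile is a JSON array at the top level.
    if not isinstance(messages, list):
        raise ValueError(f"{filepath} does not contain a JSON array of messages")
    return messages
```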

chat_analyzer.analyzer.output_json_to_file(json_obj, filepath)
chat_analyzer.analyzer.run(**kwargs)

Runs the chat-analyzer

Returns

The chat analytics data as a dataclass

Return type

dataformat.ChatAnalytics (dataformat.YoutubeChatAnalytics or dataformat.TwitchChatAnalytics)

chat_analyzer.cli module

class chat_analyzer.cli.SmartFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: ArgumentDefaultsHelpFormatter

Any help string starting with ‘R|’ has its newlines (\n) preserved, in addition to keeping the functionality from the HelpFormatter (displaying defaults next to descriptions).

Adapted from and thanks to: https://stackoverflow.com/questions/3853722/how-to-insert-newlines-on-argparse-help-text
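The behaviour described above can be sketched as follows. This follows the general approach of the linked Stack Overflow answer and is not necessarily the package's exact code; the class name `SmartFormatterSketch` is hypothetical. Note that `_split_lines` is a private argparse hook, so overriding it relies on an implementation detail.

```python
import argparse


class SmartFormatterSketch(argparse.ArgumentDefaultsHelpFormatter):
    """Keep explicit newlines in help strings that start with 'R|'."""

    def _split_lines(self, text, width):
        if text.startswith("R|"):
            # Drop the marker and honour the author's own line breaks.
            return text[2:].splitlines()
        # Otherwise fall back to the default wrapping behaviour.
        return super()._split_lines(text, width)
```

A parser would opt in with `argparse.ArgumentParser(formatter_class=SmartFormatterSketch)`.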

chat_analyzer.cli.check_interval(interval)

Based on the MAX_INTERVAL and MIN_INTERVAL from chat_analyzer.py, ensure that the entered interval respects that boundary

chat_analyzer.cli.check_percentile_float(value)

Check that the value is between 0 and 100 exclusive

chat_analyzer.cli.check_positive_int(value)

Check that the value is a positive integer
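Validators like the two above are typically passed to argparse through the `type=` argument. The following is a minimal sketch of that pattern, not the package's code: the function names, error messages, and the `--percentile` and `--count` flags are hypothetical.

```python
import argparse


def percentile_float(value: str) -> float:
    """Accept a float strictly between 0 and 100."""
    try:
        number = float(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not a number")
    if not 0 < number < 100:
        raise argparse.ArgumentTypeError(f"{value!r} must be between 0 and 100 (exclusive)")
    return number


def positive_int(value: str) -> int:
    """Accept an integer greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} must be a positive integer")
    return number


# Hypothetical flags, used only to show how the validators plug in.
parser = argparse.ArgumentParser()
parser.add_argument("--percentile", type=percentile_float, default=93.0)
parser.add_argument("--count", type=positive_int, default=1)
args = parser.parse_args(["--percentile", "95", "--count", "3"])
```

Raising `argparse.ArgumentTypeError` lets argparse print a usage error for the offending flag instead of a traceback.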

chat_analyzer.cli.main()

chat_analyzer.dataformat module

class chat_analyzer.dataformat.ChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None)

Bases: ABC

Class that contains the results of the chat data analysis/processing.

An instance of a subclass is created and then modified throughout the analysis process. After the processing of the data is complete, the object will contain all relevant results we are looking for.

This class cannot be directly instantiated; see the subclasses YoutubeChatAnalytics & TwitchChatAnalytics. YT and Twitch chats report/record data differently and contain site-specific events, so we centralize common data/functionality and separate specifics into subclasses.

The object can then be converted to JSON/printed/manipulated as desired to format/output the results as necessary.

[Defined when class Initialized]:

duration: float

The total duration (in seconds) of the associated video/media. Message times correspond to the video times.

interval: int

The time interval (in seconds) at which to compress datapoints into samples, i.e. the duration of each sample. The smaller the interval, the more granular the analytics are. At interval=5, each sample contains 5 seconds of cumulative data, with the exception of the last sample, which may be shorter than the interval because the media duration is not necessarily divisible by the interval. The number of samples in raw_data is about (video duration/interval), plus 1 if necessary to encompass the remaining non-divisible data at the end.

description: str

A description included to help distinguish it from other analytical data.

program_version: str

The version of the chat analytics program that was used to generate the data. Helps identify outdated/version-specific data formats.

platform: str

Used to store the platform the data came from: ‘www.youtube.com’, ‘www.twitch.tv’, ‘youtu.be’… While it technically can be determined by the type of subclass, this makes for easier conversion to JSON/output
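To make the relationship between duration and interval concrete, here is a small sketch that computes the sample boundaries for a given duration. It assumes the sample count is the duration divided by the interval, rounded up; `sample_bounds` is a hypothetical helper and not part of the package.

```python
from math import ceil
from typing import List, Tuple


def sample_bounds(duration: float, interval: int) -> List[Tuple[float, float]]:
    """Return (start, end) pairs covering [0, duration) in interval-sized steps."""
    count = ceil(duration / interval)
    # The final sample is clipped to the media duration, so it may be shorter.
    return [(i * interval, min((i + 1) * interval, duration)) for i in range(count)]


bounds = sample_bounds(23, 5)
# 5 samples; the last one covers only 3 seconds: (20, 23)
```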

[Automatically re-defined on post-init]:

duration_text: str

String representation of the media duration time.

interval_text: str

String representation of the interval time.

[Defined w/ default and modified DURING analysis]:

mediaTitle: str

The title of the media associated with the chatlog.

mediaSource: str

The link to the media associated with the chatlog (URL that it was originally downloaded from or filepath of a chatfile).

samples: List[Sample]

An array of sequential samples, each corresponding to data about a section of chat of ‘interval’ seconds long. Each sample has specific data corresponding to a time interval of the vid. See the ‘Sample’ class

totalActivity: int

The total number of messages/things (of any type!) that appeared in chat. (Sum of intervalActivity from all samples.) Includes messages, notifications, subscriptions, superchats, … anything that appeared in chat.

totalChatMessages: int

The total number of chats sent by human (non-system) users (what is traditionally thought of as a chat) NOTE: Difficult to discern bots from humans other than just creating a known list of popular bots and blacklisting, because not all sites (YT/Twitch) provide information on whether chat was sent by a registered bot or not.

highlight_percentile: float

The cutoff percentile that samples must meet to be considered a highlight

highlight_metric: str

The metric to use for engagement analysis to build highlights. NOTE: must be converted into actual Sample field name before use.

[Defined w/ default and modified AFTER analysis]:

totalUniqueUsers: int

The total number of unique users that sent a chat message (human users that sent at least one traditional chat)

overallAvgActivityPerSecond: float

The average activity per second across the whole chatlog. (totalActivity/totalDuration)

overallAvgChatMessagesPerSecond: float

The average number of chat messages per second across the whole chatlog. (totalChatMessages/totalDuration)

overallAvgUniqueUsersPerSecond: float

The average number of unique users chatting per second.

highlights: List[Highlight]

A list of the high engagement sections of the chatlog.

highlights_duration: float

The cumulative duration of the highlights (in seconds)

highlights_duration_text: str

The cumulative duration of the highlights represented in text format (i.e. hh:mm:ss)

spikes: List[Spike]

Not yet implemented (TODO). A list of the calculated spikes in the chatlog. May contain spikes of different types, identifiable by the spike’s type field.

chatlog_post_process(settings: ProcessSettings)

After we have finished iterating through the chatlog and constructing all of the samples, we call chatlog_post_process() to process the cumulative data points (so we don’t have to do this every time we add a sample).

This step is sometimes referred to as “analysis”.

Also removes the internal fields that don’t need to be output in the JSON object.

Parameters

settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor

create_new_sample()

Post-processes the previous sample, then appends & creates a new sample following the previous sample sequentially. If a previous sample doesn’t exist, creates the first sample.

NOTE: If there are only 2 chats, one at time 0:03 and the other at 5:09:12, there are still a lot of empty samples in between (because we still want to graph/track the silent periods with temporal stability).
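The note above can be illustrated with a small sketch that buckets message timestamps into fixed-width samples. It assumes a message at time t belongs to sample number t // interval; `bucket_counts` is a hypothetical helper that only counts messages, whereas the real samples carry many more fields.

```python
from math import ceil
from typing import List


def bucket_counts(timestamps: List[float], interval: int, duration: float) -> List[int]:
    """Count messages per sample, keeping empty samples in place."""
    count = max(1, ceil(duration / interval))
    counts = [0] * count
    for t in timestamps:
        # Clamp so a message exactly at the end lands in the last sample.
        index = min(int(t // interval), count - 1)
        counts[index] += 1
    return counts


first = 3                          # 0:03
second = 5 * 3600 + 9 * 60 + 12    # 5:09:12 -> 18552 seconds
counts = bucket_counts([first, second], interval=5, duration=18555)
non_empty = sum(1 for c in counts if c > 0)
# len(counts) is 3711 and non_empty is 2: every other sample is empty
```

Keeping the empty samples means index i always corresponds to the time range starting at i * interval, which is what makes the data straightforward to graph.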

description: str
duration: float
duration_text: str = ''
get_highlights(highlight_metric: str, highlight_percentile: float)

Highlights reference a contiguous period of time where the provided metric remains above the percentile threshold. Finds and returns a list of highlights, each referencing the start and end times of a contiguous run of samples whose highlight_metric meets the highlight_percentile cutoff.

A highlight may reference more than one sample if contiguous samples meet the percentile cutoff.

Samples in the top ‘percentile’% of the selected engagement metric will be considered high-engagement samples and included in the highlights output list. The larger the percentile, the greater the metric requirement before being reported. If ‘engagement-percentile’=93.0, any sample in the 93rd percentile (top 7.0%) of the selected metric will be considered an engagement highlight.

These high-engagement portions of the chatlog are stored as highlights, and may last for multiple samples.

This method should only be called after the averages have been calculated, ensuring accurate results when determining periods of high engagement.

Parameters
  • highlight_metric – The metric samples are compared to determine if they are high-engagement samples. NOTE: Internally converted to the actual field name of a sample field.

  • highlight_percentile – The cutoff percentile that the samples must meet to be included in a highlight

Returns

a list of highlights referencing samples that met the percentile cutoff requirements for the provided metric

Return type

List[Highlight]
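The grouping described above can be sketched as a percentile cutoff followed by a single pass that merges adjacent qualifying samples. This is an illustrative sketch only: `MiniSample`, `MiniHighlight`, and the helper functions are hypothetical, and the nearest-rank percentile used here is an assumption (the package may compute the cutoff differently).

```python
from dataclasses import dataclass
from math import ceil
from typing import List


@dataclass
class MiniSample:
    startTime: float
    endTime: float
    value: float  # stands in for the selected metric field


@dataclass
class MiniHighlight:
    startTime: float
    endTime: float
    peak: float
    avg: float


def nearest_rank_cutoff(values: List[float], percentile: float) -> float:
    """Smallest value such that `percentile`% of values are <= it."""
    ordered = sorted(values)
    rank = max(1, ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def merge_run(run: List[MiniSample]) -> MiniHighlight:
    """Collapse a run of adjacent samples into one highlight."""
    values = [s.value for s in run]
    return MiniHighlight(run[0].startTime, run[-1].endTime,
                         max(values), sum(values) / len(values))


def find_highlights(samples: List[MiniSample], percentile: float) -> List[MiniHighlight]:
    if not samples:
        return []
    cutoff = nearest_rank_cutoff([s.value for s in samples], percentile)
    highlights: List[MiniHighlight] = []
    run: List[MiniSample] = []
    for sample in samples:
        if sample.value >= cutoff:
            run.append(sample)      # extend the current contiguous run
        elif run:
            highlights.append(merge_run(run))
            run = []
    if run:                         # close a run that reaches the end
        highlights.append(merge_run(run))
    return highlights


values = [1, 1, 9, 8, 1, 1, 1, 7, 1, 1]
samples = [MiniSample(i * 5, (i + 1) * 5, v) for i, v in enumerate(values)]
highlights = find_highlights(samples, 80)
```

With these ten 5-second samples and a percentile of 80, the cutoff is 7, so the result contains two highlights: one spanning [10, 20) built from two samples and one spanning [35, 40) built from a single sample.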

get_spikes(spike_sensitivity, spike_metric)

A spike is a point in the chatlog where from one sample to the next, there is a sharp increase in the provided metric.

Open question: should spikes be sustained? Alternative definition: a spike is a point in the chatlog where the activity is significantly different from the average activity. Activity is significantly different if it is > avg*SPIKE_MULT_THRESHOLD. We detect a spike if the high activity level is maintained for at least SPIKE_SUSTAIN_REQUIREMENT samples.

highlight_metric: str = ''
highlight_percentile: float = 0
highlights: List[Highlight]
highlights_duration: float = 0
highlights_duration_text: str = ''
interval: int
interval_text: str = ''
mediaSource: str = 'No Media Source'
mediaTitle: str = 'No Media Title'
overallAvgActivityPerSecond: float = 0
overallAvgChatMessagesPerSecond: float = 0
overallAvgUniqueUsersPerSecond: float = 0
platform: str
print_process_progress(msg, idx, finished=False)

Prints progress of the chat download/process to the console.

If finished is true, normal printing is skipped and the last bar of progress is printed. This is important because we print progress every UPDATE_PROGRESS_INTERVAL messages, and the total number of messages is not usually divisible by this. We therefore have to slightly change the approach to printing progress for this special case.

process_chatlog(chatlog: Chat, source: str, settings: ProcessSettings)

Iterates through the whole chatlog and calculates the analytical data (Modifies and stores in a ChatAnalytics object).

Parameters
  • chatlog (chat_downloader.sites.common.Chat) – The chatlog we have downloaded

  • source (str) – The source of the media associated with the chatlog: the URL of the media we have downloaded the log from, or a filepath

  • settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor

process_message(msg)

Given a msg object from chat, update appropriate statistics based on the chat

program_version: str
samples: List[Sample]
spikes: List[Spike]
to_JSON()
totalActivity: int = 0
totalChatMessages: int = 0
totalUniqueUsers: int = 0
class chat_analyzer.dataformat.Highlight(startTime: float, endTime: float, description: str, type: str, peak: float, avg: float)

Bases: Section

Highlights reference a contiguous period of time where the provided metric remains above the percentile threshold.

type: str

The engagement metric. i.e. “avgActivityPerSecond”, “avgChatMessagesPerSecond”, “avgUniqueUsersPerSecond”, etc. NOTE: It is stored as its converted value (the name of the actual field), NOT the metric str the user provided in the CLI.

peak: float

The maximum value of the engagement metric throughout the whole Highlight (among the samples in the Highlight).

avg: float

The average value of the engagement metric throughout the whole Highlight (among the samples in the Highlight).

avg: float
peak: float
type: str
class chat_analyzer.dataformat.ProcessSettings(print_interval: int, msg_break: int, highlight_percentile: float, highlight_metric: str, spike_sensitivity: float)

Bases: object

Utility class for passing information from the analyzer to the chatlog processor and post-processor

print_interval: int

After every ‘print_interval’ messages, print a progress message. If <=0, progress printing is disabled.

msg_break: int

(Mainly for debugging) Stop processing messages after msg_break messages have been processed.

highlight_percentile: float

The cutoff percentile that samples must meet to be considered a highlight

highlight_metric: str

The metric to use for engagement analysis to build highlights. NOTE: must be converted into actual Sample field name before use.

spike_sensitivity: float

How sensitive the spike detector is at picking up spikes. Higher sensitivity means more spikes are detected.

highlight_metric: str
highlight_percentile: float
msg_break: int
print_interval: int
spike_sensitivity: float
class chat_analyzer.dataformat.Sample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>)

Bases: object

Class that contains data of a specific time interval of the chat. Messages will be included in a sample if they are contained within [startTime, endTime)

[Defined when class Initialized]:

startTime: float

The start time (inclusive) (in seconds) corresponding to a sample.

endTime: float

The end time (exclusive) (in seconds) corresponding to a sample.

[Automatically Defined on init]:

startTime_text: str

The start time represented in text format (i.e. hh:mm:ss)

endTime_text: str

The end time represented in text format (i.e. hh:mm:ss)

sampleDuration: float

The duration (in seconds) of the sample (end-start). NOTE: Should equal the selected interval for every sample except possibly the last, which is shorter when the total duration of the chat is not divisible by the interval.

[Defined w/ default and modified DURING analysis of sample]:

activity: int

The total number of messages/things (of any type!) that appeared in chat within the start/endTime of this sample. Includes messages, notifications, subscriptions, superchats, … anything that appeared in chat.

chatMessages: int

The total number of chats sent by human (non-system) users (what is traditionally thought of as a chat) NOTE: Difficult to discern bots from humans other than just creating a known list of popular bots and blacklisting, because not all sites (YT/Twitch) provide information on whether chat was sent by a registered bot or not.

firstTimeChatters: int

The total number of users who sent their first message of the whole stream during this sample interval

[Defined w/ default and modified AFTER analysis of sample]:

uniqueUsers: int

The total number of unique users that sent a chat message across this sample interval (len(self._userChats))

avgActivityPerSecond: float

The average activity per second across this sample interval. (activity/sampleDuration)

avgChatMessagesPerSecond: float

The average number of chat messages per second across this sample interval. (chatMessages/sampleDuration)

avgUniqueUsersPerSecond: float

The average number of unique users that sent a chat across this sample interval. (uniqueUsers/sampleDuration)

activity: int = 0
avgActivityPerSecond: float = 0
avgChatMessagesPerSecond: float = 0
avgUniqueUsersPerSecond: float = 0
chatMessages: int = 0
endTime: float
endTime_text: str = ''
firstTimeChatters: int = 0
sampleDuration: float = -1
sample_post_process()

After we have finished adding messages to a particular sample (moving on to the next sample), we call sample_post_process() to process the cumulative data points (so we don’t have to do this every time we add a message)

Also removes the internal fields that don’t need to be output in the JSON object.
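A reduced sketch of what this post-processing step computes, using the formulas given in the field descriptions above. `MiniSample` is a hypothetical, cut-down stand-in for Sample, and filtering underscore-prefixed fields out of a dict is only one possible way to drop internal fields; the package may do this differently.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class MiniSample:
    startTime: float
    endTime: float
    activity: int = 0
    chatMessages: int = 0
    uniqueUsers: int = 0
    avgActivityPerSecond: float = 0
    avgChatMessagesPerSecond: float = 0
    avgUniqueUsersPerSecond: float = 0
    # Internal bookkeeping: user id -> number of chats in this sample.
    _userChats: dict = field(default_factory=dict)

    def post_process(self) -> dict:
        """Fill in derived fields once, then return the public fields."""
        duration = self.endTime - self.startTime
        self.uniqueUsers = len(self._userChats)
        self.avgActivityPerSecond = self.activity / duration
        self.avgChatMessagesPerSecond = self.chatMessages / duration
        self.avgUniqueUsersPerSecond = self.uniqueUsers / duration
        # Leave out internal (underscore-prefixed) fields from the output.
        return {k: v for k, v in asdict(self).items() if not k.startswith("_")}


sample = MiniSample(0, 5, activity=10, chatMessages=8, _userChats={"a": 5, "b": 3})
output = sample.post_process()
```

Deferring the divisions to a single call per sample avoids recomputing the averages on every incoming message.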

startTime: float
startTime_text: str = ''
uniqueUsers: int = 0
class chat_analyzer.dataformat.Section(startTime: float, endTime: float, description: str)

Bases: object

Contains generic information about a notable section of the chatlog

[Defined when class Initialized]:

startTime: float

The start time (inclusive) (in seconds) corresponding to a section.

endTime: float

The end time (exclusive) (in seconds) corresponding to a section.

description: str (optional)

A description of the section (if any).

[Automatically re-defined on post-init]:

duration: float

The duration (in seconds) of the section (end-start)

duration_text: str

The duration represented in text format (i.e. hh:mm:ss)

startTime_text: str

The start time represented in text format (i.e. hh:mm:ss)

endTime_text: str

The end time represented in text format (i.e. hh:mm:ss)
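The *_text fields hold a human-readable form of a value in seconds. A minimal sketch of such a conversion is shown below; `seconds_to_text` is a hypothetical helper, and the zero-padded hh:mm:ss layout with truncated fractional seconds is an assumption about the exact output format.

```python
def seconds_to_text(seconds: float) -> str:
    """Format a duration in seconds as zero-padded hh:mm:ss."""
    total = int(seconds)  # fractional seconds are truncated
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


text = seconds_to_text(18552)  # "05:09:12"
```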

description: str
duration: float = 0.0
duration_text: str = ''
endTime: float
endTime_text: str = ''
startTime: float
startTime_text: str = ''
class chat_analyzer.dataformat.Spike(startTime: float, endTime: float, description: str)

Bases: Section

Contains information about an activity spike in the chatlog

TODO: Implement

description: str
endTime: float
startTime: float
class chat_analyzer.dataformat.TwitchChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None, totalSubscriptions: int = 0, totalGiftSubscriptions: int = 0, totalUpgradeSubscriptions: int = 0)

Bases: ChatAnalytics

Extension of the ChatAnalytics class, meant to contain data that all chats have and data specific to Twitch chats.

NOTE: For most Twitch-specific attributes it doesn’t make a lot of sense to continuously report a per-second value, so we don’t!

(See ChatAnalytics class for common fields)

[Defined w/ default and modified DURING analysis]:

totalSubscriptions: int

The total number of subscriptions that appeared in the chat (which people purchased themselves).

totalGiftSubscriptions: int

The total number of gift subscriptions that appeared in the chat.

totalUpgradeSubscriptions: int

The total number of upgraded subscriptions that appeared in the chat.

chatlog_post_process(settings)

After we have finished iterating through the chatlog and constructing all of the samples, we call chatlog_post_process() to process the cumulative data points (so we don’t have to do this every time we add a sample).

This step is sometimes referred to as “analysis”.

Also removes the internal fields that don’t need to be output in the JSON object.

Parameters

settings (ProcessSettings) – Utility class for passing information from the analyzer to the chatlog processor and post-processor

process_message(msg)

Given a msg object from chat, update common fields and twitch-specific fields

to_JSON()
totalGiftSubscriptions: int = 0
totalSubscriptions: int = 0
totalUpgradeSubscriptions: int = 0
class chat_analyzer.dataformat.TwitchSample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>, subscriptions: int = 0, giftSubscriptions: int = 0, upgradeSubscriptions: int = 0)

Bases: Sample

Class that contains data specific to Twitch of a specific time interval of the chat.

[Defined w/ default and modified DURING analysis of sample]:

subscriptions: int

The total number of subscriptions (that people purchased themselves) that appeared in chat within the start/endTime of this sample.

giftSubscriptions: int

The total number of gift subscriptions that appeared in chat within the start/endTime of this sample.

upgradeSubscriptions: int

The total number of upgraded subscriptions that appeared in chat within the start/endTime of this sample.

giftSubscriptions: int = 0
subscriptions: int = 0
upgradeSubscriptions: int = 0
class chat_analyzer.dataformat.YoutubeChatAnalytics(duration: float, interval: int, description: str, program_version: str, platform: str, duration_text: str = '', interval_text: str = '', mediaTitle: str = 'No Media Title', mediaSource: str = 'No Media Source', samples: ~typing.List[~chat_analyzer.dataformat.Sample] = <factory>, totalActivity: int = 0, totalChatMessages: int = 0, totalUniqueUsers: int = 0, overallAvgActivityPerSecond: float = 0, overallAvgChatMessagesPerSecond: float = 0, overallAvgUniqueUsersPerSecond: float = 0, highlights: ~typing.List[~chat_analyzer.dataformat.Highlight] = <factory>, highlights_duration: float = 0, highlights_duration_text: str = '', highlight_percentile: float = 0, highlight_metric: str = '', spikes: ~typing.List[~chat_analyzer.dataformat.Spike] = <factory>, _overallUserChats: dict = <factory>, _currentSample: ~typing.Optional[~chat_analyzer.dataformat.Sample] = None, totalSuperchats: int = 0, totalMemberships: int = 0)

Bases: ChatAnalytics

Extension of the ChatAnalytics class, meant to contain data that all chats have and data specific to YouTube chats.

NOTE: For most YouTube-specific attributes it doesn’t make a lot of sense to continuously report a per-second value, so we don’t!

(See ChatAnalytics class for common fields and descriptions)

[Defined w/ default and modified DURING analysis]:

totalSuperchats: int

The total number of superchats (regular/ticker) that appeared in the chat. NOTE: A creator doesn’t necessarily care what form a superchat takes, so we just combine regular and ticker superchats

totalMemberships: int

The total number of memberships that appeared in the chat.

process_message(msg)

Given a msg object from chat, update common fields and youtube-specific fields

to_JSON()
totalMemberships: int = 0
totalSuperchats: int = 0
class chat_analyzer.dataformat.YoutubeSample(startTime: float, endTime: float, sampleDuration: float = -1, startTime_text: str = '', endTime_text: str = '', activity: int = 0, chatMessages: int = 0, firstTimeChatters: int = 0, uniqueUsers: int = 0, avgActivityPerSecond: float = 0, avgChatMessagesPerSecond: float = 0, avgUniqueUsersPerSecond: float = 0, _userChats: dict = <factory>, superchats: int = 0, memberships: int = 0)

Bases: Sample

Class that contains data specific to Youtube of a specific time interval of the chat.

[Defined w/ default and modified DURING analysis of sample]:

superchats: int

The total number of superchats (regular/ticker) that appeared in chat within the start/endTime of this sample. NOTE: A creator doesn’t necessarily care what form a superchat takes, so we just combine regular and ticker superchats

memberships: int

The total number of memberships that appeared in chat within the start/endTime of this sample.

memberships: int = 0
superchats: int = 0

chat_analyzer.metadata module

Set metadata for chat-analyzer

chat_analyzer.util module

chat_analyzer.util.dprint(should_print: bool, s: str)

Simple styled debug printer

chat_analyzer.util.remove_non_alpha_numeric(s: str) → str

Removes non-alphanumeric characters from a string and replaces all spaces with underscores. (Useful for normalizing the title of a video before turning it into a filename.)
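One way to get this behaviour is sketched below. It assumes spaces are converted to underscores first and that everything other than ASCII letters, digits, and underscores is then dropped; the function name `normalize_title` is hypothetical and the package's implementation may differ in these details.

```python
import re


def normalize_title(s: str) -> str:
    """Turn a free-form title into a filename-friendly string."""
    # Replace spaces first, otherwise they would be stripped with the rest.
    underscored = s.replace(" ", "_")
    return re.sub(r"[^A-Za-z0-9_]", "", underscored)


name = normalize_title("My Stream: Part #2!")  # "My_Stream_Part_2"
```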

Module contents

Top-level package for chat-analyzer.