Models

lda_over_time.models.lda_seq_model

LdaSeqModel brings the functionality of Gensim’s LdaSeqModel to our library.

Its main advantage over other models is that it can detect changes in the vocabulary used to describe each topic over time, which makes it more precise in classifying each topic. The trade-off is that it is slower to run.

class lda_over_time.models.lda_seq_model.LdaSeqModel(corpus: List[str], dates: List[str], date_format: str, freq: str, n_topics: int = 100, sep: Optional[str] = None, workers: Optional[int] = None)[source]

Bases: DtmModelInterface

LdaSeqModel wraps Gensim’s LdaSeqModel, which models how the vocabulary used to discuss a topic varies over time, so it can better detect shifts in the words associated with each topic.

With this feature, it may be more precise than PrevalenceModel, but it is slower.

Parameters

corpus (list[str]) – List of documents’ texts.
dates (list[str]) – List of documents’ publishing dates.
date_format (str) – The date format used in dates, e.g. “%Y/%m/%d” for “YYYY/MM/DD” format. More info at documentation.
freq (str) – The frequency used to group texts, e.g. “1M15D” for a frequency of a month and 15 days. Useful notations: day = “D”; month = “M”; year = “Y”. More info at pandas.
n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.
sep (str, optional) – Separator used to split each text into words. The default is any whitespace.
workers (int, optional) – Number of workers (CPUs) to use. If not provided, the total number of threads on the running machine is used.

Returns

Nothing

Return type

None
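To make the roles of date_format and freq concrete, here is a minimal sketch using only the Python standard library. It is not the library's implementation: the helper group_by_month is hypothetical and only covers the simple case of one-month time slices, whereas freq accepts pandas-style frequency strings.

```python
from collections import defaultdict
from datetime import datetime


def group_by_month(corpus, dates, date_format):
    """Group documents into monthly time slices (hypothetical helper)."""
    slices = defaultdict(list)
    for text, raw_date in zip(corpus, dates):
        # date_format tells strptime how to read each date string.
        parsed = datetime.strptime(raw_date, date_format)
        slices[(parsed.year, parsed.month)].append(text)
    # Return the slices in chronological order, oldest first.
    return [slices[key] for key in sorted(slices)]


# Made-up documents and dates, for illustration only.
corpus = ["first doc", "second doc", "third doc"]
dates = ["2021/01/05", "2021/01/20", "2021/02/01"]
time_slices = group_by_month(corpus, dates, "%Y/%m/%d")
print(time_slices)
```

Each position in corpus must line up with the same position in dates, which is why the sketch iterates over both lists together.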

get_results()[source]

This method should return a table representing the evolution of each topic over time.

Returns: A pandas DataFrame where each row represents a time slice, with a date column and one column per topic holding that topic’s weight in the period.
Return type: pd.core.frame.DataFrame

get_topic_words(topic_id, i, n)[source]

This method returns the top n words that best describe the topic in a specific time slice.

Parameters

topic_id (int) – The id of the desired topic.
i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice is indexed by 1.
n (int) – How many of the words that best describe the topic in the given time slice should be returned.

Returns

A list of the top n words that best describe the requested topic in the given time slice.

Return type

list[str]
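The 1-based time slice index and the top-n selection described above can be illustrated with a small, library-independent sketch. The function top_words and the word weights are hypothetical and made up for illustration; they do not reflect how the model stores or ranks words internally.

```python
def top_words(word_weights_per_slice, i, n):
    """Return the n highest-weighted words of time slice i (1-based)."""
    if not 1 <= i <= len(word_weights_per_slice):
        raise IndexError("time slice index must be between 1 and the number of slices")
    weights = word_weights_per_slice[i - 1]  # convert the 1-based index to 0-based
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n]


# Made-up weights for a single topic across two time slices.
topic_weights = [
    {"vaccine": 0.30, "trial": 0.25, "dose": 0.10},
    {"vaccine": 0.20, "booster": 0.35, "dose": 0.15},
]
first = top_words(topic_weights, 1, 2)   # oldest time slice
second = top_words(topic_weights, 2, 2)  # next time slice
print(first, second)
```

The two calls return different words for the same topic, which is the kind of vocabulary drift this model is meant to capture.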

property n_timeslices

This attribute should be the number of timeslices found during training.

Returns: The number of time slices found in the corpus.
Return type: int

prepare_args(i)[source]

This method should return a dictionary with all the values needed to call the PyLdaVis.prepare method.

Parameters: i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.
Returns: It returns a dictionary ready to be passed to PyLdaVis
Return type: dict[str, any]

train()[source]

Train the DTM model.

Returns: Nothing.
Return type: None
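Putting the methods above together, a typical workflow might look like the sketch below. It uses only the constructor arguments and methods documented in this section; the wrapper function run_lda_seq_example, the chosen parameter values, and the assumption that train() must be called before reading results are illustrative rather than confirmed behaviour.

```python
def run_lda_seq_example(corpus, dates, topic_id):
    """Train an LdaSeqModel and return its results and some topic words."""
    # Imported inside the function so the sketch can be read and loaded
    # without the package installed.
    from lda_over_time.models.lda_seq_model import LdaSeqModel

    model = LdaSeqModel(
        corpus=corpus,            # list of document texts
        dates=dates,              # one date string per document
        date_format="%Y/%m/%d",   # must match the strings in dates
        freq="1M",                # one-month time slices
        n_topics=5,               # illustrative value; the default is 100
    )
    model.train()  # assumed to be required before querying results
    results = model.get_results()
    # Top 10 words of the requested topic in the oldest time slice (index 1).
    words = model.get_topic_words(topic_id=topic_id, i=1, n=10)
    return results, words
```

Because this model is the slower of the two, a small n_topics and a short corpus are reasonable choices for a first experiment.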

lda_over_time.models.temporal_lda_model

TemporalLdaModel is a simpler and faster temporal LDA that returns the proportion of main topics in each time slice.

Its main advantage over other models is that it is fast. However, it may not cope well when the way a topic is presented changes, that is, when the vocabulary used to describe the topic varies across the dataset.

class lda_over_time.models.temporal_lda_model.TemporalLdaModel(corpus: List[str], dates: List[str], date_format: str, freq: str, n_topics: int = 100, sep: Optional[str] = None, workers: Optional[int] = None, aggregator: str = 'average')[source]

Bases: DtmModelInterface

TemporalLdaModel is a simple temporal LDA model. It is faster, but it may not handle the evolution of topics well, because the vocabulary used for a given topic may vary over time.

Parameters

corpus (list[str]) – Each item in the list is one document from the corpus.
dates (list[str]) – List of timestamps for each document in the corpus; each date’s position should match that of its respective text.
date_format (str) – The date format used in dates, e.g. “%Y/%m/%d” for “YYYY/MM/DD” format. More info at documentation.
freq (str) – The frequency used to group texts, e.g. “1M15D” for a frequency of a month and 15 days. Useful notations: day = “D”; month = “M”; year = “Y”. More info at pandas.
n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.
sep (str, optional) – Separator used to split each text into words. The default is any whitespace.
workers (int, optional) – Number of workers (CPUs) to use. If not provided, the total number of threads on the running machine is used.
aggregator (str, optional) – Specifies how to aggregate the documents in a time slice and calculate topic proportions. It can be either average, which calculates the average of each topic’s weight in each time slice, or the proportion of main topics in each time slice. The default is average.

Returns

Nothing

Return type

None
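The difference between the two aggregation strategies described for aggregator can be sketched in plain Python. This is one plausible reading of the description, not the library's code: the function names average_weights and main_topic_proportions are hypothetical, and the per-document topic weights are made up.

```python
from collections import Counter


def average_weights(doc_topic_weights):
    """Average each topic's weight over the documents in one time slice."""
    n_docs = len(doc_topic_weights)
    n_topics = len(doc_topic_weights[0])
    return [sum(doc[t] for doc in doc_topic_weights) / n_docs for t in range(n_topics)]


def main_topic_proportions(doc_topic_weights):
    """Share of documents whose highest-weighted (main) topic is each topic."""
    n_docs = len(doc_topic_weights)
    n_topics = len(doc_topic_weights[0])
    # For each document, find the index of its highest-weighted topic.
    counts = Counter(max(range(n_topics), key=doc.__getitem__) for doc in doc_topic_weights)
    return [counts[t] / n_docs for t in range(n_topics)]


# Four made-up documents in one time slice, each with weights for two topics.
slice_docs = [
    [0.75, 0.25],
    [0.75, 0.25],
    [0.75, 0.25],
    [0.25, 0.75],
]
averaged = average_weights(slice_docs)
proportions = main_topic_proportions(slice_docs)
print(averaged, proportions)
```

Averaging keeps the contribution of secondary topics in every document, while counting main topics only records which topic dominates each document, so the two strategies can give noticeably different proportions for the same time slice.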

get_results()[source]

This method should return a table representing the evolution of each topic over time.

Returns: A pandas DataFrame where each row represents a time slice, with a date column and one column per topic holding that topic’s weight in the period.
Return type: pd.core.frame.DataFrame

get_topic_words(topic_id, i, n)[source]

This method should return the top n words that best describe the topic in a specific time slice.

Parameters

topic_id (int) – The id of the desired topic.
i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice is indexed by 1.
n (int) – How many of the words that best describe the topic in the given time slice should be returned.

Returns: A list of the top n words that best describe the requested topic in the given time slice.
Return type: list[str]

property n_timeslices

This attribute should be the number of timeslices found during training.

Returns: The number of time slices found in the corpus.
Return type: int

prepare_args(i)[source]

This method should return a dictionary with all the values needed to call the PyLdaVis.prepare method.

Parameters: i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.
Returns: It returns a dictionary ready to be passed to PyLdaVis
Return type: dict[str, any]

train()[source]

Train the DTM model.

Returns: Nothing.
Return type: None