Models

lda_over_time.models.lda_seq_model

LdaSeqModel brings the functionality of Gensim’s LdaSeqModel to this library.

Its main advantage over other models is that it can detect changes over time in the vocabulary used to describe each topic, which makes its topic classification more precise. The trade-off is that it is slower to run.

class lda_over_time.models.lda_seq_model.LdaSeqModel(corpus: List[str], dates: List[str], date_format: str, freq: str, n_topics: int = 100, sep: Optional[str] = None, workers: Optional[int] = None)[source]

Bases: DtmModelInterface

LdaSeqModel wraps Gensim’s LdaSeqModel, which models how the treatment of a topic varies over time (it is better at detecting changes in the vocabulary used to discuss a topic).

With this feature, it may be more precise than PrevalenceModel, but it is slower.

Parameters
  • corpus (list[str]) – List of documents’ texts.

  • dates (list[str]) – List of documents’ publishing dates.

  • date_format (str) – The date format used in dates, e.g. “%Y/%m/%d” for “YYYY/MM/DD” format. More info in the documentation of Python’s strftime() and strptime() format codes.

  • freq (str) – The frequency used to group texts, e.g. “1M15D” for a frequency of one month and 15 days. Useful notations: day = “D”; month = “M”; year = “Y”. More info in the pandas documentation.

  • n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.

  • sep (str, optional) – Separator used to split the text into words. The default is any whitespace.

  • workers (int, optional) – Number of workers (CPUs) to use. If not provided, it uses the total number of threads available on the running machine.

Returns

Nothing

Return type

None
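To illustrate how date_format and freq work together, the following minimal sketch parses date strings with Python’s datetime.strptime and groups documents into calendar-month buckets, which is roughly the idea a monthly frequency expresses. It is not the library’s implementation: the helper name group_by_month and the sample data are hypothetical, and the library’s own grouping may differ.

```python
from collections import defaultdict
from datetime import datetime


def group_by_month(corpus, dates, date_format):
    """Group documents into (year, month) buckets, oldest first.

    Hypothetical helper for illustration only; the library's own
    grouping logic may differ.
    """
    if len(corpus) != len(dates):
        raise ValueError("corpus and dates must have the same length")

    buckets = defaultdict(list)
    for text, date in zip(corpus, dates):
        # strptime raises ValueError if the date does not match the format.
        parsed = datetime.strptime(date, date_format)
        buckets[(parsed.year, parsed.month)].append(text)

    # Sort the keys chronologically so the first bucket is the oldest slice.
    return [buckets[key] for key in sorted(buckets)]


# Made-up sample data.
corpus = ["first text", "second text", "third text"]
dates = ["2021/01/05", "2021/02/10", "2021/01/20"]
slices = group_by_month(corpus, dates, "%Y/%m/%d")
print(slices)
```

Each date must sit at the same position as its document, which is why the sketch rejects lists of different lengths.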

get_results()[source]

This method should return a table representing the evolution of each topic over time.

Returns

A pandas DataFrame in which each row represents a time slice, with a date and one column per topic holding that topic’s weight in that period.

Return type

pd.core.frame.DataFrame
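As a sketch of how such a table might be consumed, the example below finds the heaviest topic in each time slice. It assumes one record per time slice, with a date field and one weight per topic. The column names ("date" and integer topic ids) and the sample values are hypothetical; with a real DataFrame, df.to_dict("records") produces records of this shape.

```python
def dominant_topics(rows, date_key="date"):
    """Return (date, topic, weight) for the heaviest topic in each record."""
    summary = []
    for row in rows:
        # Every key except the date is treated as a topic weight.
        weights = {key: value for key, value in row.items() if key != date_key}
        topic = max(weights, key=weights.get)
        summary.append((row[date_key], topic, weights[topic]))
    return summary


# Hypothetical records standing in for the rows of the results table.
rows = [
    {"date": "2021-01", 0: 0.7, 1: 0.3},
    {"date": "2021-02", 0: 0.4, 1: 0.6},
]
print(dominant_topics(rows))
```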

get_topic_words(topic_id, i, n)[source]

This method returns the top n words that best describe the topic in a specific time slice.

Parameters
  • topic_id (int) – The id of the desired topic.

  • i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice has index 1.

  • n (int) – The number of top words to return for the topic at the given time slice.

Returns

A list of the top n words that best describe the requested topic in the given time slice.

Return type

list[str]
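The 1-based time slice index and the top-n selection can be sketched as follows. This is only an illustration of the idea: the function top_words, the per-slice dictionaries, and the weights are hypothetical and do not reflect how the model stores its word distributions.

```python
def top_words(word_weights_per_slice, i, n):
    """Return the n highest-weighted words for the 1-indexed time slice i."""
    if not 1 <= i <= len(word_weights_per_slice):
        raise IndexError("time slice index out of range")
    # Convert the 1-based slice index to a 0-based list position.
    weights = word_weights_per_slice[i - 1]
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n]


# Hypothetical word weights of a single topic in two consecutive time slices.
topic_over_time = [
    {"vaccine": 0.30, "trial": 0.25, "dose": 0.10},
    {"vaccine": 0.20, "booster": 0.35, "dose": 0.15},
]
print(top_words(topic_over_time, 1, 2))
print(top_words(topic_over_time, 2, 2))
```

The two calls return different words for the same topic, which is the kind of vocabulary drift a time-aware model is meant to expose.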

property n_timeslices

This attribute should be the number of timeslices found during training.

Returns

It should return the number of time slices found in corpus.

Return type

int
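Because time slices are indexed from 1, a loop over all of them runs from 1 to n_timeslices inclusive. The sketch below shows this with a hypothetical StubModel that exposes only the two documented members it needs; a trained model from this library would take its place.

```python
class StubModel:
    """Hypothetical stand-in exposing n_timeslices and get_topic_words."""

    n_timeslices = 3

    def get_topic_words(self, topic_id, i, n):
        # Fabricated words, just to make the time slice index visible.
        return [f"word{k}_t{i}" for k in range(n)]


def topic_words_over_time(model, topic_id, n):
    """Collect the top n words of one topic for every time slice, oldest first."""
    # Time slices are 1-indexed, so the range ends at n_timeslices inclusive.
    return [
        model.get_topic_words(topic_id, i, n)
        for i in range(1, model.n_timeslices + 1)
    ]


history = topic_words_over_time(StubModel(), topic_id=0, n=2)
print(history)
```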

prepare_args(i)[source]

This method should return a dictionary with all the values needed to call the pyLDAvis.prepare method.

Parameters

i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.

Returns

A dictionary ready to be passed to pyLDAvis.prepare.

Return type

dict[str, any]
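A minimal sketch of how the returned dictionary might be used, assuming its keys match the keyword names of pyLDAvis.prepare (topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency). That key layout is an assumption, not something this page confirms, and the helpers missing_prepare_keys and visualize are hypothetical.

```python
# Data arguments that pyLDAvis.prepare requires.
REQUIRED_KEYS = (
    "topic_term_dists",
    "doc_topic_dists",
    "doc_lengths",
    "vocab",
    "term_frequency",
)


def missing_prepare_keys(args):
    """Return the required pyLDAvis.prepare arguments absent from args."""
    return [key for key in REQUIRED_KEYS if key not in args]


def visualize(model, i):
    """Build the pyLDAvis data for the 1-indexed time slice i.

    Assumes prepare_args(i) returns keyword arguments for pyLDAvis.prepare.
    """
    import pyLDAvis  # imported lazily so the key check works without it

    args = model.prepare_args(i)
    missing = missing_prepare_keys(args)
    if missing:
        raise KeyError(f"prepare_args is missing: {missing}")
    return pyLDAvis.prepare(**args)


print(missing_prepare_keys({"vocab": ["a", "b"]}))
```

Checking the keys first gives a clearer error than the TypeError that a missing keyword argument would raise inside pyLDAvis.prepare.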

train()[source]

Train the DTM model.

Returns

Nothing.

Return type

None
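Putting the methods together, a typical session might look like the sketch below. The constructor call follows the signature documented above, but it sits in a function that is not executed here, since training requires the library and can be slow. The assumption that train() must be called before get_results() is ours, and StubModel is a hypothetical stand-in so that the sketch runs on its own.

```python
def train_and_summarize(model):
    """Train a model exposing train() and get_results(), then return the table."""
    model.train()  # documented to return None
    return model.get_results()


def build_lda_seq_model(corpus, dates):
    """Construct LdaSeqModel with the documented parameters (not run here)."""
    from lda_over_time.models.lda_seq_model import LdaSeqModel

    return LdaSeqModel(
        corpus=corpus,
        dates=dates,
        date_format="%Y/%m/%d",
        freq="1M",
        n_topics=5,  # illustrative value; the documented default is 100
    )


class StubModel:
    """Hypothetical stand-in so the sketch runs without the library."""

    def __init__(self):
        self.trained = False

    def train(self):
        self.trained = True

    def get_results(self):
        # Assumption: results are only available after training.
        if not self.trained:
            raise RuntimeError("call train() first")
        return [{"date": "2021-01", 0: 1.0}]


stub = StubModel()
print(train_and_summarize(stub))
```

With the library installed, train_and_summarize(build_lda_seq_model(corpus, dates)) would follow the same two steps on a real model.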

lda_over_time.models.temporal_lda_model

TemporalLdaModel is a simpler and faster temporal LDA that returns the proportion of main topics in each time slice.

Its main advantage over other models is speed. However, it may not cope well when the vocabulary used to describe a topic varies across the dataset.

class lda_over_time.models.temporal_lda_model.TemporalLdaModel(corpus: List[str], dates: List[str], date_format: str, freq: str, n_topics: int = 100, sep: Optional[str] = None, workers: Optional[int] = None, aggregator: str = 'average')[source]

Bases: DtmModelInterface

TemporalLdaModel is a simple temporal LDA model. It is faster, but it may not handle the evolution of topics well, because the vocabulary used for a topic may vary over time.

Parameters
  • corpus (list[str]) – Each item in the list is one document from the corpus.

  • dates (list[str]) – List of timestamps, one for each document in corpus; each date’s position must match that of its text.

  • date_format (str) – The date format used in dates, e.g. “%Y/%m/%d” for “YYYY/MM/DD” format. More info in the documentation of Python’s strftime() and strptime() format codes.

  • freq (str) – The frequency used to group texts, e.g. “1M15D” for a frequency of one month and 15 days. Useful notations: day = “D”; month = “M”; year = “Y”. More info in the pandas documentation.

  • n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.

  • sep (str, optional) – Separator used to split the text into words. The default is any whitespace.

  • workers (int, optional) – Number of workers (CPUs) to use. If not provided, it uses the total number of threads available on the running machine.

  • aggregator (str, optional) – Specifies how to aggregate all documents in a time slice and calculate its proportions. It can be either average, which calculates the average of each topic’s weight in each time slice, or an option that calculates the proportion of main topics in each time slice. Default is average.

Returns

Nothing

Return type

None
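The two aggregation strategies described for aggregator can be sketched as follows for a single time slice. This is an illustration of the arithmetic only, not the library’s code. The function names and document weights are hypothetical, and this page does not give the value that selects the second strategy.

```python
def average_weights(doc_topic_weights):
    """Average each topic's weight over all documents in the time slice."""
    n_docs = len(doc_topic_weights)
    n_topics = len(doc_topic_weights[0])
    return [
        sum(doc[topic] for doc in doc_topic_weights) / n_docs
        for topic in range(n_topics)
    ]


def main_topic_proportions(doc_topic_weights):
    """Share of documents whose heaviest (main) topic is each topic."""
    n_docs = len(doc_topic_weights)
    n_topics = len(doc_topic_weights[0])
    counts = [0] * n_topics
    for doc in doc_topic_weights:
        main_topic = max(range(n_topics), key=doc.__getitem__)
        counts[main_topic] += 1
    return [count / n_docs for count in counts]


# Hypothetical per-document topic weights for one time slice (two topics).
docs = [
    [0.75, 0.25],
    [0.625, 0.375],
    [0.125, 0.875],
    [1.0, 0.0],
]
print(average_weights(docs))
print(main_topic_proportions(docs))
```

Averaging keeps the contribution of secondary topics, while counting main topics assigns each document to its single heaviest topic, so the two can give different proportions for the same slice.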

get_results()[source]

This method should return a table representing the evolution of each topic over time.

Returns

A pandas DataFrame in which each row represents a time slice, with a date and one column per topic holding that topic’s weight in that period.

Return type

pd.core.frame.DataFrame

get_topic_words(topic_id, i, n)[source]

This method should return the top n words that best describe the topic in a specific time slice.

Parameters
  • topic_id (int) – The id of the desired topic.

  • i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice has index 1.

  • n (int) – The number of top words to return for the topic at the given time slice.

Returns

A list of the top n words that best describe the requested topic in the given time slice.

Return type

list[str]

property n_timeslices

This attribute should be the number of timeslices found during training.

Returns

It should return the number of time slices found in corpus.

Return type

int

prepare_args(i)[source]

This method should return a dictionary with all the values needed to call the pyLDAvis.prepare method.

Parameters

i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.

Returns

A dictionary ready to be passed to pyLDAvis.prepare.

Return type

dict[str, any]

train()[source]

Train the DTM model.

Returns

Nothing.

Return type

None