Models
lda_over_time.models.lda_seq_model
LdaSeqModel brings the functionality of Gensim’s LdaSeqModel to this library.
Its main advantage over the other models is that it can detect changes over time in the vocabulary used to describe each topic, which makes its topic classification more precise. The trade-off is that it is slower to run.
- class lda_over_time.models.lda_seq_model.LdaSeqModel(corpus: List[str], dates: List[str], date_format: str, freq: str, n_topics: int = 100, sep: Optional[str] = None, workers: Optional[int] = None)[source]
Bases:
DtmModelInterface
LdaSeqModel wraps Gensim’s LdaSeqModel, which models how the treatment of a topic varies over time, so it is better at detecting changes in the vocabulary used to discuss a topic.
Because of this, it may be more precise than PrevalenceModel, but it is slower.
- Parameters
corpus (list[str]) – List of documents’ texts.
dates (list[str]) – List of documents’ publishing dates.
date_format (str) – The date format used in dates, e.g. “%Y/%m/%d” for “YYYY/MM/DD” format. More info at documentation.
freq (str) – The frequency used to group texts, e.g. “1M15D” for a frequency of one month and 15 days. Useful notations: day = “D”; month = “M”; year = “Y”. More info in the pandas documentation.
n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.
sep (str, optional) – Separator used to split the text into words. The default is any whitespace.
workers (int, optional) – Number of workers (CPUs) to use. If not provided, all threads available on the running machine are used.
- Returns
Nothing
- Return type
None
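The constructor arguments depend on each other: every entry in dates must line up with a document in corpus, and every date must match date_format. The following is a minimal pre-flight check, assuming date_format follows Python’s strptime directives (as the “%Y/%m/%d” example suggests). The helper name check_inputs and the toy data are hypothetical and are not part of lda_over_time.

```python
from datetime import datetime
from typing import List


def check_inputs(corpus: List[str], dates: List[str], date_format: str) -> List[datetime]:
    """Hypothetical helper: validate corpus and dates before building a model."""
    if len(corpus) != len(dates):
        raise ValueError("corpus and dates must have the same length")
    # strptime raises ValueError if a date does not match date_format
    return [datetime.strptime(d, date_format) for d in dates]


# Toy data, made up for illustration
corpus = ["economy grows again", "new vaccine approved", "markets fall sharply"]
dates = ["2020/01/15", "2020/02/03", "2020/02/20"]
parsed = check_inputs(corpus, dates, "%Y/%m/%d")

# The same arguments could then be passed to the model (not run here), e.g.
# LdaSeqModel(corpus=corpus, dates=dates, date_format="%Y/%m/%d", freq="1M", n_topics=2)
```

Checking the inputs up front surfaces a mismatched date or a missing timestamp before the slow training step starts.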
- get_results()[source]
This method should return a table representing the evolution of each topic over time.
- Returns
A pandas DataFrame in which each row represents a time slice, with a date column and one column per topic holding that topic’s weight in the period.
- Return type
pd.core.frame.DataFrame
- get_topic_words(topic_id, i, n)[source]
This method returns the top n words that best describe the topic in a specific time slice.
- Parameters
topic_id (int) – The id of the desired topic.
i (int) – The position of the desired time slice in chronological order. The first (oldest) time slice is indexed by 1.
n (int) – The number of top words to return for the topic at the given time slice.
- Returns
A list of the top n words that best describe the requested topic in the given time slice.
- Return type
list[str]
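The 1-based time slice index is easy to get wrong, so here is a small sketch of the lookup that get_topic_words describes. It works on made-up word weights rather than a trained model. The function top_words and the data are hypothetical and only illustrate the indexing and the top-n selection, not the library’s implementation.

```python
from typing import Dict, List

# Toy data (made up): one word -> weight mapping per time slice, for one topic
topic_over_time: List[Dict[str, float]] = [
    {"virus": 0.30, "outbreak": 0.25, "china": 0.20, "vaccine": 0.05},
    {"vaccine": 0.40, "virus": 0.25, "dose": 0.20, "outbreak": 0.05},
]


def top_words(slices: List[Dict[str, float]], i: int, n: int) -> List[str]:
    """Return the n highest-weighted words of time slice i (1-indexed)."""
    if not 1 <= i <= len(slices):
        raise IndexError("time slice index out of range")
    weights = slices[i - 1]  # convert the 1-based index to a 0-based one
    return sorted(weights, key=lambda w: weights[w], reverse=True)[:n]


first = top_words(topic_over_time, i=1, n=2)
last = top_words(topic_over_time, i=2, n=2)
```

Comparing the lists for the oldest and the newest slice is one way to see how the vocabulary of a topic drifts over time.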
- property n_timeslices
This attribute should be the number of timeslices found during training.
- Returns
It should return the number of time slices found in corpus.
- Return type
int
- prepare_args(i)[source]
This method should return a dictionary with all the values needed to call pyLDAvis.prepare.
- Parameters
i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.
- Returns
A dictionary ready to be passed to pyLDAvis.
- Return type
dict[str, any]
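For orientation, the sketch below builds a dictionary with the five required arguments of pyLDAvis.prepare (topic_term_dists, doc_topic_dists, doc_lengths, vocab and term_frequency). It assumes, without confirmation from this page, that prepare_args returns a dictionary keyed the same way. All numbers are made up.

```python
from typing import Any, Dict

vocab = ["virus", "vaccine", "market"]

args: Dict[str, Any] = {
    # one row per topic, one column per vocabulary term; each row sums to 1
    "topic_term_dists": [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]],
    # one row per document, one column per topic; each row sums to 1
    "doc_topic_dists": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    # number of tokens in each document
    "doc_lengths": [12, 8, 10],
    "vocab": vocab,
    # corpus-wide count of each vocabulary term
    "term_frequency": [14, 7, 9],
}

# With pyLDAvis installed, the dictionary could be unpacked like this (not run here):
# import pyLDAvis
# vis = pyLDAvis.prepare(**args)
```

Under the same assumption, the result of prepare_args(i) for a trained model could be unpacked in the same way to visualize time slice i.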
lda_over_time.models.temporal_lda_model
TemporalLdaModel is a simpler and faster temporal LDA that returns the proportion of main topics in each time slice.
Its main advantage over the other models is speed. However, it may not cope well when the way a topic is presented varies, that is, when the vocabulary used to describe the topic changes across the dataset.
- class lda_over_time.models.temporal_lda_model.TemporalLdaModel(corpus: List[str], dates: List[str], date_format: str, freq: str, n_topics: int = 100, sep: Optional[str] = None, workers: Optional[int] = None, aggregator: str = 'average')[source]
Bases:
DtmModelInterface
TemporalLdaModel is a simple temporal LDA model. It is faster, but it may not handle the evolution of topics well, because the vocabulary used for a given topic may vary over time.
- Parameters
corpus (list[str]) – Each item in the list is one document from the corpus.
dates (list[str]) – List of timestamps, one for each document in corpus. Each date’s position should match that of its respective text.
date_format (str) – The date format used in dates, e.g. “%Y/%m/%d” for “YYYY/MM/DD” format. More info at documentation.
freq (str) – The frequency used to group texts, e.g. “1M15D” for a frequency of one month and 15 days. Useful notations: day = “D”; month = “M”; year = “Y”. More info in the pandas documentation.
n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.
sep (str, optional) – Separator used to split the text into words. The default is any whitespace.
workers (int, optional) – Number of workers (CPUs) to use. If not provided, all threads available on the running machine are used.
aggregator (str, optional) – Specifies how the documents in a time slice are aggregated to compute topic proportions. It can either be average, which computes the average of each topic’s weight over the time slice, or compute the proportion of main topics in each time slice. Default is average.
- Returns
Nothing
- Return type
None
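The two aggregation strategies described for aggregator can give noticeably different pictures of the same time slice. The sketch below is a plain-Python illustration on made-up per-document topic weights. The function names average_weights and main_topic_proportions are hypothetical, and the code is not the library’s implementation.

```python
from collections import Counter
from typing import List

# Toy per-document topic weights for ONE time slice (made-up numbers);
# each inner list holds the weights of topics 0, 1 and 2 for one document
doc_topic_weights: List[List[float]] = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]


def average_weights(docs: List[List[float]]) -> List[float]:
    """Mean weight of each topic over all documents in the slice."""
    n_docs = len(docs)
    return [sum(col) / n_docs for col in zip(*docs)]


def main_topic_proportions(docs: List[List[float]]) -> List[float]:
    """Share of documents whose highest-weighted topic is each topic."""
    n_topics = len(docs[0])
    counts = Counter(max(range(n_topics), key=lambda t: doc[t]) for doc in docs)
    return [counts[t] / len(docs) for t in range(n_topics)]


avg = average_weights(doc_topic_weights)
main = main_topic_proportions(doc_topic_weights)
```

Averaging keeps the contribution of secondary topics in every document, while counting main topics only credits the dominant topic of each document, so topics that are often present but rarely dominant shrink under the second strategy.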
- get_results()[source]
This method should return a table representing the evolution of each topic over time.
- Returns
A pandas DataFrame in which each row represents a time slice, with a date column and one column per topic holding that topic’s weight in the period.
- Return type
pd.core.frame.DataFrame
- get_topic_words(topic_id, i, n)[source]
This method should return the top n words that best describe the topic in a specific time slice.
- Parameters
topic_id (int) – The id of the desired topic.
i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.
n (int) – The number of top words to return for the topic at the given time slice.
- Returns
A list of the top n words that best describe the requested topic in the given time slice.
- Return type
list[str]
- property n_timeslices
This attribute should be the number of timeslices found during training.
- Returns
It should return the number of time slices found in corpus.
- Return type
int
- prepare_args(i)[source]
This method should return a dictionary with all the values needed to call pyLDAvis.prepare.
- Parameters
i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.
- Returns
A dictionary ready to be passed to pyLDAvis.
- Return type
dict[str, any]