Developer’s Guide

To create a new model for the LdaOverTime library, implement a new class that inherits from DtmModelInterface and implements all of its methods. Your model must also have the following properties, which are not yet documented on the interface:

corpus

dates

date_format

freq

n_topics

sep

workers
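
As a rough illustration of the properties listed above, the following sketch shows a skeleton model that stores each of them. The class name MyDtmModel, the length check, and the example value "1M" for freq are assumptions made for illustration. To keep the snippet self-contained it does not import the library; a real model would inherit from DtmModelInterface and implement its methods.

```python
import multiprocessing
from typing import List, Optional


# Hypothetical skeleton. In a real project this class would inherit from
# lda_over_time.models.dtm_model_interface.DtmModelInterface.
class MyDtmModel:
    """Stores the properties a model is expected to expose."""

    def __init__(
        self,
        corpus: List[str],
        dates: List[str],
        date_format: str,
        freq: str,
        n_topics: int = 100,
        sep: Optional[str] = None,
        workers: Optional[int] = None,
    ) -> None:
        # Assumption: each date must line up with exactly one document.
        if len(corpus) != len(dates):
            raise ValueError("corpus and dates must have the same length")
        self.corpus = corpus
        self.dates = dates
        self.date_format = date_format
        self.freq = freq
        self.n_topics = n_topics
        # None lets str.split() split on any run of whitespace.
        self.sep = sep
        # Fall back to the number of CPUs when no value is given.
        self.workers = workers or multiprocessing.cpu_count()


model = MyDtmModel(
    corpus=["first document", "second document"],
    dates=["2021-01-05", "2021-02-10"],
    date_format="%Y-%m-%d",
    freq="1M",  # assumed example value
)
```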

To build and publish, we use the hatch <https://hatch.pypa.io/latest/> package manager. Some useful commands are listed below; feel free to add more examples:

build: hatch build

publish (build first): hatch publish -r "https://upload.pypi.org/legacy/" -u "<username>" -a "<password>"

lda_over_time.models.dtm_model_interface

DtmModelInterface is an interface for creating new DTM modules. These modules should train the model and return values to the front end.

class lda_over_time.models.dtm_model_interface.DtmModelInterface[source]

Bases: object

DtmModelInterface defines the methods and attributes that a module should have in order to be passed to the front end.

Parameters

corpus (list[str]) – Each item in the list is one document from the corpus.
dates (list[str]) – List of timestamps, one for each document in corpus. Each date’s position should match that of its respective text.
date_format (str) – The date format used in dates.
freq (str) – The frequency used to group texts.
n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.
sep (str, optional) – Separator used to split each text into words. The default is to split on any blank space.
workers (int, optional) – Number of workers (CPUs) to use. If not provided, it will use the value of multiprocessing.cpu_count().

get_results()[source]

This method should return a table representing the evolution of each topic over time.

Returns: It must return a pandas DataFrame in which each row represents a time slice, with rows sorted by date. It must have one column data, and the remaining columns, numbered from 0 to k (where k is the number of topics minus 1), hold the weight of each topic in that period.
Return type: pandas.core.frame.DataFrame
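
To make the expected table layout more concrete, here is a minimal sketch that assembles one row per time slice using only the standard library. The helper name build_result_rows, the column label stored in DATE_COLUMN, and the sample weights are assumptions for illustration. Sorting relies on ISO-formatted date strings, for which lexicographic order matches chronological order.

```python
from typing import Dict, List, Union

Row = Dict[Union[str, int], Union[str, float]]

# Assumed label for the non-topic column; adjust it to what the front end expects.
DATE_COLUMN = "date"


def build_result_rows(
    slice_dates: List[str], topic_weights: List[List[float]]
) -> List[Row]:
    """Build one row per time slice: a date entry plus one weight per topic id."""
    rows: List[Row] = []
    # ISO date strings sort chronologically when compared as text.
    for date, weights in sorted(zip(slice_dates, topic_weights)):
        row: Row = {DATE_COLUMN: date}
        for topic_id, weight in enumerate(weights):
            row[topic_id] = weight  # topic columns are numbered from 0
        rows.append(row)
    return rows


rows = build_result_rows(
    ["2021-02-01", "2021-01-01"],
    [[0.3, 0.7], [0.6, 0.4]],
)
```

Passing rows to pandas.DataFrame(rows) would then give a frame with one row per time slice and one column per topic id.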

get_topic_words(topic_id, i, n)[source]

This method should return the top n words that best describe the topic in a specific time slice.

Parameters

topic_id (int) – The id of the desired topic.
i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice is indexed by 1.
n (int) – The number of words to return, that is, how many of the words that best describe the topic in the given time slice.

Returns

It returns a list of the top n words that best describe the requested topic in the given time slice.

Return type

list[str]
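
The 1-based time slice index is easy to get wrong, so the sketch below shows one possible way to implement the lookup. The word_weights structure and its toy values are hypothetical; a real model would read the weights from its trained state.

```python
from typing import Dict, List

# Hypothetical trained state: word_weights[time_slice][topic_id] maps word -> weight.
word_weights: List[List[Dict[str, float]]] = [
    [{"vote": 0.5, "party": 0.3, "law": 0.2}, {"goal": 0.6, "team": 0.4}],
    [{"vote": 0.2, "party": 0.1, "law": 0.7}, {"goal": 0.3, "team": 0.7}],
]


def get_topic_words(topic_id: int, i: int, n: int) -> List[str]:
    """Return the n highest-weighted words of topic_id in time slice i (1-based)."""
    if not 1 <= i <= len(word_weights):
        raise IndexError(f"time slice {i} is out of range")
    # Convert the 1-based slice position to a 0-based list index.
    weights = word_weights[i - 1][topic_id]
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n]


oldest_top_words = get_topic_words(topic_id=0, i=1, n=2)
newest_top_word = get_topic_words(topic_id=0, i=2, n=1)
```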

property n_timeslices

This attribute should return the number of time slices found.

Returns: It should return the number of time slices found in corpus.
Return type: int
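
As an illustration of how the number of time slices could be derived from dates and date_format, the sketch below groups documents by calendar month. Monthly grouping and the helper name count_monthly_timeslices are assumptions. This version counts only months that contain at least one document, and a real model may handle empty periods differently.

```python
from datetime import datetime
from typing import List


def count_monthly_timeslices(dates: List[str], date_format: str) -> int:
    """Count the distinct (year, month) buckets that contain at least one document."""
    buckets = set()
    for raw in dates:
        parsed = datetime.strptime(raw, date_format)
        buckets.add((parsed.year, parsed.month))
    return len(buckets)


n_timeslices = count_monthly_timeslices(
    ["2021-01-05", "2021-01-20", "2021-03-01"], "%Y-%m-%d"
)
```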

prepare_args(i)[source]

This method should return a dictionary with all the values needed to call the pyLDAvis.prepare method.

Parameters: i (int) – The position of the desired timeslice in chronological order, the first (oldest) time slice is indexed by 1.
Returns: It returns a dictionary ready to be passed to pyLDAvis.prepare.
Return type: dict[str, any]
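
The sketch below shows what such a dictionary might look like for a single time slice. The keys are the five required arguments of pyLDAvis.prepare (topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency). The helper name prepare_args_for_slice and the toy inputs are hypothetical, and a real model would take the distributions from its trained state for time slice i.

```python
from collections import Counter
from typing import Any, Dict, List


def prepare_args_for_slice(
    docs: List[List[str]],
    topic_term_dists: List[List[float]],
    doc_topic_dists: List[List[float]],
    vocab: List[str],
) -> Dict[str, Any]:
    """Bundle the values for one time slice under pyLDAvis.prepare's argument names."""
    counts = Counter(word for doc in docs for word in doc)
    return {
        "topic_term_dists": topic_term_dists,  # one row per topic, one column per term
        "doc_topic_dists": doc_topic_dists,    # one row per document, one column per topic
        "doc_lengths": [len(doc) for doc in docs],
        "vocab": vocab,
        "term_frequency": [counts[word] for word in vocab],
    }


args = prepare_args_for_slice(
    docs=[["vote", "law"], ["goal", "team", "goal"]],
    topic_term_dists=[[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.6, 0.4]],
    doc_topic_dists=[[1.0, 0.0], [0.0, 1.0]],
    vocab=["vote", "law", "goal", "team"],
)
```

With pyLDAvis installed, the result could then be unpacked as pyLDAvis.prepare(**args).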

train()[source]

This method should train the DTM model.

Returns: nothing
Return type: None