Developer’s Guide

To create a new model for the LdaOverTime library, implement a class that inherits from DtmModelInterface and implements all of its methods. Your model must also expose the following properties, which are not yet documented on the interface:

  • corpus

  • dates

  • date_format

  • freq

  • n_topics

  • sep

  • workers
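As a minimal sketch of how a model might expose these properties, the class below stores them in its constructor. The name MyDtmModel and the example value for freq are hypothetical. The block does not import lda_over_time so that it runs on its own; a real model would inherit from DtmModelInterface and implement its methods. The defaults mirror the ones described in the interface documentation.

```python
import multiprocessing


class MyDtmModel:  # hypothetical name; a real model would subclass DtmModelInterface
    """Stores the properties the library expects a model to expose."""

    def __init__(self, corpus, dates, date_format, freq,
                 n_topics=100, sep=None, workers=None):
        # Each date must line up with the text at the same position
        if len(corpus) != len(dates):
            raise ValueError("corpus and dates must have the same length")
        self.corpus = corpus
        self.dates = dates
        self.date_format = date_format
        self.freq = freq
        self.n_topics = n_topics
        self.sep = sep  # None makes str.split() split on any whitespace
        if workers is None:
            workers = multiprocessing.cpu_count()
        self.workers = workers


model = MyDtmModel(
    corpus=["first text", "second text"],
    dates=["2021-01-15", "2021-02-03"],
    date_format="%Y-%m-%d",
    freq="1M",  # assumed example value, not taken from the library
)
```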

To build and publish, we use the hatch <https://hatch.pypa.io/latest/> package manager. Some useful commands are listed below; feel free to add more examples:

  • build: hatch build

  • publish (build first): hatch publish -r "https://upload.pypi.org/legacy/" -u "<username>" -a "<password>"

lda_over_time.models.dtm_model_interface

DtmModelInterface is an interface for creating new DTM modules. These modules should train the model and return values to the front end.

class lda_over_time.models.dtm_model_interface.DtmModelInterface[source]

Bases: object

DtmModelInterface defines the methods and attributes that a module must have in order to be passed to the front end.

Parameters
  • corpus (list[str]) – Each item from the list is one document from corpus.

  • dates (list[str]) – List of timestamps, one for each document in corpus; each date’s position should match that of its respective text.

  • date_format (str) – The date format used in dates.

  • freq (str) – The frequency used to group texts.

  • n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.

  • sep (str, optional) – Separator used to split the text into words. The default is any whitespace.

  • workers (int, optional) – Number of workers (CPUs) to use. If not provided, it will use the value of multiprocessing.cpu_count().

get_results()[source]

This method should return a table representing the evolution of each topic over time.

Returns

It must return a pandas DataFrame in which each row represents a time slice and rows are sorted by date. It must have one column named date, and the remaining columns, numbered from 0 to n_topics - 1, hold the weight of each topic in that period.

Return type

pandas.core.frame.DataFrame
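The layout described above can be sketched with plain Python rows. The weights, the number of topics, and the column name "date" are assumptions made for illustration. pandas is not imported so that the block runs without dependencies; passing rows to pandas.DataFrame(rows) would give a DataFrame with the same columns.

```python
from datetime import datetime

N_TOPICS = 3  # hypothetical number of topics

# Hypothetical per-slice topic weights, deliberately out of chronological order
raw = {
    "2021-02": [0.2, 0.5, 0.3],
    "2021-01": [0.6, 0.1, 0.3],
}

rows = []
for label, weights in raw.items():
    row = {"date": datetime.strptime(label, "%Y-%m")}
    # Topic columns are numbered 0 .. N_TOPICS - 1
    row.update({topic_id: weight for topic_id, weight in enumerate(weights)})
    rows.append(row)

# Rows must end up sorted by date, oldest slice first
rows.sort(key=lambda r: r["date"])
columns = list(rows[0].keys())
```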

get_topic_words(topic_id, i, n)[source]

This method should return the top n words that best describe the topic in a specific time slice.

Parameters
  • topic_id (int) – The id of the desired topic.

  • i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice is indexed by 1.

  • n (int) – The number of top words to return for the topic in the given time slice.

Returns

It returns a list of the top n words that best describe the requested topic in the given time slice.

Return type

list[str]
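A minimal sketch of this lookup, assuming the trained model keeps one word-to-weight mapping per topic for every time slice. The function name top_words and the word_weights structure are hypothetical; the point shown is the conversion from the 1-based slice index i to a 0-based list position.

```python
def top_words(word_weights, topic_id, i, n):
    """Return the n highest-weighted words of a topic in time slice i (1-based)."""
    if i < 1 or i > len(word_weights):
        raise IndexError("time slice index must be between 1 and n_timeslices")
    weights = word_weights[i - 1][topic_id]  # convert 1-based slice to 0-based
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical weights: one list entry per time slice, one dict per topic
word_weights = [
    [{"vote": 0.5, "party": 0.3, "tax": 0.2}, {"goal": 0.7, "team": 0.3}],
    [{"vote": 0.2, "party": 0.1, "tax": 0.7}, {"goal": 0.4, "team": 0.6}],
]

print(top_words(word_weights, topic_id=0, i=1, n=2))  # ['vote', 'party']
print(top_words(word_weights, topic_id=0, i=2, n=2))  # ['tax', 'vote']
```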

property n_timeslices

This attribute should return the number of timeslices found.

Returns

It should return the number of time slices found in corpus.

Return type

int
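One way to picture the count is to parse each date with date_format and count the distinct periods. The sketch below hard-codes monthly grouping and counts only months that contain at least one document; both choices are assumptions, since how freq is interpreted is not specified here. The helper name count_monthly_slices is hypothetical.

```python
from datetime import datetime


def count_monthly_slices(dates, date_format):
    """Count distinct (year, month) buckets among the given date strings."""
    parsed = [datetime.strptime(d, date_format) for d in dates]
    return len({(p.year, p.month) for p in parsed})


dates = ["15/01/2021", "28/01/2021", "03/02/2021", "10/04/2021"]
n_timeslices = count_monthly_slices(dates, "%d/%m/%Y")
print(n_timeslices)  # 3 (January, February, April)
```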

prepare_args(i)[source]

This method should return a dictionary with all the values needed to call the pyLDAvis.prepare method.

Parameters

i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice is indexed by 1.

Returns

It returns a dictionary ready to be passed to pyLDAvis.prepare.

Return type

dict[str, any]
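As a rough sketch of what such a dictionary could contain, the helper below assembles the five required arguments of pyLDAvis.prepare (topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency) for the documents of a single time slice. The helper name build_prepare_args and all input values are hypothetical, and how a real model selects the data for slice i is left out.

```python
def build_prepare_args(topic_term, doc_topic, docs, vocab):
    """Assemble keyword arguments for pyLDAvis.prepare from plain lists."""
    tokenized = [doc.split() for doc in docs]
    return {
        "topic_term_dists": topic_term,  # one row per topic, one column per term
        "doc_topic_dists": doc_topic,    # one row per document, one column per topic
        "doc_lengths": [len(tokens) for tokens in tokenized],
        "vocab": vocab,
        # Total number of occurrences of each vocabulary term in the slice
        "term_frequency": [
            sum(tokens.count(term) for tokens in tokenized) for term in vocab
        ],
    }


vocab = ["goal", "team", "vote"]
docs = ["goal team goal", "vote vote team"]  # hypothetical documents of one slice
args = build_prepare_args(
    topic_term=[[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]],
    doc_topic=[[0.9, 0.1], [0.2, 0.8]],
    docs=docs,
    vocab=vocab,
)
```

With pyLDAvis installed, the result could then be unpacked as pyLDAvis.prepare(**args).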

train()[source]

This method should train the DTM model.

Returns

nothing

Return type

None