Developer’s Guide
To create a new model for the LdaOverTime library, implement a class that inherits from DtmModelInterface and implements all of its methods. Your model must also expose the following properties, which are not yet documented on the interface:
corpus
dates
date_format
freq
n_topics
sep
workers
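The property list above can be sketched as a constructor. This is only a minimal illustration: ExampleModel is a hypothetical name, the block does not import the library so that it runs on its own, and a real model would inherit from DtmModelInterface and implement its methods. The defaults follow the parameter descriptions of the interface; the eager date parsing and the freq value are assumptions.

```python
import multiprocessing
from datetime import datetime


class ExampleModel:
    """Hypothetical model; a real one would inherit from DtmModelInterface."""

    def __init__(self, corpus, dates, date_format, freq,
                 n_topics=100, sep=None, workers=None):
        if len(corpus) != len(dates):
            raise ValueError("corpus and dates must have the same length")
        self.corpus = corpus
        self.dates = dates
        self.date_format = date_format
        self.freq = freq
        self.n_topics = n_topics
        # sep=None makes str.split() split on any run of whitespace.
        self.sep = sep
        # Fall back to every available CPU when workers is not given.
        self.workers = workers or multiprocessing.cpu_count()
        # Assumption: parse dates eagerly so a wrong date_format fails early.
        self.parsed_dates = [datetime.strptime(d, date_format) for d in dates]


model = ExampleModel(
    corpus=["first document", "second document"],
    dates=["2020-01-15", "2020-02-03"],
    date_format="%Y-%m-%d",
    freq="1M",  # placeholder; the accepted values depend on the library
)
```

Validating that corpus and dates have the same length up front is a design choice of this sketch, since each date must line up with its text.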
To build and publish the package, we use the hatch <https://hatch.pypa.io/latest/> package manager. Some useful commands are listed below; feel free to add more examples:
build: hatch build
publish (build first): hatch publish -r "https://upload.pypi.org/legacy/" -u "<username>" -a "<password>"
lda_over_time.models.dtm_model_interface
DtmModelInterface is an interface for creating new DTM modules. These modules should train the model and return values to the front end.
- class lda_over_time.models.dtm_model_interface.DtmModelInterface[source]
Bases: object
DtmModelInterface defines the methods and attributes that a module must have in order to be passed to the front end.
- Parameters
corpus (list[str]) – Each item in the list is one document of the corpus.
dates (list[str]) – List of timestamps, one per document in corpus. Each date's position must match the position of its text.
date_format (str) – The date format used in dates.
freq (str) – The frequency used to group texts.
n_topics (int, optional) – Number of topics that the DTM model should find. The default value is 100.
sep (str, optional) – Separator used to split each text into words. The default is any whitespace.
workers (int, optional) – Number of workers (CPUs) to use. If not provided, the value of multiprocessing.cpu_count() is used.
- get_results()[source]
This method should return a table representing the evolution of each topic over time.
- Returns
It must return a pandas DataFrame in which each row represents one time slice and rows are sorted by date. It must have one date column, and the remaining columns, numbered from 0 to k (number of topics - 1), hold the weight of each topic in that period.
- Return type
pandas.core.frame.DataFrame
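A minimal sketch of a table with this shape is shown below. The weights and dates are made-up illustration data, and the column name "date" is an assumption, since the text above does not spell out the exact label.

```python
import pandas as pd

# Made-up weights: one row per time slice, one value per topic.
slice_dates = ["2020-02-01", "2020-01-01", "2020-03-01"]
weights = [
    [0.5, 0.5],
    [0.7, 0.3],
    [0.2, 0.8],
]

n_topics = len(weights[0])

# Topic columns are labelled with integers 0 .. n_topics - 1.
results = pd.DataFrame(weights, columns=list(range(n_topics)))

# Assumption: the date column is called "date" and comes first.
results.insert(0, "date", pd.to_datetime(slice_dates))

# Rows must be in chronological order.
results = results.sort_values("date").reset_index(drop=True)
```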
- get_topic_words(topic_id, i, n)[source]
This method should return the top n words that best describe the topic in a specific time slice.
- Parameters
topic_id (int) – The id of the desired topic.
i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice is indexed by 1.
n (int) – How many of the words that best describe the topic at that time slice should be returned.
- Returns
It returns a list of the top n words that best describe the requested topic in the given time slice.
- Return type
list[str]
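The 1-based slice index is easy to get wrong, so here is a minimal sketch of the lookup. The word_weights structure and its contents are hypothetical; a real model would read the weights from whatever it trained.

```python
import heapq

# Hypothetical storage: word_weights[topic_id] is a list of time slices,
# each mapping word -> weight for that topic.
word_weights = {
    0: [
        {"vote": 0.4, "party": 0.3, "tax": 0.2, "law": 0.1},
        {"vote": 0.25, "party": 0.1, "tax": 0.5, "law": 0.15},
    ],
}


def get_topic_words(topic_id, i, n):
    """Return the n highest-weighted words of a topic in the i-th time slice."""
    slices = word_weights[topic_id]
    if not 1 <= i <= len(slices):
        raise IndexError("time slice index out of range (slices start at 1)")
    weights = slices[i - 1]  # convert the 1-based position to a list index
    return heapq.nlargest(n, weights, key=weights.get)


print(get_topic_words(0, 1, 2))  # ['vote', 'party']
print(get_topic_words(0, 2, 2))  # ['tax', 'vote']
```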
- property n_timeslices
This attribute should return the number of timeslices found.
- Returns
It should return the number of time slices found in the corpus.
- Return type
int
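As a rough illustration, the property could be backed by the set of distinct periods seen in the dates. SliceCounter is a hypothetical helper, and grouping by calendar month is an assumption made for this sketch; a real model would group according to freq.

```python
from datetime import datetime


class SliceCounter:
    """Hypothetical helper that only illustrates the n_timeslices property."""

    def __init__(self, dates, date_format):
        parsed = [datetime.strptime(d, date_format) for d in dates]
        # Assumption: one time slice per calendar month that contains a document.
        self._slices = sorted({(d.year, d.month) for d in parsed})

    @property
    def n_timeslices(self):
        return len(self._slices)


counter = SliceCounter(["2020-01-15", "2020-01-20", "2020-03-02"], "%Y-%m-%d")
print(counter.n_timeslices)  # 2
```

Under this assumption, months without any document (February in the example) do not count as time slices.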
- prepare_args(i)[source]
This method should return a dictionary with all the values needed to call the pyLDAvis.prepare method.
- Parameters
i (int) – The position of the desired time slice in chronological order; the first (oldest) time slice is indexed by 1.
- Returns
It returns a dictionary ready to be passed to pyLDAvis.prepare.
- Return type
dict[str, any]
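A minimal sketch of such a dictionary for a single time slice is shown below. The keys are the five required arguments of pyLDAvis.prepare; build_prepare_args and all of the data are hypothetical, and the block does not import pyLDAvis so that it runs on its own.

```python
from collections import Counter


def build_prepare_args(vocab, docs, topic_term_dists, doc_topic_dists):
    """Collect the required pyLDAvis.prepare arguments for one time slice."""
    counts = Counter(word for doc in docs for word in doc)
    return {
        "topic_term_dists": topic_term_dists,  # one row per topic
        "doc_topic_dists": doc_topic_dists,    # one row per document
        "doc_lengths": [len(doc) for doc in docs],
        "vocab": vocab,
        # Term frequencies follow the order of vocab.
        "term_frequency": [counts[word] for word in vocab],
    }


# Made-up slice: 2 topics, 3 vocabulary terms, 2 tokenized documents.
vocab = ["vote", "tax", "law"]
docs = [["vote", "tax", "vote"], ["law", "tax"]]
topic_term_dists = [[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]]
doc_topic_dists = [[0.8, 0.2], [0.3, 0.7]]

args = build_prepare_args(vocab, docs, topic_term_dists, doc_topic_dists)
```

With pyLDAvis installed, the dictionary could then be unpacked as pyLDAvis.prepare(**args).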