unipy_nlp.analyze package

Submodules

unipy_nlp.analyze.topic_modeling module

Topic Modeling(LDA) & Word2Vec.

class unipy_nlp.analyze.topic_modeling.TopicModeler(sentence_list, tokenized_sentence_list)

Bases: object

Topic modeling via LDA (Latent Dirichlet Allocation).

Takes raw sentences together with their tokenized form.

Parameters
  • sentence_list (list) – A list of raw sentences.

  • tokenized_sentence_list (list) – A nested list of tokenized sentences.

After `__init__`
self.sentences: list

A list of raw sentences.

self.tokenized: list

A nested list of tokenized sentences.

self.corpora_dict: gensim.corpora.dictionary.Dictionary

A token dictionary from a given text.

self.bow_corpus_idx: list

A nested list, which contains converted documents into a list of token indices.

self.bow_corpus_doc: list

A nested list, which contains converted documents into a list of token words.
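
As a rough illustration of how these attributes relate, the sketch below builds a token-to-index mapping from tokenized sentences, converts each document into a list of token indices, and converts it back into token words. It uses plain Python rather than gensim, and the helper names (`build_token_index`, `to_indices`, `to_words`) are hypothetical. The actual attributes are built with `gensim.corpora.dictionary.Dictionary` and may use a different layout, for example bag-of-words count pairs.

```python
# Minimal sketch, not the library's implementation: hypothetical helpers
# mimicking a token dictionary and the two document conversions.

def build_token_index(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_indices(tokenized_docs, token2id):
    """Convert every document into a list of token ids."""
    return [[token2id[token] for token in doc] for doc in tokenized_docs]


def to_words(indexed_docs, token2id):
    """Convert every list of token ids back into token words."""
    id2token = {idx: token for token, idx in token2id.items()}
    return [[id2token[idx] for idx in doc] for doc in indexed_docs]


tokenized = [["topic", "model"], ["topic", "word", "vector"]]
token2id = build_token_index(tokenized)
corpus_idx = to_indices(tokenized, token2id)   # documents as token ids
corpus_doc = to_words(corpus_idx, token2id)    # documents as token words
```
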

After `train_lda` or `load_lda`
self.best_lda_model: dict

A dict containing the best model and its coherence value: {‘coherence’: int, ‘model’: gensim.models.ldamulticore.LdaMulticore}

self.lda_model_list: list

A nested list of [topic_num, model, coherence_value]

self.lda_model_dict:

A nested dict of the form {topic_num: {‘coherence’: int, ‘model’: gensim.models.ldamulticore.LdaMulticore}}

self.trained: bool

True if the model has been trained or properly loaded.

After `visualize_lda_to_html`
self.selected_topic_num: int

The selected topic number.

self.selected_model: gensim.models.ldamulticore.LdaMulticore

self.vis_prepared: pyLDAvis.prepared_data.PreparedData

self.total_terms_df

The tinfo table, with ‘Default’ removed.

self.top_relevant_terms_df: pandas.DataFrame

A rank table of Category.

self.r_adj_score_df: pandas.DataFrame

A tinfo table that takes saliency and relevance scores into account.

self.bow_score_list: list

Scores of each sentence, based on bow_corpus, clipped to the range (0, 3).

After `estimate_topics_by_documents` or `load_estimated`
self.dominant_topic_estimation_df: pandas.DataFrame

A DataFrame containing the columns [‘lda_prob’, ‘dominant_topic’, ‘contribution’, ‘topic_keywords’]

self.topic_freq_df: pandas.DataFrame

A rank table by topic frequency.

After `get_representitive_documents` or `load_representitive_documents`

self.representitive_docs: pandas.DataFrame

After `get_representitive_candidates`

Returns repr_sentences, repr_bow_corpus_doc, repr_bow_corpus_idx.

train_lda()
save_lda()
load_lda()
pick_best_lda_topics()
visualize_lda_to_html()
estimate_topics_by_documents()
get_representitive_documents()

See also

Preprocessing

unipy_nlp.preprocessing.Preprocessor

POS-Tagging

konlpy.tag.Mecab

Byte-Pair

sentencepiece

Examples

>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
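>>> sentence_for_pos_list = [
...     # "The rose of Sharon has bloomed."
...     "무궁화 꽃이 피었습니다.",
...     # "We were born on this land with the historic mission of national revival."
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]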
>>> tokenized = prep.pos_tag(
...     input_text=sentence_for_pos_list,
...     tag_type=[
...         '체언 접두사', '명사', '한자', '외국어',
...         '수사', '구분자',
...         '동사',
...         '부정 지정사', '긍정 지정사',
...     ]
... )
>>> print(tokenized)
[['무궁화'], ['우리', '민족중흥', '역사', '사명']]
>>> tpm = utpm.TopicModeler(sentence_for_pos_list, tokenized)
>>> tpm.train_lda(
...     num_topic=5,
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.save_lda(savepath='data/_tmp_dump/topic_modeling', affix='lda')
>>> tpm.load_lda('data/_tmp_dump/topic_modeling')
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.visualize_lda_to_html(
...     7,
...     top_n=10,
...     r_normalized=False,
...     relevence_lambda_val=.6,
...     workers_n=8,
...     random_seed=1,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
...     # save_type='html',  # {'html', 'json'}
...     save_relevent_terms_ok=True,
...     save_html_ok=True,
...     display_ok=False,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     # sentence_list=tokenized,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )

estimate_topics_by_documents(target_topic_num, random_seed=1, save_ok=True, savepath='./', filename_affix='lda')

Get the dominant topic and its contribution score for each document.

Parameters
  • target_topic_num (int) – A topic number of LDA model to use.

  • random_seed (int (default: 1)) – A random seed number.

  • save_ok (bool (default: True)) – Whether to save the returned pandas.DataFrame.

  • savepath (str (default: ‘./’)) – A dirpath to save the topic-labeled sentences.

  • filename_affix (str (default: ‘lda’)) – An affix of filename to save the topic-labeled sentences.

Returns

  • dominant_topic_estimation_df (pandas.DataFrame) – Topic-labeled given(trained) sentences.

  • topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
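
The idea of a dominant topic can be sketched in plain Python as follows. This is only an illustration under assumptions: the per-document probabilities are made up, the helper `dominant_topic` is hypothetical, and the library's own computation may differ in detail. The dominant topic is taken to be the most probable one, and its probability is used as the contribution score.

```python
from collections import Counter

# Hypothetical per-document topic distributions as (topic_id, probability)
# pairs; the values are made up for illustration.
doc_topic_probs = [
    [(0, 0.10), (1, 0.75), (2, 0.15)],
    [(0, 0.60), (1, 0.30), (2, 0.10)],
    [(0, 0.20), (1, 0.55), (2, 0.25)],
]


def dominant_topic(topic_probs):
    """Return (topic_id, probability) of the most probable topic."""
    return max(topic_probs, key=lambda pair: pair[1])


rows = []
for probs in doc_topic_probs:
    topic_id, contribution = dominant_topic(probs)
    rows.append({"dominant_topic": topic_id, "contribution": contribution})

# Rank topics by how many documents they dominate.
topic_freq = Counter(row["dominant_topic"] for row in rows).most_common()
```
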

get_best_n_terms()

get_representitive_candidates(len_range=(10, 30))

Get representative candidates, filtered by length. Intended for use with unipy_nlp.network_plot.

Parameters

len_range (list or tuple (default: (10, 30))) – The length range used as the threshold for selecting candidates.

Returns

  • repr_sentences (list) – A list of sentences.

  • repr_bow_corpus_doc (list) – A nested list, which contains converted documents into a list of token words.

  • repr_bow_corpus_idx (list) – A nested list, which contains converted documents into a list of token indices.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> (repr_sentences,
...  repr_bow_corpus_doc,
...  repr_bow_corpus_idx) = tpm.get_representitive_candidates(
...     len_range=(12, 30),
... )
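
A length filter of this kind might look like the sketch below. It assumes that length means the character count of the raw sentence and that both bounds are inclusive; neither assumption is confirmed by this page, and `filter_by_length` is a hypothetical helper rather than part of the library.

```python
def filter_by_length(sentences, corpus_doc, corpus_idx, len_range=(10, 30)):
    """Keep entries whose sentence length lies within len_range (inclusive).

    Assumption: length is the number of characters in the raw sentence.
    """
    low, high = len_range
    kept = [
        (sent, doc, idx)
        for sent, doc, idx in zip(sentences, corpus_doc, corpus_idx)
        if low <= len(sent) <= high
    ]
    if not kept:
        return [], [], []
    # Split the kept triples back into three parallel lists.
    repr_sentences, repr_doc, repr_idx = (list(col) for col in zip(*kept))
    return repr_sentences, repr_doc, repr_idx


sentences = ["too short", "a sentence of medium length", "x" * 40]
corpus_doc = [["short"], ["sentence", "medium", "length"], ["x"]]
corpus_idx = [[0], [1, 2, 3], [4]]
result = filter_by_length(sentences, corpus_doc, corpus_idx, len_range=(10, 30))
```
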

get_representitive_documents(target_topic_num, len_range=(10, 30), top_n=10, save_ok=True, savepath='./', filename_affix='lda')

List the most representative documents for each topic.

Parameters
  • target_topic_num (int) – A topic number of LDA model to use.

  • len_range (list or tuple (default: (10, 30))) – The length range used as the threshold for selecting candidates.

  • top_n (int (default: 10)) – The number of documents to list per topic.

  • save_ok (bool (default: True)) – An option to save.

  • savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.

  • filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.

Returns

reordered – Representative documents, grouped by topic and ordered by rank.

Return type

pandas.DataFrame

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
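
The grouping and ranking step can be sketched in plain Python. This is an illustration only: it assumes documents are ranked within each topic by their contribution score, the rows are made up, and `top_documents_by_topic` is a hypothetical helper, not the library's method.

```python
from collections import defaultdict


def top_documents_by_topic(rows, top_n=10):
    """Group rows by topic and keep the top_n highest-contribution rows each."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["dominant_topic"]].append(row)
    ranked = {}
    for topic in sorted(grouped):
        docs = sorted(grouped[topic], key=lambda r: r["contribution"], reverse=True)
        ranked[topic] = docs[:top_n]
    return ranked


# Made-up topic-labeled documents for illustration.
rows = [
    {"sentence": "doc a", "dominant_topic": 1, "contribution": 0.75},
    {"sentence": "doc b", "dominant_topic": 0, "contribution": 0.60},
    {"sentence": "doc c", "dominant_topic": 1, "contribution": 0.90},
    {"sentence": "doc d", "dominant_topic": 1, "contribution": 0.55},
]
ranked = top_documents_by_topic(rows, top_n=2)
```
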

load_estimated(target_topic_num, savepath='./', filename_affix='lda')

Load the result of self.estimate_topics_by_documents.

Parameters
  • target_topic_num (int) – A topic number of LDA model to use.

  • savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.

  • filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.

Returns

  • dominant_topic_estimation_df (pandas.DataFrame) – Topic-labeled given(trained) sentences.

  • topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_labeled, topic_freq = tpm.load_estimated(
...     target_topic_num=7,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )

load_lda(filepath)

Load trained lda model(s).

Parameters

filepath (str) – A directory path to load from. It should contain the .ldamodel file(s).

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.load_lda('data/_tmp_dump/topic_modeling')

load_representitive_documents(target_topic_num, top_n=10, savepath='./', filename_affix='lda')

Load the result of self.get_representitive_documents.

Parameters
  • target_topic_num (int) – A topic number of LDA model to use.

  • top_n (int (default: 10)) – The number of documents to list per topic. The upper bound depends on how many documents were saved.

  • savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.

  • filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.

Returns

  • dominant_topic_estimation_df (pandas.DataFrame) – Topic-labeled given(trained) sentences.

  • topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
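>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.load_representitive_documents(
...     target_topic_num=7,
...     top_n=10,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )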

pick_best_lda_topics(num_topic_list=[5, 7, 10, 12, 15, 17, 20], lda_type='default', workers_n=2, random_seed=1)

Train multiple LDA topic models, one for each given topic number.

Parameters
  • num_topic_list (list (default: [5, 7, 10, 12, 15, 17, 20])) – A list of topic numbers.

  • lda_type (str (default: ‘default’), one of {‘default’, ‘hdp’, ‘mallet’}) – The type of LDA model. Use ‘default’ for now; the other options are a work in progress.

  • workers_n (int (default: 2)) – The number of CPU cores used for training.

  • random_seed (int (default: 1)) – A random seed number.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
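
How the trained models might be organized and the best one chosen can be sketched as below. Placeholder strings stand in for trained models, the coherence values are made up, and the selection rule (highest coherence wins) is an assumption based on the attribute descriptions of `lda_model_list`, `lda_model_dict`, and `best_lda_model`, not a confirmed description of the implementation.

```python
# Minimal sketch with placeholder strings standing in for trained models.
# Each entry follows the documented layout [topic_num, model, coherence_value];
# the coherence values are illustrative only.
model_coh_list = [
    [5, "model_5", 0.41],
    [7, "model_7", 0.48],
    [10, "model_10", 0.44],
]

# Nested dict keyed by topic number, as described for `lda_model_dict`.
lda_model_dict = {
    topic_num: {"coherence": coherence, "model": model}
    for topic_num, model, coherence in model_coh_list
}

# Assumption: "best" means the highest coherence value.
best_topic_num = max(lda_model_dict, key=lambda k: lda_model_dict[k]["coherence"])
best_lda_model = lda_model_dict[best_topic_num]
```
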

save_lda(savepath='./', affix='lda')

Save trained lda model(s).

Parameters
  • savepath (str (default: ‘./’)) – A dirpath to save.

  • affix (str (default: ‘lda’)) – An affix of the filename. The file extension will be .ldamodel.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.save_lda(savepath='data/_tmp_dump/topic_modeling', affix='lda')

train_lda(num_topic=5, lda_type='default', workers_n=2, random_seed=1)

Train a single LDA Topic Model.

Parameters
  • num_topic (int (default: 5)) – The number of topics.

  • lda_type (str (default: ‘default’), one of {‘default’, ‘hdp’, ‘mallet’}) – The type of LDA model. Use ‘default’ for now; the other options are a work in progress.

  • workers_n (int (default: 2)) – The number of CPU cores used for training.

  • random_seed (int (default: 1)) – A random seed number.

Example

>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
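>>> sentence_for_pos_list = [
...     # "The rose of Sharon has bloomed."
...     "무궁화 꽃이 피었습니다.",
...     # "We were born on this land with the historic mission of national revival."
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]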
>>> tokenized = prep.pos_tag(
...     input_text=sentence_for_pos_list,
...     tag_type=[
...         '체언 접두사', '명사', '한자', '외국어',
...         '수사', '구분자',
...         '동사',
...         '부정 지정사', '긍정 지정사',
...     ]
... )
>>> tpm = utpm.TopicModeler(sentence_for_pos_list, tokenized)
>>> tpm.train_lda(
...     num_topic=5,
...     workers_n=8,
...     random_seed=1,
... )

visualize_lda_to_html(target_topic_num, top_n=10, r_normalized=False, relevence_lambda_val=0.6, workers_n=2, random_seed=1, savepath='./', filename_affix='lda', save_relevent_terms_ok=True, save_html_ok=True, display_ok=False)

Run pyLDAvis.prepare and compute adjusted term scores for each topic, based on saliency and relevance.

Parameters
  • target_topic_num (int) – A topic number of LDA model to visualize.

  • top_n (int (default: 10)) – The number of most relevant terms to keep per topic.

  • r_normalized (bool (default: False)) – Use normalized probabilities when True (not recommended in most cases).

  • relevence_lambda_val (float (default: 0.6)) – The lambda value (weight ratio) used to calculate relevance.

  • workers_n (int (default: 2)) – The number of CPU cores used for the calculation (pyLDAvis.prepare).

  • random_seed (int (default: 1)) – A random seed number.

  • savepath (str (default: ‘./’)) – A dirpath to save pyLDAvis output or other pandas.DataFrame objects.

  • filename_affix (str (default: ‘lda’)) – An affix of filename to save pyLDAvis html or json.

  • save_relevent_terms_ok (bool (default: True)) – An option to save pandas.DataFrame of top_relevent_terms.

  • save_html_ok (bool (default: True)) – An option to save html.

  • display_ok (bool (default: False)) – Call pyLDAvis.display when it is True.

References

Saliency:

Chuang, J., Manning, C. D., Heer, J., 2012. Termite: Visualization techniques for assessing textual topic models.

Relevance:

Sievert, C., Shirley, K., 2014. LDAvis: A method for visualizing and interpreting topics.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic=5,
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.visualize_lda_to_html(
...     7,
...     top_n=10,
...     r_normalized=False,
...     relevence_lambda_val=.6,
...     workers_n=8,
...     random_seed=1,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
...     save_relevent_terms_ok=True,
...     save_html_ok=True,
...     display_ok=False,
... )
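
For reference, the relevance measure from the LDAvis paper cited above can be written out directly. The sketch below is standalone Python with made-up probabilities. It shows the published formula, not this method's internals, and the function name `relevance` is hypothetical. How `relevence_lambda_val` is applied inside the library is assumed to follow the same definition.

```python
import math


def relevance(p_w_given_t, p_w, lambda_val=0.6):
    """Relevance of a term w to a topic t (Sievert & Shirley, 2014).

    r(w, t | lambda) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))
    """
    return (lambda_val * math.log(p_w_given_t)
            + (1.0 - lambda_val) * math.log(p_w_given_t / p_w))


# Made-up probabilities: term -> (p(w|t), p(w)).
term_probs = {
    "common": (0.05, 0.05),      # frequent everywhere, not topic-specific
    "specific": (0.04, 0.005),   # slightly rarer in the topic, rare overall
}

# Rank terms within the topic by relevance at lambda = 0.6.
ranked_terms = sorted(
    term_probs,
    key=lambda term: relevance(*term_probs[term], lambda_val=0.6),
    reverse=True,
)
```

With lambda set to 1.0 the ranking reduces to ordering by p(w|t) alone, so lower lambda values give more weight to terms that are distinctive for the topic.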

unipy_nlp.analyze.word2vec module

Word2Vec.

class unipy_nlp.analyze.word2vec.Word2Vec(tokenized_sentence_list)

Bases: object

get_similar(words, top_n=2)
load_w2v(filepath)
save_tensorboard(dirpath=None)
save_w2v(filepath)
train_w2v(size=70, window=4, min_count=10, negative=16, workers=8, iter=50, sg=1)

Module contents

Topic Modeling(LDA) & Word2Vec.