unipy_nlp.analyze package

Submodules

unipy_nlp.analyze.topic_modeling module

Topic Modeling(LDA) & Word2Vec.

class unipy_nlp.analyze.topic_modeling.TopicModeler(sentence_list, tokenized_sentence_list)

Bases: object

Topic modeling via LDA (Latent Dirichlet Allocation).

Takes raw sentences and their tokenized counterparts as input.

Parameters

sentence_list (list) – A list of raw sentences.
tokenized_sentence_list (list) – A nested list of tokenized sentences.

After `__init__`

self.sentences: list: A list of raw sentences.
self.tokenized: list: A nested list of tokenized sentences.
self.corpora_dict: gensim.corpora.dictionary.Dictionary: A token dictionary built from the given text.
self.bow_corpus_idx: list: A nested list in which each document is converted to a list of token indices.
self.bow_corpus_doc: list: A nested list in which each document is converted to a list of token words.
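As a rough illustration of how these three structures relate to each other, the sketch below builds them with the standard library only. The helper name `build_bow` is hypothetical, and the real class relies on `gensim.corpora.dictionary.Dictionary`, whose id assignment and bag-of-words format may differ from this simplification.

```python
def build_bow(tokenized_sentence_list):
    """Map each token to an integer id, then express every document both ways."""
    token2id = {}
    for tokens in tokenized_sentence_list:
        for token in tokens:
            # Ids are assigned in order of first appearance (an assumption of this sketch).
            token2id.setdefault(token, len(token2id))
    id2token = {idx: token for token, idx in token2id.items()}

    # One list of token indices per document.
    bow_corpus_idx = [[token2id[t] for t in tokens] for tokens in tokenized_sentence_list]
    # The same documents, mapped back from indices to token words.
    bow_corpus_doc = [[id2token[i] for i in doc] for doc in bow_corpus_idx]
    return token2id, bow_corpus_idx, bow_corpus_doc


tokenized = [["topic", "model"], ["topic", "word", "vector"]]
token2id, bow_idx, bow_doc = build_bow(tokenized)
```

Here `bow_doc` round-trips to the original tokens, which is the relationship the attribute descriptions above suggest.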

After `train_lda` or `load_lda`

self.best_lda_model: dict: A dict containing the best model and its coherence value. {'coherence': int, 'model': gensim.models.ldamulticore.LdaMulticore}
self.lda_model_list: list: A nested list of [topic_num, model, coherence_value].
self.lda_model_dict: dict: A nested dict of the form {topic_num: {'coherence': int, 'model': gensim.models.ldamulticore.LdaMulticore}}.
self.trained: bool: True if trained or properly loaded.

After `visualize_lda_to_html`

self.selected_topic_num: int: The selected topic number.

self.selected_model: gensim.models.ldamulticore.LdaMulticore

self.vis_prepared: pyLDAvis.prepared_data.PreparedData

self.total_terms_df: pandas.DataFrame: The tinfo table with the 'Default' category removed.
self.top_relevant_terms_df: pandas.DataFrame: A rank table of Category.
self.r_adj_score_df: pandas.DataFrame: A tinfo table that takes saliency and relevance scores into account.
self.bow_score_list: list: The score of each sentence, based on bow_corpus, clipped to the range (0, 3).

After `estimate_topics_by_documents` or `load_estimated`

self.dominant_topic_estimation_df: pandas.DataFrame: A dataframe containing [‘lda_prob’, ‘dominant_topic’, ‘contribution’, ‘topic_keywords’]
self.topic_freq_df: pandas.DataFrame: A rank table by topic frequency.

After `get_representitive_documents` or `load_representitive_documents`: self.representitive_docs: pandas.DataFrame

After `get_representitive_candidates`: return repr_sentences, repr_bow_corpus_doc, repr_bow_corpus_idx

train_lda()

save_lda()

load_lda()

pick_best_lda_topics()

visualize_lda_to_html()

estimate_topics_by_documents()

get_representitive_documents()

See also

Preprocessing: unipy_nlp.preprocessing.Preprocessor
POS-Tagging: konlpy.tag.Mecab
Byte-Pair: sentencepiece

Examples

>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
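>>> sentence_for_pos_list = [
...     "무궁화 꽃이 피었습니다.",  # "The rose of Sharon has bloomed."
...     # "We were born in this land with the historic mission of national revival."
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]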
>>> tokenized = prep.pos_tag(
...     input_text=sentence_for_pos_list,
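...     tag_type=[
...         # substantive prefix, noun, Chinese character, foreign word
...         '체언 접두사', '명사', '한자', '외국어',
...         # numeral, separator
...         '수사', '구분자',
...         # verb
...         '동사',
...         # negative copula, positive copula
...         '부정 지정사', '긍정 지정사',
...     ]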
... )
>>> print(tokenized)
[['무궁화'], ['우리', '민족중흥', '역사', '사명']]
>>> tpm = utpm.TopicModeler(sentence_for_pos_list, tokenized)
>>> tpm.train_lda(
...     num_topic=5,
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.save_lda(savepath='data/_tmp_dump/topic_modeling', affix='lda')
>>> tpm.load_lda('data/_tmp_dump/topic_modeling')
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.visualize_lda_to_html(
...     7,
...     top_n=10,
...     r_normalized=False,
...     relevence_lambda_val=.6,
...     workers_n=8,
...     random_seed=1,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
...     # save_type='html',  # {'html', 'json'}
...     save_relevent_terms_ok=True,
...     save_html_ok=True,
...     display_ok=False,
... )

>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     # sentence_list=tokenized,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )

estimate_topics_by_documents(target_topic_num, random_seed=1, save_ok=True, savepath='./', filename_affix='lda')

Get the dominant topic and its contribution score for each document.

Parameters

target_topic_num (int) – The topic number of the trained LDA model to use.
random_seed (int (default: 1)) – A random seed number.
save_ok (bool (default: True)) – Whether to save the returned pandas.DataFrame.
savepath (str (default: ‘./’)) – A dirpath to save the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to save the topic-labeled sentences.

Returns

dominant_topic_estimation_df (pandas.DataFrame) – The given (training) sentences, labeled with topics.
topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.
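The idea of a dominant topic can be sketched without any third-party library. The function below is a hypothetical stand-in, not the method's actual implementation: it assumes each document comes with a list of `(topic_id, probability)` pairs, takes the pair with the highest probability as the dominant topic and its contribution, and then counts how often each topic is dominant. The input values are made up for illustration.

```python
from collections import Counter


def estimate_dominant_topics(doc_topic_probs, topic_keywords):
    """Label each document with its most probable topic and count topic frequency."""
    rows = []
    for probs in doc_topic_probs:
        # The dominant topic is the (topic_id, probability) pair with the largest probability.
        topic_id, contribution = max(probs, key=lambda pair: pair[1])
        rows.append({
            "lda_prob": probs,
            "dominant_topic": topic_id,
            "contribution": contribution,
            "topic_keywords": topic_keywords[topic_id],
        })
    # Rank topics by how many documents they dominate, most frequent first.
    topic_freq = Counter(row["dominant_topic"] for row in rows).most_common()
    return rows, topic_freq


# Made-up probabilities and keywords, for illustration only.
doc_topic_probs = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6), (1, 0.4)],
]
topic_keywords = {0: "price, market", 1: "team, match"}
rows, topic_freq = estimate_dominant_topics(doc_topic_probs, topic_keywords)
```

The two return values mirror the shape of `dominant_topic_estimation_df` and `topic_freq_df`, although the real method returns pandas.DataFrame objects rather than plain lists.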

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )

get_best_n_terms()

get_representitive_candidates(len_range=(10, 30))

Get representative candidates by length. Intended for use with unipy_nlp.network_plot.

Parameters

len_range (list or tuple (default: (10, 30))) – The length range used as a threshold for candidates.

Returns

repr_sentences (list) – A list of sentences.
repr_bow_corpus_doc (list) – A nested list in which each document is converted to a list of token words.
repr_bow_corpus_idx (list) – A nested list in which each document is converted to a list of token indices.
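A length filter of this kind might look like the sketch below. The helper `filter_by_length` is hypothetical, and two details are assumptions rather than documented behavior: length is measured as the number of tokens, and both bounds are inclusive.

```python
def filter_by_length(sentences, bow_corpus_doc, bow_corpus_idx, len_range=(10, 30)):
    """Keep only the documents whose token count falls inside len_range."""
    low, high = len_range
    kept = [
        (sentence, doc, idx)
        for sentence, doc, idx in zip(sentences, bow_corpus_doc, bow_corpus_idx)
        if low <= len(doc) <= high  # token count, inclusive bounds (assumption)
    ]
    if not kept:
        return [], [], []
    # Split the kept triples back into three parallel lists.
    repr_sentences, repr_doc, repr_idx = (list(column) for column in zip(*kept))
    return repr_sentences, repr_doc, repr_idx


sentences = ["a b", "a b c", "a b c d"]
docs = [["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]]
idxs = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
repr_sentences, repr_doc, repr_idx = filter_by_length(
    sentences, docs, idxs, len_range=(3, 4)
)
```

Filtering the three lists together keeps them aligned, so the sentence at position i still corresponds to the token lists at position i.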

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> (repr_sentences,
...  repr_bow_corpus_doc,
...  repr_bow_corpus_idx) = tpm.get_representitive_candidates(
...     len_range=(12, 30),
... )

get_representitive_documents(target_topic_num, len_range=(10, 30), top_n=10, save_ok=True, savepath='./', filename_affix='lda')

List the most representative documents by topic.

Parameters

target_topic_num (int) – The topic number of the trained LDA model to use.
len_range (list or tuple (default: (10, 30))) – The length range used as a threshold for candidates.
top_n (int (default: 10)) – The number of documents to list per topic.
save_ok (bool (default: True)) – An option to save.
savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.

Returns

reordered – Representative documents, grouped by topic and ordered by rank.

Return type

pandas.DataFrame
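Grouping by topic and ranking within each group can be sketched as follows. This is only an illustration under stated assumptions: the function `top_documents_by_topic` and the `rank` key are hypothetical, documents are ranked by their `contribution` score in descending order, and plain dicts stand in for the rows of the returned pandas.DataFrame.

```python
from itertools import groupby


def top_documents_by_topic(rows, top_n=10):
    """Return up to top_n rows per topic, ranked by contribution (highest first)."""
    # Sort by topic first, then by descending contribution within each topic.
    ordered = sorted(rows, key=lambda r: (r["dominant_topic"], -r["contribution"]))
    result = []
    for _topic, group in groupby(ordered, key=lambda r: r["dominant_topic"]):
        for rank, row in enumerate(list(group)[:top_n], start=1):
            result.append({**row, "rank": rank})
    return result


# Made-up rows, for illustration only.
rows = [
    {"sentence": "s1", "dominant_topic": 1, "contribution": 0.9},
    {"sentence": "s2", "dominant_topic": 0, "contribution": 0.6},
    {"sentence": "s3", "dominant_topic": 1, "contribution": 0.7},
    {"sentence": "s4", "dominant_topic": 0, "contribution": 0.8},
    {"sentence": "s5", "dominant_topic": 0, "contribution": 0.5},
]
top = top_documents_by_topic(rows, top_n=2)
```

With `top_n=2`, the lowest-scoring document of topic 0 is dropped and the rest come back grouped by topic.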

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )

load_estimated(target_topic_num, savepath='./', filename_affix='lda')

Load the result of self.estimate_topics_by_documents.

Parameters

target_topic_num (int) – The topic number of the trained LDA model to use.
savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.

Returns

dominant_topic_estimation_df (pandas.DataFrame) – The given (training) sentences, labeled with topics.
topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_labeled, topic_freq = tpm.load_estimated(
...     target_topic_num=7,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )

load_lda(filepath)

Load trained LDA model(s).

Parameters

filepath (str) – A dirpath to load from. It should contain .ldamodel files.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.load_lda('data/_tmp_dump/topic_modeling')

load_representitive_documents(target_topic_num, top_n=10, savepath='./', filename_affix='lda')

Load the result of self.get_representitive_documents.

Parameters

target_topic_num (int) – The topic number of the trained LDA model to use.
top_n (int (default: 10)) – The number of documents to list per topic. The upper bound depends on how many documents were saved.
savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.

Returns

representitive_docs (pandas.DataFrame) – Representative documents, grouped by topic and ordered by rank.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.load_representitive_documents(
...     target_topic_num=7,
...     top_n=10,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )

pick_best_lda_topics(num_topic_list=[5, 7, 10, 12, 15, 17, 20], lda_type='default', workers_n=2, random_seed=1)

Train multiple LDA Topic Models by given topic numbers.

Parameters

num_topic_list (list (default: [5, 7, 10, 12, 15, 17, 20])) – A list of topic numbers.
lda_type (str (default: ‘default’, one of {‘default’, ‘hdp’, ‘mallet’})) – The type of LDA model. Use ‘default’ for now; the other options are a work in progress.
workers_n (int (default: 2)) – The number of CPU cores to use for training.
random_seed (int (default: 1)) – A random seed int.
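The selection loop implied here can be sketched in a library-free way. Everything in this block is an assumption made for illustration: `pick_best`, `train_fn`, and `coherence_fn` are hypothetical names, the "models" are plain strings, and the scores are made-up placeholders rather than measured coherence values. It also assumes that a higher coherence value is better.

```python
def pick_best(num_topic_list, train_fn, coherence_fn):
    """Train one model per topic number and keep the one with the highest coherence."""
    model_list = []
    for topic_num in num_topic_list:
        model = train_fn(topic_num)
        model_list.append([topic_num, model, coherence_fn(model)])

    # Same information, keyed by topic number for direct lookup.
    model_dict = {
        topic_num: {"coherence": coherence, "model": model}
        for topic_num, model, coherence in model_list
    }
    best_num = max(model_dict, key=lambda n: model_dict[n]["coherence"])
    return model_list, model_dict, model_dict[best_num]


# Placeholder scores standing in for a real coherence computation.
fake_scores = {5: 0.41, 7: 0.52, 10: 0.47}
model_list, model_dict, best = pick_best(
    [5, 7, 10],
    train_fn=lambda n: f"model-{n}",
    coherence_fn=lambda m: fake_scores[int(m.split("-")[1])],
)
```

The three return values have the same shape as the `lda_model_list`, `lda_model_dict`, and `best_lda_model` attributes described for the class.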

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )

save_lda(savepath='./', affix='lda')

Save trained LDA model(s).

Parameters

savepath (str (default: ‘./’)) – A dirpath to save.
affix (str (default: ‘lda’)) – An affix of the filename. The file extension will be .ldamodel.

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.save_lda(savepath='data/_tmp_dump/topic_modeling', affix='lda')

train_lda(num_topic=5, lda_type='default', workers_n=2, random_seed=1)

Train a single LDA Topic Model.

Parameters

num_topic (int (default: 5)) – The number of topics.
lda_type (str (default: ‘default’, one of {‘default’, ‘hdp’, ‘mallet’})) – The type of LDA model. Use ‘default’ for now; the other options are a work in progress.
workers_n (int (default: 2)) – The number of CPU cores to use for training.
random_seed (int (default: 1)) – A random seed int.

Example

>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
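>>> sentence_for_pos_list = [
...     "무궁화 꽃이 피었습니다.",  # "The rose of Sharon has bloomed."
...     # "We were born in this land with the historic mission of national revival."
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]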
>>> tokenized = prep.pos_tag(
...     input_text=sentence_for_pos_list,
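...     tag_type=[
...         # substantive prefix, noun, Chinese character, foreign word
...         '체언 접두사', '명사', '한자', '외국어',
...         # numeral, separator
...         '수사', '구분자',
...         # verb
...         '동사',
...         # negative copula, positive copula
...         '부정 지정사', '긍정 지정사',
...     ]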
... )
>>> tpm = utpm.TopicModeler(sentence_for_pos_list, tokenized)
>>> tpm.train_lda(
...     num_topic=5,
...     workers_n=8,
...     random_seed=1,
... )

visualize_lda_to_html(target_topic_num, top_n=10, r_normalized=False, relevence_lambda_val=0.6, workers_n=2, random_seed=1, savepath='./', filename_affix='lda', save_relevent_terms_ok=True, save_html_ok=True, display_ok=False)

Run pyLDAvis.prepare and get adjusted term scores (using saliency and relevance) for each topic.

Parameters

target_topic_num (int) – The topic number of the trained LDA model to visualize.
top_n (int (default: 10)) – The number of most relevant terms to keep per topic.
r_normalized (bool (default: False)) – Use normalized probabilities when True (not recommended in most cases).
relevence_lambda_val (float (default: 0.6)) – The lambda value (weight) used to calculate relevance.
workers_n (int (default: 2)) – The number of CPU cores to use for the calculation (pyLDAvis.prepare).
random_seed (int (default: 1)) – A random seed number.
savepath (str (default: ‘./’)) – A dirpath to save the pyLDAvis output and the other pandas.DataFrame objects.
filename_affix (str (default: ‘lda’)) – An affix of filename to save pyLDAvis html or json.
save_relevent_terms_ok (bool (default: True)) – An option to save pandas.DataFrame of top_relevent_terms.
save_html_ok (bool (default: True)) – An option to save html.
display_ok (bool (default: False)) – Call pyLDAvis.display when it is True.

References

Saliency: Chuang, J., 2012. Termite: Visualization techniques for assessing textual topic models
Relevance: Sievert, C., 2014. LDAvis: A method for visualizing and interpreting topics
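For orientation, the relevance measure from the LDAvis paper combines the log probability of a term within a topic with its log lift: relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). The sketch below evaluates that formula directly; the function name `relevance` and the probabilities are illustrative assumptions, and it is not claimed to be how this method computes its adjusted scores.

```python
import math


def relevance(p_term_given_topic, p_term, lambda_val=0.6):
    """Relevance of a term to a topic, following the LDAvis definition.

    lambda_val = 1 ranks terms purely by their probability within the topic;
    lambda_val = 0 ranks them purely by lift, p(w|t) / p(w).
    """
    log_prob = math.log(p_term_given_topic)
    log_lift = math.log(p_term_given_topic / p_term)
    return lambda_val * log_prob + (1 - lambda_val) * log_lift


# Two made-up terms with the same in-topic probability but different overall frequency.
rare_overall = relevance(p_term_given_topic=0.05, p_term=0.01)
common_overall = relevance(p_term_given_topic=0.05, p_term=0.05)
```

With the weight at 0.6, the term that is rare across the whole corpus scores higher than the equally probable but globally common one, which is the effect the lambda parameter is meant to control.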

Example

>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.visualize_lda_to_html(
...     7,
...     top_n=10,
...     r_normalized=False,
...     relevence_lambda_val=.6,
...     workers_n=8,
...     random_seed=1,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
...     save_relevent_terms_ok=True,
...     save_html_ok=True,
...     display_ok=False,
... )

unipy_nlp.analyze.word2vec module

Word2Vec.

class unipy_nlp.analyze.word2vec.Word2Vec(tokenized_sentence_list)

Bases: object

get_similar(words, top_n=2)

load_w2v(filepath)

save_tensorboard(dirpath=None)

save_w2v(filepath)

train_w2v(size=70, window=4, min_count=10, negative=16, workers=8, iter=50, sg=1)

Module contents

Topic Modeling(LDA) & Word2Vec.