unipy_nlp.analyze package
Submodules
unipy_nlp.analyze.topic_modeling module
Topic Modeling (LDA) & Word2Vec.
- class unipy_nlp.analyze.topic_modeling.TopicModeler(sentence_list, tokenized_sentence_list)
Bases: object
Topic modeling via LDA (Latent Dirichlet Allocation).
Takes raw sentences together with their tokenized form.
- Parameters
sentence_list (list) – A list of raw sentences.
tokenized_sentence_list (list) – A nested list of tokenized sentences.
After `__init__`:
- self.sentences: list
A list of raw sentences.
- self.tokenized: list
A nested list of tokenized sentences.
- self.corpora_dict: gensim.corpora.dictionary.Dictionary
A token dictionary built from the given text.
- self.bow_corpus_idx: list
A nested list in which each document is converted into a list of token indices.
- self.bow_corpus_doc: list
A nested list in which each document is converted into a list of token words.
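The relationship between the token dictionary and the two bag-of-words views can be pictured with a small pure-Python sketch. It does not use gensim and is not the package's implementation: `build_token_ids` and `to_indices` are hypothetical names, ids are assigned in first-seen order only for illustration, and the exact element layout of `bow_corpus_idx` (for example, whether counts are attached) is not specified here.

```python
def build_token_ids(tokenized):
    """Assign an integer id to each distinct token, in first-seen order."""
    token_ids = {}
    for doc in tokenized:
        for token in doc:
            if token not in token_ids:
                token_ids[token] = len(token_ids)
    return token_ids


def to_indices(tokenized, token_ids):
    """Convert each document from token words to token indices."""
    return [[token_ids[token] for token in doc] for doc in tokenized]


# Toy input standing in for tokenized_sentence_list (an assumption).
tokenized = [["topic", "model"], ["topic", "word", "vector"]]
token_ids = build_token_ids(tokenized)
corpus_idx = to_indices(tokenized, token_ids)  # documents as token indices
corpus_doc = [list(doc) for doc in tokenized]  # documents as token words
```

Both views describe the same documents; only the representation of each token differs.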
After `train_lda` or `load_lda`:
- self.best_lda_model: dict
A dict containing the best model and its coherence value: {'coherence': int, 'model': gensim.models.ldamulticore.LdaMulticore}
- self.lda_model_list: list
A nested list of [topic_num, model, coherence_value] (assigned from model_coh_list).
- self.lda_model_dict: dict
A nested dict of the form {topic_num: {'coherence': int, 'model': gensim.models.ldamulticore.LdaMulticore}}
- self.trained: bool
True if trained or properly loaded.
After `visualize_lda_to_html`:
- self.selected_topic_num: int
The selected topic number.
- self.selected_model: gensim.models.ldamulticore.LdaMulticore
- self.vis_prepared: pyLDAvis.prepared_data.PreparedData
- self.total_terms_df
The tinfo table, with 'Default' removed.
- self.top_relevant_terms_df: pandas.DataFrame
A rank table by Category.
- self.r_adj_score_df: pandas.DataFrame
A tinfo table that takes the saliency and relevance scores into account.
- self.bow_score_list: list
Scores of each sentence, based on bow_corpus, clipped to (0, 3).
After `estimate_topics_by_documents` or `load_estimated`:
- self.dominant_topic_estimation_df: pandas.DataFrame
A dataframe containing ['lda_prob', 'dominant_topic', 'contribution', 'topic_keywords'].
- self.topic_freq_df: pandas.DataFrame
A rank table by topic frequency.
After `get_representitive_documents` or `load_representitive_documents`:
- self.representitive_docs: pandas.DataFrame
After `get_representitive_candidates`:
Returns repr_sentences, repr_bow_corpus_doc, repr_bow_corpus_idx.
See also
- Preprocessing: unipy_nlp.preprocessing.Preprocessor
- POS-Tagging: konlpy.tag.Mecab
- Byte-Pair: sentencepiece
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
>>> # Two Korean sample sentences; tag_type takes Korean POS tag names.
>>> sentence_for_pos_list = [
...     "무궁화 꽃이 피었습니다.",
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]
>>> tokenized = prep.pos_tag(
...     input_text=sentence_for_pos_list,
...     tag_type=[
...         '체언 접두사', '명사', '한자', '외국어',
...         '수사', '구분자',
...         '동사',
...         '부정 지정사', '긍정 지정사',
...     ]
... )
>>> print(tokenized)
[['무궁화'], ['우리', '민족중흥', '역사', '사명']]
>>> tpm = utpm.TopicModeler(sentence_for_pos_list, tokenized)
>>> tpm.train_lda(
...     num_topic=5,
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.save_lda(savepath='data/_tmp_dump/topic_modeling', affix='lda')
>>> tpm.load_lda('data/_tmp_dump/topic_modeling')
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.visualize_lda_to_html(
...     7,
...     top_n=10,
...     r_normalized=False,
...     relevence_lambda_val=.6,
...     workers_n=8,
...     random_seed=1,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
...     # save_type='html',  # {'html', 'json'}
...     save_relevent_terms_ok=True,
...     save_html_ok=True,
...     display_ok=False,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     # sentence_list=tokenized,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
- estimate_topics_by_documents(target_topic_num, random_seed=1, save_ok=True, savepath='./', filename_affix='lda')
Get the dominant topic and its contribution score for each document.
- Parameters
target_topic_num (int) – A topic number of LDA model to use.
random_seed (int (default: 1)) – A random seed number.
save_ok (bool (default: True)) – Whether to save the returned pandas.DataFrame objects.
savepath (str (default: ‘./’)) – A dirpath to save the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to save the topic-labeled sentences.
- Returns
dominant_topic_estimation_df (pandas.DataFrame) – The given (trained) sentences, labeled by topic.
topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.
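The idea behind the two return values can be sketched in plain Python. This is only an illustration under assumptions: the per-document topic distributions are made-up data, `dominant_topic` is a hypothetical helper, and the method itself returns pandas.DataFrame objects rather than lists.

```python
from collections import Counter


def dominant_topic(topic_probs):
    """Return (topic_id, probability) for the highest-probability topic."""
    return max(topic_probs, key=lambda pair: pair[1])


# Hypothetical per-document topic distributions: (topic_id, probability).
doc_topic_probs = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.3), (2, 0.6)],
    [(0, 0.5), (1, 0.4), (2, 0.1)],
]

# One (topic_id, contribution) pair per document.
dominant = [dominant_topic(probs) for probs in doc_topic_probs]

# Topics ranked by how many documents they dominate.
topic_freq = Counter(topic_id for topic_id, _ in dominant).most_common()
```

Here the first and third documents are dominated by topic 0 and the second by topic 2, so topic 0 ranks first in the frequency table.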
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
- get_representitive_candidates(len_range=(10, 30))
Get representative candidates by length, for use with unipy_nlp.network_plot.
- Parameters
len_range (list or tuple (default: (10, 30))) – A candidate threshold by length.
- Returns
repr_sentences (list) – A list of sentences.
repr_bow_corpus_doc (list) – A nested list, which contains converted documents into a list of token words.
repr_bow_corpus_idx (list) – A nested list in which each document is converted into a list of token indices.
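A length-based candidate filter of this kind might look like the sketch below. It is not the package's code: `filter_by_length` is a hypothetical helper, and it assumes that length means the character count of the raw sentence and that both bounds are inclusive, neither of which is stated above.

```python
def filter_by_length(sentences, tokenized, len_range=(10, 30)):
    """Keep sentences whose character length lies within len_range.

    Assumption: length is measured in characters, bounds inclusive.
    """
    low, high = len_range
    kept = [
        (sentence, tokens)
        for sentence, tokens in zip(sentences, tokenized)
        if low <= len(sentence) <= high
    ]
    kept_sentences = [sentence for sentence, _ in kept]
    kept_tokens = [tokens for _, tokens in kept]
    return kept_sentences, kept_tokens


sentences = ["too short", "this sentence is long enough", "x" * 40]
tokenized = [["short"], ["sentence", "long"], ["x"]]
kept_sentences, kept_tokens = filter_by_length(sentences, tokenized, (10, 30))
```

Filtering the raw sentences and their tokenized forms together keeps the returned lists aligned by position.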
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> (repr_sentences,
...  repr_bow_corpus_doc,
...  repr_bow_corpus_idx) = tpm.get_representitive_candidates(
...     len_range=(12, 30),
... )
- get_representitive_documents(target_topic_num, len_range=(10, 30), top_n=10, save_ok=True, savepath='./', filename_affix='lda')
List the most representative documents by topic.
- Parameters
target_topic_num (int) – A topic number of LDA model to use.
len_range (list or tuple (default: (10, 30))) – A candidate threshold by length.
top_n (int (default: 10)) – The number of documents to list per topic.
save_ok (bool (default: True)) – An option to save.
savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.
- Returns
reordered – Representative documents, grouped by topic and ordered by rank.
- Return type
pandas.DataFrame
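Grouping by topic and keeping the top-ranked documents can be sketched as follows. The ranking key is an assumption: this sketch ranks by the `contribution` value within each dominant topic, which may differ from what the method does, and `top_documents_by_topic` is a hypothetical helper working on plain dicts rather than a pandas.DataFrame.

```python
from collections import defaultdict


def top_documents_by_topic(rows, top_n=10):
    """Group rows by dominant topic; keep the top_n highest contributions."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["dominant_topic"]].append(row)
    return {
        topic: sorted(docs, key=lambda r: r["contribution"], reverse=True)[:top_n]
        for topic, docs in sorted(grouped.items())
    }


# Made-up topic-labeled documents.
rows = [
    {"doc": "a", "dominant_topic": 0, "contribution": 0.55},
    {"doc": "b", "dominant_topic": 1, "contribution": 0.80},
    {"doc": "c", "dominant_topic": 0, "contribution": 0.90},
    {"doc": "d", "dominant_topic": 0, "contribution": 0.60},
]
top = top_documents_by_topic(rows, top_n=2)
```

With `top_n=2`, topic 0 keeps documents "c" and "d" and drops "a", while topic 1 keeps its only document.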
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
- load_estimated(target_topic_num, savepath='./', filename_affix='lda')
Load the result of self.estimate_topics_by_documents.
- Parameters
target_topic_num (int) – A topic number of LDA model to use.
savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.
- Returns
dominant_topic_estimation_df (pandas.DataFrame) – The given (trained) sentences, labeled by topic.
topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_labeled, topic_freq = tpm.load_estimated(
...     target_topic_num=7,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
- load_lda(filepath)
Load trained LDA model(s).
- Parameters
filepath (str) – A directory path to load from. It should contain .ldamodel files.
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.load_lda('data/_tmp_dump/topic_modeling')
- load_representitive_documents(target_topic_num, top_n=10, savepath='./', filename_affix='lda')
Load the result of self.get_representitive_documents.
- Parameters
target_topic_num (int) – A topic number of LDA model to use.
top_n (int (default: 10)) – The number of documents to list per topic. The upper bound depends on how many documents were saved.
savepath (str (default: ‘./’)) – A dirpath to load the topic-labeled sentences.
filename_affix (str (default: ‘lda’)) – An affix of filename to load the topic-labeled sentences.
- Returns
dominant_topic_estimation_df (pandas.DataFrame) – The given (trained) sentences, labeled by topic.
topic_freq_df (pandas.DataFrame) – A rank table of topics by frequency.
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> sentence_labeled = tpm.estimate_topics_by_documents(
...     7,
...     random_seed=1,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.get_representitive_documents(
...     7,
...     len_range=(10, 30),
...     top_n=10,
...     save_ok=True,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
>>> sentence_repr = tpm.load_representitive_documents(
...     target_topic_num=7,
...     top_n=10,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
... )
- pick_best_lda_topics(num_topic_list=[5, 7, 10, 12, 15, 17, 20], lda_type='default', workers_n=2, random_seed=1)
Train multiple LDA topic models, one for each of the given topic numbers.
- Parameters
num_topic_list (list (default: [5, 7, 10, 12, 15, 17, 20])) – A list of topic numbers.
lda_type (str (default: ‘default’, {‘default’, ‘hdp’, ‘mallet’})) – The type of LDA model. Use ‘default’ for now; the other options are a work in progress.
workers_n (int (default: 2)) – The number of CPU cores used for training.
random_seed (int (default: 1)) – A random seed int.
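How the trained models relate to the stored attributes can be pictured with the sketch below. It assumes that the best model is the one with the highest coherence value, which is not stated explicitly here; `pick_best` is a hypothetical helper, and placeholder strings and made-up coherence values stand in for trained gensim models.

```python
def pick_best(model_coh_list):
    """Return {'coherence': ..., 'model': ...} for the highest coherence.

    Assumption: a higher coherence value means a better model.
    """
    _, model, coherence = max(model_coh_list, key=lambda item: item[2])
    return {"coherence": coherence, "model": model}


# Nested list of [topic_num, model, coherence_value]; values are made up.
model_coh_list = [
    [5, "model_5", 0.41],
    [7, "model_7", 0.48],
    [10, "model_10", 0.44],
]

# The same information keyed by topic number.
lda_model_dict = {
    topic_num: {"coherence": coherence, "model": model}
    for topic_num, model, coherence in model_coh_list
}
best = pick_best(model_coh_list)
```

Keying the models by topic number is what lets later calls select a model by passing a topic number such as 7.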
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
- save_lda(savepath='./', affix='lda')
Save trained LDA model(s).
- Parameters
savepath (str (default: ‘./’)) – A dirpath to save.
affix (str (default: ‘lda’)) – An affix of the filename. The file extension will be .ldamodel.
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.save_lda(savepath='data/_tmp_dump/topic_modeling', affix='lda')
- train_lda(num_topic=5, lda_type='default', workers_n=2, random_seed=1)
Train a single LDA topic model.
- Parameters
num_topic (int (default: 5)) – The number of topics.
lda_type (str (default: ‘default’, {‘default’, ‘hdp’, ‘mallet’})) – The type of LDA model. Use ‘default’ for now; the other options are a work in progress.
workers_n (int (default: 2)) – The number of CPU cores used for training.
random_seed (int (default: 1)) – A random seed int.
Example
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
>>> # Two Korean sample sentences; tag_type takes Korean POS tag names.
>>> sentence_for_pos_list = [
...     "무궁화 꽃이 피었습니다.",
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]
>>> tokenized = prep.pos_tag(
...     input_text=sentence_for_pos_list,
...     tag_type=[
...         '체언 접두사', '명사', '한자', '외국어',
...         '수사', '구분자',
...         '동사',
...         '부정 지정사', '긍정 지정사',
...     ]
... )
>>> tpm = utpm.TopicModeler(sentence_for_pos_list, tokenized)
>>> tpm.train_lda(
...     num_topic=5,
...     workers_n=8,
...     random_seed=1,
... )
- visualize_lda_to_html(target_topic_num, top_n=10, r_normalized=False, relevence_lambda_val=0.6, workers_n=2, random_seed=1, savepath='./', filename_affix='lda', save_relevent_terms_ok=True, save_html_ok=True, display_ok=False)
Run pyLDAvis.prepare and get adjusted scores (using saliency and relevance) of the terms in each topic.
- Parameters
target_topic_num (int) – A topic number of LDA model to visualize.
top_n (int (default: 10)) – The number of most relevant terms per topic.
r_normalized (bool (default: False)) – Use normalized probabilities when True (not recommended in most cases).
relevence_lambda_val (float (default: 0.6)) – The lambda value (ratio) used to calculate relevance.
workers_n (int (default: 2)) – The number of CPU cores used by pyLDAvis.prepare.
random_seed (int (default: 1)) – A random seed number.
savepath (str (default: ‘./’)) – A dirpath to save pyLDAvis or other `pandas.DataFrame`s.
filename_affix (str (default: ‘lda’)) – An affix of filename to save pyLDAvis html or json.
save_relevent_terms_ok (bool (default: True)) – An option to save pandas.DataFrame of top_relevent_terms.
save_html_ok (bool (default: True)) – An option to save html.
display_ok (bool (default: False)) – Call pyLDAvis.display when it is True.
References
- Saliency:
Chuang, J., 2012. Termite: Visualization techniques for assessing textual topic models
- Relevance:
Sievert, C., 2014. LDAvis: A method for visualizing and interpreting topics
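The two measures cited above can be written down directly from their published definitions. The sketch below is a plain-Python rendering of those formulas with hypothetical function and argument names; how this method combines them into its adjusted score is not shown here, so treat it as background rather than as the package's computation.

```python
import math


def relevance(p_w_given_t, p_w, lambda_val=0.6):
    """Relevance of a term within one topic (Sievert, 2014).

    lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))
    """
    return lambda_val * math.log(p_w_given_t) + (1 - lambda_val) * math.log(
        p_w_given_t / p_w
    )


def saliency(p_w, p_t_given_w, p_t):
    """Saliency of a term (Chuang, 2012): P(w) times its distinctiveness.

    Distinctiveness is the sum over topics of P(t|w) * log(P(t|w) / P(t)).
    """
    distinctiveness = sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_t_given_w, p_t)
        if ptw > 0  # a zero-probability topic contributes nothing
    )
    return p_w * distinctiveness
```

With `lambda_val=1.0` relevance reduces to the log probability of the term within the topic, and with `lambda_val=0.0` it reduces to the log lift, so values in between trade off frequent terms against topic-specific ones.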
Example
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.pick_best_lda_topics(
...     num_topic_list=[5, 7, 10],
...     workers_n=8,
...     random_seed=1,
... )
>>> tpm.visualize_lda_to_html(
...     7,
...     top_n=10,
...     r_normalized=False,
...     relevence_lambda_val=.6,
...     workers_n=8,
...     random_seed=1,
...     savepath='data/_tmp_dump/topic_modeling',
...     filename_affix='lda',
...     save_relevent_terms_ok=True,
...     save_html_ok=True,
...     display_ok=False,
... )
unipy_nlp.analyze.word2vec module
Word2Vec.
Module contents
Topic Modeling (LDA) & Word2Vec.