unipy_nlp package
Subpackages
- unipy_nlp.analyze package
- unipy_nlp.test package
- Submodules
- unipy_nlp.test.hunspell_desc module
- unipy_nlp.test.test_analyzer module
- unipy_nlp.test.test_data_collector module
- unipy_nlp.test.test_network_plot module
- unipy_nlp.test.test_preprocessing module
- unipy_nlp.test.test_tagger module
- unipy_nlp.test.test_topic_modeling module
- unipy_nlp.test.test_word2vec module
- Module contents
Submodules
unipy_nlp.data_collector module
Get Data from xlsx.
-
unipy_nlp.data_collector.refine_nested_excel_to_dict(xlsx_loaded) → pandas.core.frame.DataFrame
-
unipy_nlp.data_collector.collect_data(filepath, dump_filepath=None, dump_json_ok=True, return_tuple=True)
Summary
This function gathers text from xlsx raw data. It is not designed for general use.
- Parameters
filepath (str) – A directory in which the xlsx file(s) exist.
dump_json_ok (bool (default: True)) – True if
how ({'equal', 'remaining'}) – The method to split. ‘equal’ is to split chunks with the approximate length within the given size. ‘remaining’ is to split chunks with the given size, and the remains are bound as the last chunk.
size (int) – The number of chunks.
- Returns
A list of chunks.
- Return type
list
Examples
>>> up.splitter(list(range(10)), how='equal', size=3)
[(0, 1, 2, 3), (4, 5, 6), (7, 8, 9)]
>>> up.splitter(list(range(10)), how='remaining', size=3)
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]
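The two splitting modes described above can be sketched in plain Python. This is a minimal illustration, not the library's code: `split_chunks` is a hypothetical name, and the reading of `size` (number of chunks for 'equal', chunk length for 'remaining') is inferred from the two documented outputs.

```python
from typing import List, Sequence, Tuple


def split_chunks(seq: Sequence, how: str = "equal", size: int = 3) -> List[Tuple]:
    """Hypothetical stand-in that reproduces the two documented outputs."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    items = list(seq)
    if how == "equal":
        # Assumption: `size` is the number of chunks, and the earlier
        # chunks absorb the remainder one item at a time.
        base, extra = divmod(len(items), size)
        chunks, start = [], 0
        for i in range(size):
            stop = start + base + (1 if i < extra else 0)
            chunks.append(tuple(items[start:stop]))
            start = stop
        return chunks
    if how == "remaining":
        # Assumption: `size` is the chunk length, and the leftovers
        # are bound together as the last chunk.
        return [tuple(items[i:i + size]) for i in range(0, len(items), size)]
    raise ValueError("how must be 'equal' or 'remaining'")
```

Calling `split_chunks(list(range(10)), how='equal', size=3)` yields three chunks of lengths 4, 3 and 3, while `how='remaining'` yields three chunks of length 3 followed by a final one-element chunk.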
unipy_nlp.network_plot module
An N-gram network plot.
-
class unipy_nlp.network_plot.WordNetwork(topic_freq_df, top_relevant_terms_df)
Bases: object
A network plot of word co-occurrence.
- Parameters
topic_freq_df (list) – A rank table by topic frequency.
top_relevant_terms_df (list) – A rank table of Category.
-
pyvis_net
- Type
pyvis.network.Network
-
ngramed_list
- Type
list
-
ngramed_df
- Type
pandas.DataFrame
See also
Preprocessing: unipy_nlp.preprocessing.Preprocessor
Topic: unipy_nlp.analyze.topic_modeling.Topic_modeler
POS-Tagging: konlpy.tag.Mecab
Byte-Pair: sentencepiece
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
>>> vnet.save_ngram('data/_tmp_dump/network_plot/ngram.json', type='json')
>>> vnet.save_ngram('data/_tmp_dump/network_plot/ngram.csv', type='csv')
>>> vnet.load_ngram('data/_tmp_dump/network_plot/ngram.json', type='json')
>>> vnet.load_ngram('data/_tmp_dump/network_plot/ngram.csv', type='csv')
>>> vnet.draw(
...     height="100%",
...     width='800px',
...     bgcolor='#ffffff',
...     font_color='black',
...     directed=True,
...     topic_top_n=5,
...     node_freq_threshold=100,
...     show_buttons=True,
... )
>>> (score_dict,
...  score_dict_indiced) = vnet.get_topic_mutuality_score_dict(
...     cdict=tpm.corpora_dict
... )
>>> core_repr = vnet.get_network_scored_repr_docs(
...     bow_corpus=repr_bow_corpus_doc,
...     repr_docs=repr_sentenced,
...     save_ok=True,
...     filepath=None,
... )
-
draw(height='700px', width='800px', bgcolor='#ffffff', font_color='black', directed=True, notebook=False, topic_top_n=None, node_freq_threshold=None, show_buttons=True)
Draw pyvis.network.Network using N-grams.
- Parameters
height (str (default: “700px”)) – Height of the network plot. It can be pixel-based or percentage-based.
width (str (default: “800px”)) – Width of the network plot. It can be pixel-based or percentage-based.
bgcolor (str (default: ‘#ffffff’)) – HEX color for the background.
font_color (str (default: ‘black’)) – HEX color or color name for the font.
directed (bool (default: True)) – Whether to show the direction of each edge.
notebook (bool (default: False)) – Whether to render in a Jupyter notebook.
topic_top_n (int (default: None)) – The number of top topics to show, ranked by frequency.
node_freq_threshold (int (default: None)) – A frequency threshold for showing nodes. Useful when there are too many nodes and edges to show.
show_buttons (bool (default: True)) – Whether to show interactive buttons in the HTML output.
Example
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
>>> vnet.draw(
...     height="100%",
...     width='800px',
...     bgcolor='#ffffff',
...     font_color='black',
...     directed=True,
...     topic_top_n=5,
...     node_freq_threshold=100,
...     show_buttons=True,
... )
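To make the effect of a node frequency threshold concrete, here is a small sketch in plain Python. How draw applies node_freq_threshold internally is not documented here, so everything below is an assumption: the helper name `filter_edges_by_node_freq` is hypothetical, node frequency is taken to be the sum of the weights of the edges touching a node, and an edge is kept only when both endpoints reach the threshold.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Edge = Tuple[str, str]


def filter_edges_by_node_freq(
    edges: Dict[Edge, int], threshold: Optional[int] = None
) -> Dict[Edge, int]:
    """Keep edges whose endpoints both reach the frequency threshold."""
    if threshold is None:
        # Assumption: None means "no filtering".
        return dict(edges)
    node_freq: Counter = Counter()
    for (source, target), weight in edges.items():
        # Assumption: a node's frequency is the total weight of its edges.
        node_freq[source] += weight
        node_freq[target] += weight
    return {
        (source, target): weight
        for (source, target), weight in edges.items()
        if node_freq[source] >= threshold and node_freq[target] >= threshold
    }
```

Filtering before drawing keeps the rendered graph readable, which is the use case the parameter description mentions.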
-
get_network_scored_repr_docs(bow_corpus, repr_docs, save_ok=True, filepath=None)
Get representative documents, based on the mutuality score of terms.
- Parameters
bow_corpus (list) – A nested list containing documents converted into lists of token words.
repr_docs (list) – A list of raw documents.
save_ok (bool (default: True)) – An option to save.
filepath (str (default: None)) – A filepath to save.
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
>>> vnet.draw(
...     height="100%",
...     width='800px',
...     bgcolor='#ffffff',
...     font_color='black',
...     directed=True,
...     topic_top_n=5,
...     node_freq_threshold=100,
...     show_buttons=True,
... )
>>> vnet.save('data/_tmp_dump/network_plot/vnet.html')
>>> (score_dict,
...  score_dict_indiced) = vnet.get_topic_mutuality_score_dict(
...     cdict=tpm.corpora_dict
... )
>>> core_repr = vnet.get_network_scored_repr_docs(
...     bow_corpus=repr_bow_corpus_doc,
...     repr_docs=repr_sentenced,
...     save_ok=True,
...     filepath=None,
... )
-
get_ngram(tokenized_sentence_list)
Get N-grams for nodes & edges.
- Parameters
tokenized_sentence_list (list) – A list of tokenized documents.
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
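As a rough illustration of what "N-grams for nodes & edges" can mean, the sketch below counts adjacent word pairs (bigrams) across tokenized sentences. It is not the library's implementation: `count_bigrams` is a hypothetical helper, and the choice of N=2 is an assumption. Each distinct pair can then serve as a weighted edge between two word nodes.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_bigrams(tokenized_sentences: Iterable[List[str]]) -> Counter:
    """Count adjacent token pairs over all sentences."""
    counts: Counter = Counter()
    for tokens in tokenized_sentences:
        # zip(tokens, tokens[1:]) pairs each token with its successor;
        # sentences with fewer than two tokens contribute nothing.
        counts.update(zip(tokens, tokens[1:]))
    return counts


def to_edges(counts: Counter) -> List[Tuple[str, str, int]]:
    """Flatten the counter into (source, target, weight) edge triples."""
    return [(a, b, n) for (a, b), n in counts.most_common()]
```

The (source, target, weight) triples produced by `to_edges` are a convenient shape for most graph libraries.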
-
get_topic_mutuality_score_dict(cdict)
Get scores of terms, based on their mutuality.
- Parameters
cdict (gensim.corpora.dictionary.Dictionary) – A corpus dictionary for given documents.
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
>>> vnet.draw(
...     height="100%",
...     width='800px',
...     bgcolor='#ffffff',
...     font_color='black',
...     directed=True,
...     topic_top_n=5,
...     node_freq_threshold=100,
...     show_buttons=True,
... )
>>> vnet.save('data/_tmp_dump/network_plot/vnet.html')
>>> (score_dict,
...  score_dict_indiced) = vnet.get_topic_mutuality_score_dict(
...     cdict=tpm.corpora_dict
... )
-
load_ngram(filename, type='json')
Load N-grams.
- Parameters
filename (str) – A filepath to load.
type (str (default: ‘json’, {‘json’, ‘csv’})) – Choose file type.
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
>>> vnet.save_ngram('data/_tmp_dump/network_plot/ngram.json', type='json')
>>> vnet.load_ngram('data/_tmp_dump/network_plot/ngram.json', type='json')
-
save(filepath_html)
Save pyvis.network.Network.
- Parameters
filepath_html (str) – A filepath to save.
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
>>> vnet.draw(
...     height="100%",
...     width='800px',
...     bgcolor='#ffffff',
...     font_color='black',
...     directed=True,
...     topic_top_n=5,
...     node_freq_threshold=100,
...     show_buttons=True,
... )
>>> vnet.save('data/_tmp_dump/network_plot/vnet.html')
-
save_ngram(filepath, type='json')
Save N-grams.
- Parameters
filepath (str) – A filepath to save.
type (str (default: ‘json’, {‘json’, ‘csv’})) – Choose file type.
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> import unipy_nlp.analyze.topic_modeling as utpm
>>> import unipy_nlp.network_plot as unet
>>> tpm = utpm.TopicModeler(sentence_list, tokenized)
>>> tpm.train_lda(...)
>>> tpm.visualize_lda_to_html(...)
>>> vnet = unet.WordNetwork(
...     topic_freq_df=tpm.topic_freq_df,
...     top_relevant_terms_df=tpm.top_relevant_terms_df,
... )
>>> vnet.get_ngram(tokenized)
>>> vnet.save_ngram('data/_tmp_dump/network_plot/ngram.json', type='json')
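The on-disk layout that save_ngram and load_ngram use is not described on this page. As a hedged sketch of the general idea, the helpers below (hypothetical names `save_pairs` and `load_pairs`) round-trip a list of word pairs through either JSON or CSV using only the standard library. The `type` keyword mirrors the documented parameter name, even though it shadows the built-in inside these functions.

```python
import csv
import json
from typing import List, Tuple

Pair = Tuple[str, str]


def save_pairs(pairs: List[Pair], filepath: str, type: str = "json") -> None:
    """Write word pairs as a JSON array or as two-column CSV rows."""
    if type == "json":
        with open(filepath, "w", encoding="utf-8") as f:
            # ensure_ascii=False keeps Korean text human-readable on disk.
            json.dump([list(p) for p in pairs], f, ensure_ascii=False)
    elif type == "csv":
        with open(filepath, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(pairs)
    else:
        raise ValueError("type must be 'json' or 'csv'")


def load_pairs(filepath: str, type: str = "json") -> List[Pair]:
    """Read word pairs back into a list of tuples."""
    if type == "json":
        with open(filepath, encoding="utf-8") as f:
            return [tuple(p) for p in json.load(f)]
    if type == "csv":
        with open(filepath, encoding="utf-8", newline="") as f:
            return [tuple(row) for row in csv.reader(f)]
    raise ValueError("type must be 'json' or 'csv'")
```

JSON preserves nesting and types if the structure grows later, while CSV is easier to inspect in a spreadsheet; both round-trip plain string pairs without loss.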
unipy_nlp.preprocessing module
Tokenize text with sentencepiece & MeCab, from xlsx & Elasticsearch.
-
class unipy_nlp.preprocessing.Preprocessor(tagger='mecab')
Bases: object
Text preprocessing with POS-tagging or Byte-Pair Encoding.
Get tokens from text.
- Parameters
tagger (str {‘mecab’,}) – A POS-Tagging Engine to use.
-
source_sentences
Sentences loaded from text by read_json or read_es.
- Type
list
See also
POS-Tagging: konlpy.tag.Mecab
Byte-Pair: sentencepiece
Examples
>>> import unipy_nlp.data_collector as udcl
>>> import unipy_nlp.preprocessing as uprc
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
>>> sentence_for_pos_list = [
...     "무궁화 꽃이 피었습니다.",
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]
>>> tokenized_morphed_filtered = prep.pos_tag(
...     input_text=sentence_for_pos_list,
...     tag_type=[
...         '체언 접두사', '명사', '한자', '외국어',
...         '수사', '구분자',
...         '동사',
...         '부정 지정사', '긍정 지정사',
...     ]
... )
>>> print(tokenized_morphed_filtered)
[['무궁화'], ['우리', '민족중흥', '역사', '사명']]
>>> prep.train_spm(
...     source_type='list',
...     model_type='bpe',
...     vocab_size=30000,
...     model_name='spm_trained',
...     savepath='./data/_tmp_dump/spmed',
...     random_seed=1,
... )
>>> prep.load_spm(
...     savepath='./data/_tmp_dump/spmed',
...     model_name='spm_trained',
...     use_bos=False,
...     use_eos=False,
...     vocab_min_freq_threshold=None,
... )
>>> sentence_for_spm_list = [
...     "새로운 기술환경의 발전과 확산이 진행되는 it환경",
...     "비즈니스 환경과의 접목에 집중해 새로운 사업영역 선점",
... ]
>>> tokenized_spmed = prep.spm_encode(
...     sentence_for_spm_list,
...     type='piece',
...     rm_space=True,
... )
>>> pprint(tokenized_spmed)
[['새로운', '기술', '환경의', '발전과', '확산이', '진행되는', 'it', '환경'],
 ['비즈니스', '환경', '과의', '접목', '에', '집중', '해', '새로운', '사업영역', '선점']]
-
load_spm(savepath='./data', model_name=None, use_bos=False, use_eos=False, vocab_min_freq_threshold=None)
A high-level wrapper for sentencepiece.SentencePieceProcessor.Load.
- Parameters
savepath (str (default: ‘./data’)) – A dirpath to load.
model_name (str (default: ‘spm_trained’)) – A filename prefix to load.
use_bos (bool (default: False)) – An option of SetEncodeExtraOptions.
use_eos (bool (default: False)) – An option of SetEncodeExtraOptions.
vocab_min_freq_threshold (int (default: None)) – A lower bound on vocabulary frequency.
Example
>>> import unipy_nlp.preprocessing as uprc
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
>>> prep.load_spm(
...     savepath='./data/_tmp_dump/spmed',
...     model_name='spm_trained',
...     use_bos=False,
...     use_eos=False,
...     vocab_min_freq_threshold=None,
... )
-
pos_tag(input_text=None, tag_type=None)
POS-tagging with input_text or pre-loaded sentences.
- Parameters
input_text (list (default: None)) – A list of sentences. If None, use self.source_sentences internally.
tag_type (list (default: None)) – Tag names to keep. You can use either tag codes such as ‘NNP’ or Korean tag names such as ‘일반 명사’.
- Returns
tokenized
- Return type
list
Example
>>> import unipy_nlp.preprocessing as uprc
>>> prep = uprc.Preprocessor()
>>> sentence_for_pos_list = [
...     "무궁화 꽃이 피었습니다.",
...     "우리는 민족중흥의 역사적 사명을 띠고 이 땅에 태어났다.",
... ]
>>> tokenized = prep.pos_tag(
...     input_text=sentence_for_pos_list,
...     tag_type=[
...         '체언 접두사', '명사', '한자', '외국어',
...         '수사', '구분자',
...         '동사',
...         '부정 지정사', '긍정 지정사',
...         'NNP', 'NNG',
...     ]
... )
>>> print(tokenized)
[['무궁화'], ['우리', '민족중흥', '역사', '사명']]
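The idea of subsetting tokens by tag can be sketched without a tagger installed. The helper below is hypothetical and works on hand-written (word, tag) pairs rather than real MeCab output; it only illustrates the filtering step, not how pos_tag maps Korean tag names such as ‘일반 명사’ to tag codes.

```python
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def filter_by_tags(
    tagged_sentences: Iterable[TaggedSentence], keep_tags: Iterable[str]
) -> List[List[str]]:
    """Keep only the words whose tag code is in keep_tags."""
    keep = set(keep_tags)
    return [
        [word for word, tag in sentence if tag in keep]
        for sentence in tagged_sentences
    ]


# Hand-written sample, for illustration only (not produced by a tagger).
sample = [[("우리", "NP"), ("는", "JX"), ("역사", "NNG"), ("사명", "NNG")]]
nouns_only = filter_by_tags(sample, keep_tags=["NNG", "NNP"])
```

Here `nouns_only` holds one inner list containing only the two tokens tagged NNG.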
-
read_es(host, port, index='happymap_temp', match_as_flat_dict=None, key='contents', drop_min=2)
Read sentences from Elasticsearch, as self.source_sentences.
- Parameters
host (str) – A domain address of Elasticsearch server.
port (str) – A port number of Elasticsearch server.
index (str) – An index of Elasticsearch server.
match_as_flat_dict (dict (default: None)) – A flat dict of field-value pairs for a match query. If None, match_all is used. Example: match_as_flat_dict={‘sheet_nm’: ‘2019’, ‘table_nm’: ‘board’}
key (str) – A key of sentences in an object.
drop_min (int (default: 2)) – A lower bound on sentence length. If an inappropriate value is given, it is automatically replaced with 1.
Example
>>> import unipy_nlp.preprocessing as uprc
>>> ES_HOST = '52.78.243.101'
>>> ES_PORT = '9200'
>>> prep = uprc.Preprocessor()
>>> prep.read_es(
...     host=ES_HOST,
...     port=ES_PORT,
...     index='logs',
...     match_as_flat_dict={
...         'sheet_nm': '2019',
...         'table_nm': 'board',
...     },
...     key='contents',
...     drop_min=2,
... )
>>> prep.source_sentences[:2]
['새로운 기술환경의 발전과 확산이 진행되는 it환경', '비즈니스 환경과의 접목에 집중해 새로운 사업영역 선점']
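One plausible way a flat dict could be turned into an Elasticsearch request body is shown below. This is an assumption about the mapping, not a description of what read_es actually sends: `build_query_body` is a hypothetical helper that emits a match_all query when no dict is given and otherwise a bool query with one match clause per key, both of which are standard Query DSL shapes.

```python
from typing import Any, Dict, Optional


def build_query_body(match_as_flat_dict: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an Elasticsearch Query DSL body from a flat field-value dict."""
    if not match_as_flat_dict:
        # No filter given: match every document in the index.
        return {"query": {"match_all": {}}}
    # Assumption: every field-value pair must match (logical AND).
    must_clauses = [
        {"match": {field: value}} for field, value in match_as_flat_dict.items()
    ]
    return {"query": {"bool": {"must": must_clauses}}}
```

A body built this way can be passed to any Elasticsearch client's search call together with the index name.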
-
read_json(filename, key='contents', drop_min=2)
Read sentences from disk, as self.source_sentences.
- Parameters
filename (str) – A filepath to read.
key (str) – A key of sentences in json object.
drop_min (int (default: 2)) – A lower bound on sentence length. If an inappropriate value is given, it is automatically replaced with 1.
Example
>>> import unipy_nlp.preprocessing as uprc
>>> prep = uprc.Preprocessor()
>>> prep.read_json(
...     './data/_tmp_dump/prep/rawdata_collected.json',
...     key='contents',
...     drop_min=2,
... )
>>> prep.source_sentences[:2]
['새로운 기술환경의 발전과 확산이 진행되는 it환경', '비즈니스 환경과의 접목에 집중해 새로운 사업영역 선점']
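The role of key and drop_min can be illustrated with a standalone reader. The sketch below rests on several assumptions that this page does not confirm: the JSON file is taken to be a list of objects, sentence length is measured in characters after stripping whitespace, the bound is inclusive, and an invalid drop_min falls back to 1. The name `read_sentences` is hypothetical.

```python
import json
from typing import List


def read_sentences(filename: str, key: str = "contents", drop_min: int = 2) -> List[str]:
    """Load sentences stored under `key` and drop the ones that are too short."""
    if not isinstance(drop_min, int) or drop_min < 1:
        # Assumption: inappropriate values are replaced with 1.
        drop_min = 1
    with open(filename, encoding="utf-8") as f:
        # Assumption: the file holds a JSON array of objects.
        records = json.load(f)
    sentences = []
    for record in records:
        text = record.get(key)
        # Skip records without a string under `key`, then apply the bound.
        if isinstance(text, str) and len(text.strip()) >= drop_min:
            sentences.append(text)
    return sentences
```

Dropping very short entries at load time keeps empty cells and one-character noise out of the later tokenization steps.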
-
spm_encode(input_list, type='piece', rm_space=True)
A high-level wrapper for sentencepiece.SentencePieceProcessor.EncodeAsPieces or sentencepiece.SentencePieceProcessor.EncodeAsIds.
- Parameters
input_list (list) – A list of sentences to tokenize.
type (str (default: ‘piece’, {‘piece’, ‘id’})) – Choose encoding type. ‘piece’: str, ‘id’: int
rm_space (bool (default: True)) – An option to remove “▁” (U+2581), which represents the whitespace.
Example
>>> import unipy_nlp.preprocessing as uprc
>>> from pprint import pprint
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
>>> prep.load_spm(
...     savepath='./data/_tmp_dump/spmed',
...     model_name='spm_trained',
...     use_bos=False,
...     use_eos=False,
...     vocab_min_freq_threshold=None,
... )
>>> sentence_for_spm_list = [
...     "새로운 기술환경의 발전과 확산이 진행되는 it환경",
...     "비즈니스 환경과의 접목에 집중해 새로운 사업영역 선점",
... ]
>>> tokenized_spmed = prep.spm_encode(
...     sentence_for_spm_list,
...     type='piece',
...     rm_space=True,
... )
>>> pprint(tokenized_spmed)
[['새로운', '기술', '환경의', '발전과', '확산이', '진행되는', 'it', '환경'],
 ['비즈니스', '환경', '과의', '접목', '에', '집중', '해', '새로운', '사업영역', '선점']]
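SentencePiece marks word boundaries by prefixing pieces with "▁" (U+2581). The sketch below shows what removing that marker might look like on an already-encoded piece list. The helper name `remove_space_marker` is hypothetical, and dropping pieces that become empty is an assumption rather than documented rm_space behavior.

```python
from typing import Iterable, List

SPACE_MARK = "\u2581"  # the "▁" character sentencepiece uses for whitespace


def remove_space_marker(pieces: Iterable[str]) -> List[str]:
    """Strip the whitespace marker and drop pieces that become empty."""
    cleaned = (piece.replace(SPACE_MARK, "") for piece in pieces)
    return [piece for piece in cleaned if piece]


# Hand-written pieces for illustration, not real model output.
raw_pieces = ["▁새로운", "▁기술", "환경의", "▁"]
clean_pieces = remove_space_marker(raw_pieces)
```

Note that removing the marker discards word-boundary information, so the cleaned pieces can no longer be decoded back into the original spacing.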
-
train_spm(source_type='list', source_file=None, model_type='bpe', vocab_size=50000, model_name='spm_trained', savepath='./data', random_seed=None)
A high-level wrapper for sentencepiece.SentencePieceTrainer.Train.
- Parameters
source_type (str (default: ‘list’)) – ‘list’: use self.source_sentences as input. ‘txt’: use a given text file as input, with one sentence per line.
model_type (str (default: ‘bpe’, {‘bpe’, ‘word’, ‘char’, ‘unigram’})) – A model_type of sentencepiece.
vocab_size (int (default: 50000)) – Vocabulary size of the sentencepiece model.
model_name (str (default: ‘spm_trained’)) – A filename prefix to save.
savepath (str (default: ‘./data’)) – A dirpath to save.
random_seed (int (default: None)) – A random seed number.
Example
>>> import unipy_nlp.preprocessing as uprc
>>> prep = uprc.Preprocessor()
>>> prep.read_json('./data/_tmp_dump/prep/rawdata_collected.json')
>>> prep.train_spm(
...     source_type='list',
...     model_type='bpe',
...     vocab_size=30000,
...     model_name='spm_trained',
...     savepath='./data/_tmp_dump/spmed',
...     random_seed=1,
... )
unipy_nlp.tagger module
-
class unipy_nlp.tagger.Mecab(dicpath='/home/docs/checkouts/readthedocs.org/user_builds/unipy-nlp/checkouts/latest/unipy_nlp/_resources/mecab/mecab/lib/mecab/dic/mecab-ko-dic')
Bases: object
Wrapper for MeCab-ko morphological analyzer.
MeCab, originally a Japanese morphological analyzer and POS tagger developed by the Graduate School of Informatics in Kyoto University, was modified to MeCab-ko by the Eunjeon Project to adapt to the Korean language.
In order to use MeCab-ko within KoNLPy, follow the directions in optional-installations.
- Parameters
dicpath – The path of the MeCab-ko dictionary.
Module contents
NLP Analysis Tools.
unipy_nlp
- Provides
NLP Data Handling Tools
NLP Analysis Functions.
POS-Tagger (MeCab, non-sudo install required.)
Topic Modeling(LDA)
Word2Vec
NLP network plot.
How to use
In data science, preprocessing and plotting are among the most tedious parts of data analysis. unipy_nlp offers many functions that you may otherwise have had to search for on Google or Stack Overflow.
- The docstring examples assume that unipy_nlp has been imported as unlp::
>>> import unipy_nlp as unlp
- Use the built-in help function to view a docstring::
>>> help(unlp)
... # doctest: +SKIP
General-purpose documents like a glossary and help on the basic concepts
of numpy are available under the docs sub-module:
>>> from unipy import docs
>>> help(docs)
...
Available subpackages
- data_collector
Get Data from xlsx.
- tagger
A MeCab Wrapper without installation.
- preprocessing
Tokenize text with sentencepiece & MeCab, from xlsx & Elasticsearch.
- analyze
Topic Modeling(LDA) & Word2Vec.
- network_plot
An N-gram network plot.
- test
Test code for unipy_nlp.
This module provides a number of useful functions for natural language handling.