
Korcen-13M-EXAONE (Incomplete)

Another failure, but a better one.


"Refined Intelligence: Enhanced Accuracy and Adaptability in ML Filtering."

This project was intended to be an advanced iteration of our machine-learning-based filter, leveraging a significantly larger dataset. However, the quality of the expanded data turned out to be poor, ultimately leading to unsatisfactory filtering performance.

Undeterred by this challenge, we are committed to overcoming this data quality issue. We are actively focusing on refining our data acquisition and cleaning processes and will continue to develop and release upgraded models that progressively enhance accuracy, reduce false positives, and improve adaptability to evolving slang and offensive language. Our dedication to providing a robust and reliable filtering solution remains unwavering.

Korcen: the original, before innovation.

Korcen-kogpt2: the first innovation, and the first failure.

Model Overview

Total samples: 14,879,960
Training samples: 11,903,968
Validation samples: 2,975,992

Parameters: 13,197,569

Training time: 4 hours

Tokenizer: EXAONE 3.5 tokenizer (vocab size: 102,400)
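
The inference example below loads this tokenizer from a local directory. Here is a minimal sketch of how that directory could be prepared with Hugging Face transformers, assuming the LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct checkpoint (the exact EXAONE 3.5 checkpoint used for training is an assumption; the 3.5 family shares the 102,400-token vocabulary listed above):

from transformers import AutoTokenizer

# Assumed checkpoint: any EXAONE 3.5 model ships the same 102,400-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct")
print("Vocab size:", tokenizer.vocab_size)  # expected: 102400

# Save a local copy so the inference example can load it from TOKENIZER_DIR
tokenizer.save_pretrained("tokenizer_directory")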


Example (Python 3.10, TensorFlow 2.10)

import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer

print("TensorFlow Version:", tf.__version__)

# Paths to the trained Keras model and the locally saved EXAONE 3.5 tokenizer
MODEL_LOAD_PATH = 'abusive_language_model_exaone_based.h5'
TOKENIZER_DIR = 'tokenizer_directory'
MAX_LENGTH = 128  # maximum token sequence length fed to the model

try:
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
    print("Tokenizer loaded successfully.")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    print("Please ensure the tokenizer directory exists and is correct.")
    exit()

try:
    model = tf.keras.models.load_model(MODEL_LOAD_PATH)
    model.summary()
    print("Model loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")
    print(f"Please ensure the model file exists at {MODEL_LOAD_PATH} and TensorFlow version is compatible.")
    exit()

def preprocess_text(text, tokenizer, max_len):
    # Lowercase the text and tokenize it into fixed-length input IDs
    processed_text = text.lower()
    encoded = tokenizer(
        processed_text,
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_tensors='np'
    )
    return encoded['input_ids']

def predict_abusive(text, model, tokenizer, max_len, threshold=0.5):
    # Returns the abusive-class probability and the thresholded binary label
    processed_input = preprocess_text(text, tokenizer, max_len)
    probability = model.predict(processed_input, verbose=0)
    prediction = (probability >= threshold).astype(int)
    return probability.flatten()[0], prediction.flatten()[0]

input_text = input("Please enter a sentence: ")
probability, label = predict_abusive(input_text, model, tokenizer, MAX_LENGTH)

label_text = "욕설 (Abusive)" if label == 1 else "정상 (Normal)"
print(f"Text: \"{input_text}\"")
print(f"Probability (Abusive): {probability:.4f}")
print(f"Predicted Label: {label_text} ({label})")

About

This AI model is specifically trained to detect and classify Korean profanity with high accuracy.
