AI Model Development Plan for Urban Legend & Creepypasta Categorization
Task 1: Develop an AI model to identify and categorize urban legends and creepypastas.
Sub-tasks:
- Data Collection and Annotation:
* Collect text data: Extract raw text of stories, potentially including metadata (source, date, tags).
* Annotation strategy: Manually annotate a small initial dataset to define categories (e.g., "supernatural", "psychological horror", "real-world threat", "debunked"), then consider crowdsourcing or semi-supervised labeling for larger datasets.
* Dataset split: Create training, validation, and test sets.
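As a rough illustration of the dataset split, the sketch below divides labeled stories into training, validation, and test sets per category, so each category appears in every set. The 80/10/10 ratio, the function name stratified_split, and the toy stories are assumptions for illustration only; the plan does not fix any of them.

```python
import random
from collections import defaultdict

def stratified_split(examples, train_frac=0.8, val_frac=0.1, seed=13):
    """Split (text, label) pairs into train/val/test, category by category."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * train_frac)
        n_val = int(len(items) * val_frac)
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])  # remainder goes to the test set
    return train, val, test

# Toy data (hypothetical): 10 placeholder stories per category.
CATEGORIES = ("supernatural", "psychological horror", "real-world threat", "debunked")
examples = [(f"story {i} about {label}", label)
            for label in CATEGORIES
            for i in range(10)]

train, val, test = stratified_split(examples)
print(len(train), len(val), len(test))  # 32 4 4
```

With very small categories the integer rounding can leave the validation set empty, so the fractions would need revisiting once real data volumes are known.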
- Data Preprocessing:
* Tokenization: Break down text into words/subwords.
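* Stop word removal and lemmatization/stemming: Reduce noise and standardize words. These steps mainly help classical bag-of-words baselines; pre-trained Transformer tokenizers generally expect unaltered text.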
* Feature extraction/embedding: Convert text into numerical representations (e.g., using pre-trained static word embeddings such as Word2Vec or GloVe, or contextual embeddings from BERT).
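The preprocessing steps above can be sketched with the standard library alone. Note the substitution: this example uses simple bag-of-words counts as the numerical representation, not the pre-trained embeddings named in the plan, purely to keep it self-contained. The stop word list, the regular expression, and all function names are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to", "was"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocab(token_lists):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab

def bag_of_words(tokens, vocab):
    """Turn a token list into a count vector aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

docs = [
    "The figure was seen in the woods.",
    "It is a hoax, and the photo was edited.",
]
cleaned = [remove_stop_words(tokenize(d)) for d in docs]
vocab = build_vocab(cleaned)
vectors = [bag_of_words(tokens, vocab) for tokens in cleaned]
print(list(vocab))  # ['figure', 'seen', 'woods', 'hoax', 'photo', 'edited']
print(vectors)      # [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
```

Lemmatization/stemming is left out here because it needs a language resource or an external library.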
- Model Selection and Architecture:
* Fine-tuning: Adapt a pre-trained Transformer model (e.g., BERT) to our specific dataset and categories.
* Classification Layer: Add a custom classification head (e.g., a few dense layers with a softmax activation) on top of the Transformer encoder.
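To make the classification head concrete, here is a minimal forward-pass sketch in plain Python: one dense layer followed by a softmax over the four example categories. It assumes the encoder has already produced a pooled vector for the story; the hidden size of 8, the random weights, and the class name ClassificationHead are illustrative assumptions. A real head would be built and trained in a deep learning framework and could stack more than one dense layer.

```python
import math
import random

CATEGORIES = ["supernatural", "psychological horror", "real-world threat", "debunked"]

def softmax(logits):
    """Convert raw scores to probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class ClassificationHead:
    """Single dense layer plus softmax; weights are random, i.e. untrained."""

    def __init__(self, hidden_size, num_classes, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                        for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def forward(self, pooled):
        """Map a pooled encoder vector to one probability per category."""
        logits = [sum(w * x for w, x in zip(row, pooled)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(logits)

# Stand-in for the encoder output of one story (hypothetical values).
pooled = [0.5, -1.2, 0.3, 0.0, 0.9, -0.4, 1.1, 0.2]
head = ClassificationHead(hidden_size=len(pooled), num_classes=len(CATEGORIES))
probs = head.forward(pooled)
predicted = CATEGORIES[probs.index(max(probs))]
print(round(sum(probs), 6))  # 1.0
```

Because the weights are untrained, the predicted category is arbitrary; only the shape of the computation is meaningful.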
- Model Training:
* Hyperparameter tuning: Optimize learning rate, batch size, number of epochs.
* Loss function: Categorical cross-entropy for multi-class classification.
* GPU acceleration: Essential for efficient training of Transformer models.
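For reference, categorical cross-entropy for a single example is the negative log of the probability the model assigns to the true category. The sketch below computes it for hand-written probability vectors; the function names and the example numbers are assumptions chosen for illustration.

```python
import math

def categorical_cross_entropy(probs, true_index, eps=1e-12):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(max(probs[true_index], eps))  # eps avoids log(0)

def mean_loss(batch_probs, batch_labels):
    """Average the per-example loss over a batch."""
    losses = [categorical_cross_entropy(p, y)
              for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)

# Two hypothetical predictions over four categories; the true class is index 0.
confident = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]

print(round(categorical_cross_entropy(confident, 0), 3))   # 0.357
print(round(categorical_cross_entropy(uniform, 0), 3))     # 1.386
print(round(mean_loss([confident, uniform], [0, 0]), 3))   # 0.871
```

The uniform prediction yields log(4), which is a useful sanity check: a four-class model whose loss stays near 1.386 has learned nothing yet.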
- Model Evaluation:
* Confusion matrix: Visualize classification performance.
* Error analysis: Investigate misclassified examples to identify areas for improvement.
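A small sketch of both evaluation steps: building a confusion matrix from true and predicted labels, and listing the misclassified stories for manual review. The labels, story names, and predictions are made up for illustration, and the helper names are assumptions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def misclassified(texts, y_true, y_pred):
    """Collect (text, true label, predicted label) for every wrong prediction."""
    return [(text, t, p)
            for text, t, p in zip(texts, y_true, y_pred)
            if t != p]

labels = ["supernatural", "debunked"]
texts = ["story a", "story b", "story c", "story d"]
y_true = ["supernatural", "supernatural", "debunked", "debunked"]
y_pred = ["supernatural", "debunked", "debunked", "debunked"]

matrix = confusion_matrix(y_true, y_pred, labels)
errors = misclassified(texts, y_true, y_pred)
print(matrix)  # [[1, 1], [0, 2]]
print(errors)  # [('story b', 'supernatural', 'debunked')]
```

Off-diagonal cells show which category pairs get confused, and the error list gives the concrete stories to read when looking for annotation or modeling problems.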
- Deployment (Initial Thought):
* Expose as a REST API endpoint for integration with other components (e.g., article generation).
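One possible shape for such an endpoint, sketched with Python's standard library http.server. The /classify path, the JSON request and response fields, and the classify_text function are all assumptions; classify_text is a keyword rule standing in for the trained model, not a real classifier. A production deployment would more likely use a dedicated web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify_text(text):
    """Placeholder for the trained model (assumption): a trivial keyword rule."""
    label = "supernatural" if "ghost" in text.lower() else "debunked"
    return {"label": label}

class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/classify":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            text = payload["text"]
            if not isinstance(text, str):
                raise TypeError("text must be a string")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "Expected a JSON body with a string 'text' field")
            return
        body = json.dumps(classify_text(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet; a real service would log requests

def make_server(host="127.0.0.1", port=8000):
    """Create the HTTP server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), ClassifyHandler)
```

Calling make_server().serve_forever() starts the service locally; a client would then POST a body such as {"text": "A ghost in the attic"} to /classify and receive a JSON object with a "label" field.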
Next Steps for Beru:
- Begin collecting potential data sources and formulating a concrete plan for data collection for the first category.
- Once data is acquired, we will move to implementation.