Expert Analysis

AI Model Development Plan for Urban Legend & Creepypasta Categorization

Task 1: Develop an AI model to identify and categorize urban legends and creepypastas.

Sub-tasks:

  • Data Collection and Annotation:
* Identify sources: Scrape websites like Creepypasta Wiki, Urban Legends Online, Snopes (for debunked legends), Reddit (r/creepypasta, r/urbanlegends).

* Collect text data: Extract raw text of stories, potentially including metadata (source, date, tags).

* Annotation strategy:

    * Initial manual annotation of a small dataset to define categories (e.g., "supernatural", "psychological horror", "real-world threat", "debunked").

    * Consider crowdsourcing or semi-supervised labeling for larger datasets.

* Dataset split: Create training, validation, and test sets.
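The dataset split can be sketched with the standard library alone. This is a minimal illustration, assuming each example is a (text, label) pair and that a stratified 80/10/10 split is acceptable; the function name `stratified_split`, the fractions, and the toy data are hypothetical. In practice, scikit-learn's `train_test_split` with its `stratify` argument is the more common choice.

```python
import random
from collections import defaultdict


def stratified_split(examples, val_frac=0.1, test_frac=0.1, seed=13):
    """Split (text, label) pairs into train/val/test, preserving label ratios."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_val = int(len(items) * val_frac)
        n_test = int(len(items) * test_frac)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test


# Toy data (hypothetical) using two of the example categories.
examples = [(f"story {i}", "supernatural") for i in range(20)]
examples += [(f"legend {i}", "debunked") for i in range(10)]
train, val, test = stratified_split(examples)
```

Splitting per label keeps rare categories represented in the validation and test sets, which matters if some categories have far fewer stories than others.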

  • Data Preprocessing:
* Text cleaning: Remove HTML tags, special characters, irrelevant metadata.

* Tokenization: Break down text into words/subwords.

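* Stop word removal and lemmatization/stemming: Reduce noise and standardize words for classical baselines (e.g., bag-of-words or TF-IDF models). Skip this step for Transformer models, whose subword tokenizers are designed to take natural, unaltered text.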

* Feature extraction/embedding: Convert text into numerical representations (e.g., static pre-trained word embeddings such as Word2Vec or GloVe, or contextual embeddings from BERT).
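The text cleaning step might look like the following sketch, which uses only the standard library. The function name `clean_story` and the sample input are hypothetical, and the regex-based tag stripping is a simplification; a real HTML parser would be more robust on messy scraped pages.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_story(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    # Remove tags first so that escaped text such as "&lt;b&gt;" is not
    # mistaken for markup after decoding.
    no_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(no_tags)
    return WHITESPACE_RE.sub(" ", decoded).strip()


sample = "<p>The phone rang &amp; rang.</p>\n\n<p>Nobody   was there.</p>"
cleaned = clean_story(sample)
print(cleaned)
```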

  • Model Selection and Architecture:
* Base Model: Utilize a pre-trained Transformer-based language model (e.g., BERT, RoBERTa, DistilBERT) known for strong performance in text classification.

* Fine-tuning: Adapt the pre-trained model to our specific dataset and categories.

* Classification Layer: Add a custom classification head (e.g., a few dense layers with a softmax activation) on top of the Transformer encoder.
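A custom classification head of this kind could be sketched in PyTorch as below. This is an illustration under assumptions, not a fixed design: the class name `ClassificationHead`, the layer sizes, and the dropout rate are hypothetical, and a random tensor stands in for the pooled encoder output so the example runs without downloading a model. The head returns raw logits; softmax is applied separately when probabilities are needed.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Small dense head mapping a pooled encoder vector to category logits."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_size) -> logits: (batch, num_labels)
        return self.net(pooled)


# 768 is the hidden size of BERT-base; 4 matches the example categories.
head = ClassificationHead(hidden_size=768, num_labels=4)
head.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    logits = head(torch.randn(2, 768))  # stand-in for encoder output
    probs = torch.softmax(logits, dim=-1)
```

Keeping softmax outside the module is a common choice because PyTorch's cross-entropy loss expects logits during training.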

  • Model Training:
* Environment: Use Python with libraries like Transformers (Hugging Face), PyTorch/TensorFlow, scikit-learn.

* Hyperparameter tuning: Optimize learning rate, batch size, number of epochs.

* Loss function: Categorical cross-entropy for multi-class classification.

* GPU acceleration: Essential for efficient training of Transformer models.
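To make the loss function concrete, here is a small pure-Python sketch of softmax followed by categorical cross-entropy for a single example. The function names and the toy logits are hypothetical; during actual training a framework loss such as PyTorch's `nn.CrossEntropyLoss` would be used, which takes raw logits and applies log-softmax internally.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct category."""
    return -math.log(softmax(logits)[target_index])


# Four categories, correct category at index 0.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 0)    # model is undecided
confident_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], 0)  # model favors index 0
```

With four categories, an undecided model scores a loss of log(4), so a training loss near that value suggests the model has not yet learned anything useful.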

  • Model Evaluation:
* Metrics: Overall accuracy, plus precision, recall, and F1-score for each category.

* Confusion matrix: Visualize classification performance.

* Error analysis: Investigate misclassified examples to identify areas for improvement.
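The evaluation metrics above can be computed from a confusion matrix, as in this standard-library sketch. The function name `per_class_report` and the toy predictions are hypothetical; scikit-learn's `classification_report` and `confusion_matrix` provide the same information and would be the usual choice in the project.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return a (true, predicted) confusion counter and per-class metrics."""
    labels = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return confusion, report


# Toy labels (hypothetical) using three of the example categories.
y_true = ["supernatural", "supernatural", "debunked", "debunked",
          "psychological horror"]
y_pred = ["supernatural", "debunked", "debunked", "debunked",
          "supernatural"]
confusion, report = per_class_report(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The off-diagonal entries of `confusion` point directly at the category pairs worth inspecting during error analysis.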

  • Deployment (Initial Thought):
* Containerize the model using Docker.

* Expose as a REST API endpoint for integration with other components (e.g., article generation).
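A REST endpoint for the classifier could start from a sketch like the one below, using only the standard library. Everything specific here is an assumption for illustration: the `/classify` path, the `text` and `category` JSON fields, and the `classify` placeholder, which returns a fixed label in place of the fine-tuned model. A production service would more likely use a web framework inside the Docker container.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(text: str) -> str:
    """Placeholder for the fine-tuned model; always returns a fixed label."""
    return "supernatural"


def handle_classify(body: bytes):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "field 'text' must be a non-empty string"}
    return 200, {"category": classify(text)}


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/classify":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_classify(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the server; call this explicitly, e.g. from the container entrypoint."""
    HTTPServer(("0.0.0.0", port), ClassifyHandler).serve_forever()
```

Separating `handle_classify` from the HTTP handler keeps request validation testable without starting a server.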

Next Steps for Beru:
  • Begin compiling candidate data sources and draft a concrete data collection plan for the first category.
  • Once data is acquired, we will move to implementation.
