
Lip Reading 💬


The goal is to convert visual information from lip movements into text. This involves recognizing and interpreting the movements of a speaker's lips to accurately transcribe spoken words.


📘 About the Project

This project focuses on developing a sophisticated lip-reading system that interprets spoken words from sequences of images. Using Haar Cascade classifiers for face extraction and dlib’s facial landmark detection for lip extraction, we effectively preprocess the data. A train-test split ensures robust model evaluation. The core of the project is a hybrid model combining 3D CNNs, which capture spatial features, and LSTMs, which understand temporal dynamics. Extensive hyperparameter tuning enhances the model’s accuracy. The system has been tested on online videos for accuracy and reliability and includes a live detection feature to showcase real-time capabilities.
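As a rough illustration of the preprocessing stage described above, the sketch below detects a face with a Haar Cascade and crops the lip region using dlib's 68-point landmarks. It is a minimal example, not the exact notebook code: it assumes OpenCV, dlib, and NumPy are installed, uses the XML and landmark files shipped under `Dataset Preprocessing/xml files`, and the helper name `extract_lip_region` and the 64×64 output size are illustrative.

```python
import cv2
import dlib
import numpy as np

# Paths assume the files shipped in "Dataset Preprocessing/xml files"
FACE_CASCADE_PATH = "Dataset Preprocessing/xml files/haarcascade_frontalface_default.xml"
LANDMARK_MODEL_PATH = "Dataset Preprocessing/xml files/shape_predictor_68_face_landmarks.dat"

face_cascade = cv2.CascadeClassifier(FACE_CASCADE_PATH)
landmark_predictor = dlib.shape_predictor(LANDMARK_MODEL_PATH)

def extract_lip_region(frame, output_size=(64, 64)):
    """Detect the face, locate the 68 facial landmarks, and crop the mouth (points 48-67)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found in this frame

    # Take the first detected face and refine it with dlib's landmark model.
    x, y, w, h = faces[0]
    rect = dlib.rectangle(int(x), int(y), int(x + w), int(y + h))
    shape = landmark_predictor(gray, rect)

    # Landmarks 48-67 outline the mouth; crop a bounding box around them.
    mouth_pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                         dtype=np.int32)
    mx, my, mw, mh = cv2.boundingRect(mouth_pts)
    lip_crop = gray[my:my + mh, mx:mx + mw]
    return cv2.resize(lip_crop, output_size)
```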

📊 Results

Live Detection

(Demo video: lip_reading.mp4)

Online Testing

(Online testing examples: input video frame and the model's predicted output)

Confusion Matrix

(Confusion matrices: one for words, one for phrases)

Accuracy

(Accuracy results image)

⚙️ Tech Stack

  • Programming Languages: Python
  • Frameworks: TensorFlow, Keras
  • Libraries: OpenCV, NumPy
  • Deep Learning Models: LSTM, CNN
  • Dataset: MIRACL-VC1
  • Tools: Git, Google Colab
  • Visualization & Analysis: Matplotlib, Seaborn

📁 File Structure

├── Dataset Preprocessing
   ├── xml files
      ├── haarcascade_frontalface_default.xml
      ├── haarcascade_mcs_mouth.xml
      ├── shape_predictor_68_face_landmarks.dat
   ├── 01_Face_Extraction.ipynb
   ├── 02_Lip_Extraction.ipynb
   ├── 03_Train_Test_Split.ipynb
├── Hyperparameter Tuning
   ├── Grid Search.ipynb
   ├── Random Search.ipynb
├── Mini Projects
   ├── Cat_Dog_Classifier_CNN.ipynb
   ├── Human_Action_Recognition_LSTM.ipynb
   ├── Next_Word_Predictor_LSTM.ipynb
   ├── Video_Anomaly_Detection_CNN_LSTM.ipynb
├── Model Architecture
   ├── Saved Model
      ├── 3D_CNN_Bi-LSTM.h5
   ├── 3D_CNN.ipynb
   ├── 3D_CNN_Bi-LSTM.ipynb
   ├── 3D_CNN_From_Scratch.ipynb
   ├── 3D_CNN_LSTM.ipynb
   ├── Adam.ipynb
   ├── CategoricalCrossentropy.ipynb
   ├── Data_Augmentation.ipynb
   ├── Dropout.ipynb
   ├── EarlyStopping.ipynb
   ├── L1_Regularization.ipynb
   ├── L2_Regularization_1.ipynb
   ├── L2_Regularization_2.ipynb
   ├── RMSprop.ipynb
├── Model Evaluation
   ├── Accuracy.ipynb
   ├── Live_Detection.ipynb
   ├── Onlne_Testing.ipynb
   ├── Precision.ipynb
   ├── Recall.ipynb
├── Notes
   ├── LSTM
   ├── OpenCV
   ├── Om Mukherjee
   ├── Sourish Phate       
├── README.md

💾 Dataset: MIRACL-VC1

The MIRACL-VC1 dataset is structured to facilitate research in visual speech recognition, particularly lip reading. Here's a breakdown of its structure and contents:

Data Composition:

  • Video Clips: The dataset contains short video clips of multiple speakers reciting specific phrases. Each clip captures the upper body, focusing mainly on the face and mouth area.
  • Speakers: It features several speakers from diverse backgrounds, which helps models generalize across different individuals and speaking styles.
  • Languages: The dataset is in English, though speakers vary in accents and pronunciations.
  • Phrases: Each video clip corresponds to one of a predefined set of phrases, which are recited by the speakers. The phrases are usually short and may cover simple daily expressions or numbers.

Dataset Contains The Following Words and Phrases:

(Image listing the dataset's words and phrases)

Download the MIRACL-VC1 dataset on Kaggle
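For illustration, iterating over the dataset might look like the sketch below. It assumes the commonly used MIRACL-VC1 folder layout (one folder per speaker containing words/ and phrases/ subfolders, each with numbered instances of color_*.jpg frames); verify the layout of your Kaggle download before relying on it, and note that `DATASET_ROOT` and `list_word_instances` are hypothetical names, not part of the project's notebooks.

```python
import os
import glob

# Assumed layout (verify against your download):
#   <root>/<speaker>/words/<word_id>/<instance>/color_*.jpg
DATASET_ROOT = "MIRACL_VC1"  # hypothetical local path

def list_word_instances(root=DATASET_ROOT):
    """Yield (speaker, word_id, instance_dir) for every word utterance that has color frames."""
    for speaker in sorted(os.listdir(root)):
        words_dir = os.path.join(root, speaker, "words")
        if not os.path.isdir(words_dir):
            continue
        for word_id in sorted(os.listdir(words_dir)):
            for instance in sorted(os.listdir(os.path.join(words_dir, word_id))):
                instance_dir = os.path.join(words_dir, word_id, instance)
                frames = sorted(glob.glob(os.path.join(instance_dir, "color_*.jpg")))
                if frames:
                    yield speaker, word_id, instance_dir
```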

🤖 Model Architecture

(Model architecture diagram)

  1. 3D Convolutional Neural Network (3D CNN): Several convolutional layers are used, each followed by activation functions and pooling layers to reduce dimensionality while preserving essential features.

  2. Reshape Layer: The tensor dimensions are adjusted to flatten the spatial data into a format that the LSTM can process.

  3. Long Short-Term Memory (LSTM): One or more LSTM layers are employed to process the sequential data, enabling the model to retain information over time and improve prediction accuracy.

  4. Flatten Layer: This flattens the data without altering its values, preparing it for the next stage.

  5. Dropout Layer: A dropout rate (e.g., 0.5) sets the fraction of neurons randomly dropped during training, which helps prevent overfitting.

  6. Dense Layers: One or more dense layers with activation functions (e.g., softmax for multi-class classification) are used to output the prediction probabilities.

By combining these components, the model effectively learns to interpret lip movements, translating them into accurate predictions of spoken words.
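As a hedged sketch of the layer ordering described above (not the exact configuration saved in 3D_CNN_Bi-LSTM.h5), a Keras version might look like the following. The input shape, frame count, filter sizes, and class count are illustrative, and the Reshape here also plays the role of the flatten step.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 10                   # number of target words/phrases -- adjust as needed
INPUT_SHAPE = (22, 64, 64, 1)      # (frames, height, width, channels) -- illustrative

def build_lip_reading_model():
    model = models.Sequential([
        # 1. 3D CNN: convolution + pooling to extract spatio-temporal features
        layers.Conv3D(32, kernel_size=(3, 3, 3), activation="relu",
                      padding="same", input_shape=INPUT_SHAPE),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),
        layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),

        # 2./4. Reshape: flatten each frame's feature maps into one vector per time step
        layers.Reshape((INPUT_SHAPE[0], -1)),

        # 3. LSTM: model the temporal dynamics across frames (bidirectional variant)
        layers.Bidirectional(layers.LSTM(128)),

        # 5. Dropout to reduce overfitting
        layers.Dropout(0.5),

        # 6. Dense softmax head producing class probabilities
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_lip_reading_model().summary()` prints the resulting layer stack, which mirrors the six components listed above.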

🛠️ Installation and Setup

Follow these steps to set up the project environment and install necessary dependencies.

Prerequisites

Ensure you have the following software installed:

  • Python 3.x
  • pip
  • Git

Clone the Repository

Clone the project repository from GitHub:

git clone https://github.com/sourishphate/Project-X-Lip-Reading.git
cd Project-X-Lip-Reading

Install Dependencies

Install the required Python packages:

pip install -r requirements.txt

Run the Application

Start the live detection application using the following command:

python '.\Model Evaluation\Live_detection.py'
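For reference, a minimal live-detection loop along these lines could look like the sketch below. It assumes the saved model at `Model Architecture/Saved Model/3D_CNN_Bi-LSTM.h5`, a lip-cropping helper such as `extract_lip_region` from the preprocessing sketch earlier, and placeholder class labels; the actual Live_detection.py script may differ.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

MODEL_PATH = "Model Architecture/Saved Model/3D_CNN_Bi-LSTM.h5"
SEQUENCE_LENGTH = 22                              # frames per prediction -- illustrative
LABELS = [f"class_{i:02d}" for i in range(10)]    # placeholder; replace with the trained classes

model = load_model(MODEL_PATH)
capture = cv2.VideoCapture(0)                     # default webcam
frame_buffer = []

while True:
    ok, frame = capture.read()
    if not ok:
        break

    lip = extract_lip_region(frame)               # helper from the preprocessing sketch above
    if lip is not None:
        frame_buffer.append(lip / 255.0)          # normalise pixel values
        frame_buffer = frame_buffer[-SEQUENCE_LENGTH:]

    if len(frame_buffer) == SEQUENCE_LENGTH:
        # Shape to (1, frames, height, width, 1) and predict the most likely class.
        batch = np.expand_dims(np.array(frame_buffer)[..., np.newaxis], axis=0)
        probs = model.predict(batch, verbose=0)[0]
        cv2.putText(frame, LABELS[int(np.argmax(probs))], (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)

    cv2.imshow("Lip Reading - Live Detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

capture.release()
cv2.destroyAllWindows()
```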

Troubleshooting

If you encounter any issues or want to suggest improvements, please raise an issue on GitHub.

🌟 Future Scope

  • Multilingual Model: Extend the current lip-reading model to support multiple languages, making it adaptable for a global audience and capable of handling diverse linguistic inputs.

  • User Interface Development: Design a user-friendly interface that allows real-time interaction with the lip-reading model, improving accessibility and practical usability.

  • Sentence-Level Lip Reading: Upgrade the model to read and interpret entire sentences, moving beyond word-level predictions to understand more complex speech patterns.

  • Large-Scale Model with Bigger Datasets: Transition to a large-scale model by training with much larger datasets, which will boost the model’s ability to generalize across various lip movements, leading to greater accuracy.

📜 Acknowledgement

We would like to express our gratitude for the tools, research papers, and courses that helped in the successful completion of this project.

Research Papers

Courses

A special thanks to our project mentor Veeransh Shah and to the entire Project X community for unwavering support and guidance throughout this journey.

👥 Contributors
