- About the Project
- Results
- Tech Stack
- File Structure
- Dataset
- Model Architecture
- Installation and Setup
- Future Scope
- Acknowledgements
- Contributors
This project develops a lip-reading system that interprets spoken words from sequences of images. Faces are located with Haar Cascade classifiers and the lip region is extracted with dlib's facial landmark detector, and a train-test split ensures robust model evaluation. The core of the project is a hybrid model combining 3D CNNs, which capture spatial features, with LSTMs, which model temporal dynamics. Extensive hyperparameter tuning improves the model's accuracy. The system has been tested on online videos for accuracy and reliability and includes a live detection feature that showcases its real-time capabilities.
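To illustrate the preprocessing step, here is a minimal sketch of face detection with a Haar Cascade followed by a lip crop taken from dlib's 68 facial landmarks (points 48–67). The model files correspond to those in `Dataset Preprocessing/xml files`, but the relative paths, crop margin, and function name are assumptions for illustration, not the project's exact code.

```python
import cv2
import dlib
import numpy as np

# Model files shipped in "Dataset Preprocessing/xml files" (relative paths assumed)
face_cascade = cv2.CascadeClassifier("xml files/haarcascade_frontalface_default.xml")
landmark_predictor = dlib.shape_predictor("xml files/shape_predictor_68_face_landmarks.dat")

def extract_lips(image_path, margin=10):
    """Detect the face, locate the 68 landmarks, and crop the mouth region (points 48-67)."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None

    x, y, w, h = faces[0]
    # Run the landmark predictor inside the detected face rectangle
    shape = landmark_predictor(gray, dlib.rectangle(int(x), int(y), int(x + w), int(y + h)))
    mouth_points = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])

    # Bounding box around the lips with a small margin, clamped to the image
    x_min, y_min = np.maximum(mouth_points.min(axis=0) - margin, 0)
    x_max, y_max = mouth_points.max(axis=0) + margin
    return image[y_min:y_max, x_min:x_max]
```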
Demo video: lip_reading.mp4
Result plots: For Words | For Phrases (see the images in the repository)
Category | Technologies |
---|---|
Programming Languages | Python |
Frameworks | TensorFlow / Keras |
Libraries | OpenCV, dlib |
Deep Learning Models | 3D CNN, LSTM, Bi-LSTM |
Dataset | MIRACL-VC1 |
Tools | Jupyter Notebook, Git |
Visualization & Analysis | |
├── Dataset Preprocessing
│   ├── xml files
│   │   ├── haarcascade_frontalface_default.xml
│   │   ├── haarcascade_mcs_mouth.xml
│   │   └── shape_predictor_68_face_landmarks.dat
│   ├── 01_Face_Extraction.ipynb
│   ├── 02_Lip_Extraction.ipynb
│   └── 03_Train_Test_Split.ipynb
├── Hyperparameter Tuning
│   ├── Grid Search.ipynb
│   └── Random Search.ipynb
├── Mini Projects
│   ├── Cat_Dog_Classifier_CNN.ipynb
│   ├── Human_Action_Recognition_LSTM.ipynb
│   ├── Next_Word_Predictor_LSTM.ipynb
│   └── Video_Anomaly_Detection_CNN_LSTM.ipynb
├── Model Architecture
│   ├── Saved Model
│   │   └── 3D_CNN_Bi-LSTM.h5
│   ├── 3D_CNN.ipynb
│   ├── 3D_CNN_Bi-LSTM.ipynb
│   ├── 3D_CNN_From_Scratch.ipynb
│   ├── 3D_CNN_LSTM.ipynb
│   ├── Adam.ipynb
│   ├── CategoricalCrossentropy.ipynb
│   ├── Data_Augmentation.ipynb
│   ├── Dropout.ipynb
│   ├── EarlyStopping.ipynb
│   ├── L1_Regularization.ipynb
│   ├── L2_Regularization_1.ipynb
│   ├── L2_Regularization_2.ipynb
│   └── RMSprop.ipynb
├── Model Evaluation
│   ├── Accuracy.ipynb
│   ├── Live_Detection.ipynb
│   ├── Onlne_Testing.ipynb
│   ├── Precision.ipynb
│   └── Recall.ipynb
├── Notes
│   ├── LSTM
│   ├── OpenCV
│   ├── Om Mukherjee
│   └── Sourish Phate
└── README.md
The MIRACL-VC1 dataset is structured to facilitate research in visual speech recognition, particularly lip reading. Here's a breakdown of its structure and contents:
- Video Clips: The dataset contains short video clips of multiple speakers reciting specific phrases. Each clip captures the upper body, focusing mainly on the face and mouth area.
- Speakers: It features several speakers from diverse backgrounds, which helps models generalize across different individuals and speaking styles.
- Languages: The dataset is in English, though speakers vary in accent and pronunciation.
- Phrases: Each video clip corresponds to one of a predefined set of phrases, which are recited by the speakers. The phrases are usually short and may cover simple daily expressions or numbers.
Download the MIRACL-VC1 dataset on Kaggle
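As a point of reference, the sketch below iterates one commonly used layout of MIRACL-VC1 (speaker / words-or-phrases / class / instance / colour frames) and collects each utterance's frames into a sequence. The directory layout, folder names, and frame-file pattern are assumptions about the Kaggle download rather than something defined in this repository, so adjust them to match your copy of the dataset.

```python
import glob
import os
import cv2

DATASET_DIR = "MIRACL-VC1"  # assumed local path after downloading from Kaggle

def load_utterances(kind="words"):
    """Yield (speaker, class_id, instance_id, frames) for every utterance.

    Assumes a layout like MIRACL-VC1/<speaker>/<words|phrases>/<class>/<instance>/color_*.jpg;
    change the glob patterns if your download is organised differently.
    """
    for speaker_dir in sorted(glob.glob(os.path.join(DATASET_DIR, "*"))):
        speaker = os.path.basename(speaker_dir)
        for instance_dir in sorted(glob.glob(os.path.join(speaker_dir, kind, "*", "*"))):
            class_id = os.path.basename(os.path.dirname(instance_dir))
            instance_id = os.path.basename(instance_dir)
            frame_paths = sorted(glob.glob(os.path.join(instance_dir, "color_*.jpg")))
            frames = [cv2.imread(p) for p in frame_paths]
            yield speaker, class_id, instance_id, frames

# Example: inspect the first utterance
speaker, class_id, instance_id, frames = next(load_utterances("words"))
print(speaker, class_id, instance_id, len(frames), "frames")
```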
- 3D Convolutional Neural Network (3D CNN): Several convolutional layers are used, each followed by activation functions and pooling layers to reduce dimensionality while preserving essential features.
- Reshape Layer: The tensor dimensions are adjusted to flatten the spatial data into a format that the LSTM can process.
- Long Short-Term Memory (LSTM): One or more LSTM layers process the sequential data, enabling the model to retain information over time and improve prediction accuracy.
- Flatten Layer: This flattens the data without altering its values, preparing it for the next stage.
- Dropout Layer: A dropout rate (e.g., 0.5) controls the fraction of neurons dropped, which prevents overfitting.
- Dense Layers: One or more dense layers with activation functions (e.g., softmax for multi-class classification) output the prediction probabilities.
By combining these components, the model effectively learns to interpret lip movements, translating them into accurate predictions of spoken words.
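A minimal Keras sketch of such a stack is shown below, assuming TensorFlow/Keras as the framework. The input shape (frames, height, width, channels), filter counts, LSTM width, and number of classes are illustrative values, not the exact configuration saved in 3D_CNN_Bi-LSTM.h5.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 10                # e.g. 10 word classes -- assumed value
INPUT_SHAPE = (22, 64, 64, 1)   # (frames, height, width, channels) -- assumed values

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),

    # 3D convolutions capture spatial features across the stacked lip frames
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),

    # Collapse each frame's feature map into a vector so the LSTM sees a sequence
    layers.Reshape((INPUT_SHAPE[0], -1)),

    # Bidirectional LSTM models the temporal dynamics of the lip movement
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),

    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```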
Follow these steps to set up the project environment and install necessary dependencies.
Ensure you have the following software installed:
- Python 3
- Git
Clone the project repository from GitHub:
git clone https://github.com/sourishphate/Project-X-Lip-Reading.git
cd Project-X-Lip-Reading
Install the required Python packages:
pip install -r requirements.txt
Start the live detection application using the following command:
python '.\Model Evaluation\Live_detection.py'
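For context, a heavily simplified sketch of what a real-time loop of this kind involves (webcam capture, a lip crop, frame buffering, prediction) is shown below. The frame size, clip length, placeholder labels, and the crude lower-face crop are illustrative assumptions and do not reproduce the contents of Live_detection.py, which is based on the dlib landmark pipeline described earlier.

```python
import collections
import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Assumed paths and settings; the real script may differ
model = load_model("Model Architecture/Saved Model/3D_CNN_Bi-LSTM.h5")
LABELS = ["word_0", "word_1", "word_2"]   # placeholder labels, not the project's list
FRAMES_PER_CLIP = 22                      # assumed sequence length
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

buffer = collections.deque(maxlen=FRAMES_PER_CLIP)
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        # Crude lip region: lower third of the face box (the project uses dlib landmarks)
        lips = gray[y + 2 * h // 3 : y + h, x : x + w]
        lips = cv2.resize(lips, (64, 64)) / 255.0
        buffer.append(lips[..., np.newaxis])

    if len(buffer) == FRAMES_PER_CLIP:
        clip = np.expand_dims(np.stack(buffer), axis=0)   # (1, frames, 64, 64, 1)
        probs = model.predict(clip, verbose=0)[0]
        cv2.putText(frame, LABELS[int(np.argmax(probs))], (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

    cv2.imshow("Lip Reading", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```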
If you encounter issues or want to suggest improvements, please raise an issue on GitHub.
- Multilingual Model: Extend the current lip-reading model to support multiple languages, making it adaptable for a global audience and capable of handling diverse linguistic inputs.
- User Interface Development: Design a user-friendly interface that allows real-time interaction with the lip-reading model, improving accessibility and practical usability.
- Sentence-Level Lip Reading: Upgrade the model to read and interpret entire sentences, moving beyond word-level predictions to understand more complex speech patterns.
- Large-Scale Model with Bigger Datasets: Train on much larger datasets to improve the model's ability to generalize across varied lip movements and achieve greater accuracy.
We would like to express our gratitude for all the tools, resources, and courses that helped in the successful completion of this project.
Research Papers
- https://cs229.stanford.edu/proj2019aut/data/assignment_308832_raw/26646023.pdf
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://towardsdatascience.com/automated-lip-reading-simplified-c01789469dd8
Courses
A special thanks to our project mentor Veeransh Shah and to the entire Project X community for their unwavering support and guidance throughout this journey.