
How do I control the emotion in demo.py #5

Open
chakri1804 opened this issue Jan 17, 2025 · 1 comment

Comments

@chakri1804

Thanks for sharing this amazing work!
I was able to set up the environment and run demo.py to get the final renders.
I have a couple of questions and pointers:

  • How can I control the emotion of the final renders, as shown in the paper?
    I can see the indices dictionary for emotions,
    but I'm not sure which variable these indices should be fed into to control the final renders.
  • Another, less important, question just out of curiosity: is there a way to fine-tune the network for a new subject?
    Let's say I have a new person's data and I've tracked their FLAME params with spectre or some other FLAME tracker.
    What sort of data should I collect from that person, and how do I fine-tune the model?
  • The requirements.txt file is somewhat problematic.
    Newer conda environments ship with pip 24.2, while the requirements need pip 24.0;
    otherwise certain packages (hydra-core and fairseq) fail to install.
    Once you downgrade pip, comment out the funasr packages from the list and install the rest.
    Finally, run pip install --no-deps funasr==1.1.16 to complete the installation without breaking the other packages.
    This worked for me (see the sketch after this list); you might want to add it to your README for future reference.
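
For reference, here is a minimal, re-runnable sketch of the install order above as a small Python script. The file name requirements_no_funasr.txt is hypothetical (a copy of requirements.txt with the funasr lines commented out); the rest simply mirrors the pip commands from this comment.

```python
# Sketch of the workaround described above; equivalent to running the three
# pip commands by hand, in this order.
import subprocess
import sys

def pip(*args):
    # Use the current interpreter's pip so we stay inside this environment.
    subprocess.run([sys.executable, "-m", "pip", *args], check=True)

pip("install", "pip==24.0")                          # downgrade pip first
pip("install", "-r", "requirements_no_funasr.txt")   # hypothetical: funasr lines removed
pip("install", "--no-deps", "funasr==1.1.16")        # add funasr without touching its deps
```
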
@whwjdqls
Owner

whwjdqls commented Jan 18, 2025

Hi chakri1804,

  1. The emotions cannot be controlled directly. The facial expressions (emotions) are driven by the audio, not by any explicit inputs. However, you can use a reference audio clip or facial expression to extract DEE features and feed those to the network instead of the input audio's DEE features, as in Fig. 6 of our main paper (see the first sketch below).
  2. There are some works that do address the speaking styles of different subjects (such as Imitator and Mimic), but ours, like EmoTalk and EMOTE, does not. We only input an actor-identity one-hot vector so that the network is not confused by the speaking styles of different subjects. So if you were to fine-tune our model on a new person's data, the most naive way would be to replace the actor-identity projection layer with a new layer and fine-tune the whole network using the stage-2 training scheme (see the second sketch below).
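
To make the reference-driven idea in point 1 concrete, here is a minimal sketch. The encoder/decoder callables, feature shapes, and argument names below are stand-ins rather than the actual API of this repo; substitute whatever objects demo.py builds.

```python
import torch

# Dummy stand-ins so the sketch runs; replace them with the real DEE encoder
# and motion decoder from demo.py (names and shapes here are assumptions).
dee_encoder = lambda audio_feats: audio_feats.mean(dim=1)    # audio features -> (B, D) emotion embedding
decoder = lambda audio_feats, dee_emb: torch.zeros(          # (audio feats, DEE emb) -> per-frame FLAME params
    audio_feats.shape[0], audio_feats.shape[1], 56)

driving_feats   = torch.randn(1, 100, 768)   # the audio that should drive the lip sync
reference_feats = torch.randn(1, 100, 768)   # a clip carrying the emotion you want

with torch.no_grad():
    dee_emb = dee_encoder(reference_feats)    # DEE features from the reference, not the input audio
    flame_out = decoder(driving_feats, dee_emb)
```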
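
And here is a sketch of the fine-tuning suggestion in point 2. TinyStage2Model, the attribute name actor_id_proj, and every dimension are invented for illustration; the real stage-2 model, its losses, and its data pipeline are not shown.

```python
import torch
import torch.nn as nn

# Toy stand-in for the stage-2 model; only the actor-identity projection
# matters for this sketch (all names and sizes are assumptions).
class TinyStage2Model(nn.Module):
    def __init__(self, num_actors, hidden=256, out_dim=56):
        super().__init__()
        self.actor_id_proj = nn.Linear(num_actors, hidden)  # one-hot actor id -> embedding
        self.backbone = nn.Linear(hidden, out_dim)          # stands in for the rest of the network

    def forward(self, actor_onehot):
        return self.backbone(self.actor_id_proj(actor_onehot))

model = TinyStage2Model(num_actors=32)        # pretend this is the pretrained model

# Replace the actor-identity projection with a layer sized for the new subject,
# then fine-tune the whole network with the stage-2 objective (omitted here).
model.actor_id_proj = nn.Linear(1, 256)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

onehot = torch.ones(1, 1)                     # the single new actor
pred = model(onehot)                          # would be supervised with the tracked FLAME params
```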

Thanks for the environment tip!
