These suggestions are aimed at getting more natural, realistic audio in avatar videos. I believe they are fairly small feature changes that would make a big difference. Many people find obviously fake voices off-putting, and the avatars would sound far more natural and human with the following:
Allow the use of multiple Avatar Voice options within the same clip. If users could specify where and when to switch to each one, videos could become much more dynamic and engaging. This is not about having different people's voices speak, but rather about different versions of the same Avatar Voice (I am referring to Cloned Voices). Currently, Cloned Avatar voices can sound somewhat flat, since users can only provide a short amount of training dialogue audio. It seems to me that the best way to get different tones and moods under the current parameters is to train different versions of the same avatar voice on dialogue spoken in different moods or attitudes. For example: "This video will use 'Randy v1 [Excited],' 'Randy v3 [Casual],' and 'Randy v7 [Serious].'"
Allow users to modify the speech tempo at different points in the video, specifying where either in the transcript or by telling the agent. My suggestion would be to make micro-increments available. I know many services go from "1x" to "1.1x" or "0.9x", a 10% jump. That is too big a jump for speech to sound natural. Steps of 1% ("1x" to "1.01x" to "1.02x", etc., which would more realistically be used to ramp up to a tempo like "1.1x") or 2% ("1x -> 1.02x -> 1.04x") would be much more useful than a big, unnatural, sudden 10% jump in tempo.
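To make the ramp idea concrete, here is a rough Python sketch of how a gradual tempo change could be computed. The function name `tempo_ramp` and its parameters are made up for illustration; they are not part of any existing product or API, and this is only one possible way to generate the steps.

```python
def tempo_ramp(start, end, step_percent=1):
    """Return the tempo multipliers from start to end in small steps.

    Hypothetical helper for illustration only. step_percent is the size
    of each step in whole percent (1 -> 1.00, 1.01, 1.02, ...).
    """
    if not isinstance(step_percent, int) or step_percent < 1:
        raise ValueError("step_percent must be a positive whole number")

    # Work in whole hundredths so repeated steps do not accumulate
    # floating-point error.
    first = round(start * 100)
    last = round(end * 100)
    step = step_percent if last >= first else -step_percent

    values = list(range(first, last, step))
    values.append(last)  # always finish exactly on the target tempo
    return [v / 100 for v in values]


# Ramp from normal speed up to 1.1x in 2% steps.
print(tempo_ramp(1.0, 1.1, step_percent=2))
```

Each value could be applied to a short stretch of speech (a word or a phrase, say), so the listener hears a smooth acceleration rather than one abrupt change.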
It would be great to have a clear, obvious way to indicate pauses and silences in the transcript, whether to build tension or anticipation in certain parts or to give viewers time to read or absorb a visual. Maybe this is already possible, but if so, it's not clear how.
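As one illustration of what an obvious pause marker could look like, the Python sketch below assumes a made-up inline tag such as `[pause 2s]`. Both the tag syntax and the `split_pauses` function are hypothetical and are not something the product is known to support; the point is only to show how simple such a marker would be to write and to interpret.

```python
import re

# Hypothetical tag syntax: [pause 2s] or [pause 1.5s]
PAUSE_TAG = re.compile(r"\[pause\s+(\d+(?:\.\d+)?)s\]")


def split_pauses(transcript):
    """Split a transcript into ("speech", text) and ("pause", seconds) parts."""
    parts = []
    pos = 0
    for match in PAUSE_TAG.finditer(transcript):
        # Text spoken before this pause tag, if any.
        text = transcript[pos:match.start()].strip()
        if text:
            parts.append(("speech", text))
        parts.append(("pause", float(match.group(1))))
        pos = match.end()

    # Whatever remains after the last pause tag.
    tail = transcript[pos:].strip()
    if tail:
        parts.append(("speech", tail))
    return parts


for part in split_pauses("And the winner is... [pause 2s] Randy!"):
    print(part)
```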