# "The Name Of The Game Is Power" - AI Music Video Creation Process
See the video here: https://www.youtube.com/watch?v=B_xeXRn-hc8
100% AI visuals, 100% human-made music
Music: "Gone In 60 Seconds" by Mark DK Berry (available on markdkberry.bandcamp.com)
## Introduction
The release of the Wan 2.1 i2v (image-to-video) model for ComfyUI in early March 2025 opened up a whole new world of creative possibility. The track "Gone In 60 Seconds" seemed like a good choice for my next music video to test it out.
## Hardware & Setup
- Equipment: RTX 3060 (12GB VRAM), Windows 10, 32GB system RAM
- Required Installations: SageAttention, Triton, and TeaCache (a quick import check is sketched after this list)
- Note: The first SageAttention install broke my ComfyUI installation and required a rebuild (a blessing in disguise, as my setup had become bloated)
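Since a bad SageAttention install can take ComfyUI down with it, it is worth sanity-checking the Python environment before launching. Below is a minimal sketch, assuming SageAttention and Triton were pip-installed into the same venv that ComfyUI runs from (TeaCache ships as a ComfyUI custom node, so it can't be checked this way):

```python
# Minimal sanity check: confirm the attention add-ons import cleanly in the
# ComfyUI venv before launching (assumes pip-installed packages).
import importlib

for pkg in ("torch", "triton", "sageattention"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: OK ({getattr(mod, '__version__', 'unknown version')})")
    except ImportError as err:
        print(f"{pkg}: MISSING - {err}")
```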
## Workflow Overview
I spent a few days researching workflows, then picked the best I could find. The workflow is available here.
I ended up using the same i2v workflow almost exclusively because it included interpolation and upscaling, which cut down post-production time. For previous AI music videos I had limited myself to 5 days, but with image-to-video I aimed for better quality and more accurate depictions, so I gave it more time: I budgeted 10 days and completed this version in 8.
## Image Creation Process
1. Used Flux-dev-fp8 in ComfyUI to create base images (1344 x 768); a rough diffusers equivalent is sketched after this list
2. Used Krita with the ACLY AI plugin (SDXL and Flux models) for segmenting, inpainting, and tweaking
3. Created my own LoRA in Flux (a 3-hour process) to avoid copyright issues after previous experiences (the LoRA training workflow is attached in the files section; it worked on an RTX 3060 with 12GB VRAM)
4. Used Shotcut (or Topaz) to interpolate from 16 to 24 fps (I didn't use the Topaz enhancement features, as testing showed they didn't help much; they just improved the clarity of the underlying digital gremlins)
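I generated the base images in ComfyUI rather than in code, but for anyone who prefers scripting, a rough diffusers equivalent is sketched below. This is an illustration, not the actual workflow: it loads the full FLUX.1-dev weights rather than the fp8 checkpoint, and the prompt and seed are placeholders.

```python
# Sketch of base-image generation with diffusers (not the ComfyUI workflow
# used for this video). Requires the gated black-forest-labs/FLUX.1-dev weights.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keeps VRAM usage workable on a 12GB card

image = pipe(
    "a 1970s muscle car on a rain-slicked street at night",  # placeholder prompt
    width=1344,
    height=768,
    guidance_scale=3.5,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("base_image.png")
```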
## Prompt Engineering
Once the input images were ready, I used AI assistants (Claude, Grok, or ChatGPT) to generate prompts based on the criteria from the prompt extender developed by the Wan developers. This worked much better than simple 3-sentence requests (copy and paste the criteria from the code here: https://github.com/Wan-Video/Wan2.1/blob/main/wan/utils/prompt_extend.py).
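The mechanics are simple: paste the system prompt from prompt_extend.py into the assistant, then feed it a short shot description. A minimal sketch, assuming an OpenAI-compatible API and model name; SYSTEM_PROMPT is a placeholder for the text copied from that file:

```python
# Sketch: expand a short shot description into a detailed Wan-style video
# prompt via an LLM. SYSTEM_PROMPT stands in for the prompt-extender text
# from wan/utils/prompt_extend.py; the model name is an assumption.
from openai import OpenAI

SYSTEM_PROMPT = "<paste the system prompt from prompt_extend.py here>"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "A motorcycle accelerates down a desert highway at dusk."},
    ],
)
print(response.choices[0].message.content)
```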
## Video Generation
- Used the Wan 2.1 i2v workflow to create 3-second video clips (16fps, with upscaling and interpolation)
- Most clips required 3-4 attempts, but some (especially car & motorcycle acceleration shots) took 20+ attempts
- Tried multiple Wan models, but all struggled with moving vehicles while keeping other elements stationary
- Seed changes resolved most issues (a batch re-roll sketch follows this list)
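Because so many clips needed re-rolls, batching seed variations saves babysitting the queue. Here is a minimal sketch using ComfyUI's HTTP API, assuming the i2v workflow was exported with "Save (API Format)"; the sampler node id "3" is an assumption that must be matched to your own export:

```python
# Sketch: queue the same Wan i2v workflow several times with fresh seeds via
# ComfyUI's HTTP API. Node id "3" and its "seed" input are assumptions; check
# your own API-format export for the real sampler node id.
import json
import random
import urllib.request

with open("wan_i2v_workflow_api.json") as f:
    workflow = json.load(f)

for _ in range(4):  # most clips took 3-4 attempts
    workflow["3"]["inputs"]["seed"] = random.randint(0, 2**32 - 1)
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=json.dumps({"prompt": workflow}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print("queued:", json.loads(resp.read())["prompt_id"])
```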
## Post-Production
- Assembled the clips in DaVinci Resolve
- Spent time replacing unusable shots
- Accepted the remaining imperfections and drew a line in the sand
## Lessons Learned
- Organization: This process required tracking much more information than before, including shot-naming conventions and multiple CSV sheets to manage good/bad takes (a hypothetical tracking sketch follows this list). When lipsync and ambient audio come along, this is going to be challenging work for one person.
- Flexibility: The story I began with wasn't the story I ended up creating - it's best to start with a plan but allow the AI to redirect it
- Quality vs. Storytelling: Sometimes sacrificing visual quality for storytelling works better, especially given hardware limitations
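For anyone setting up similar bookkeeping, a hypothetical take-tracking sheet might look like the following; the column names and values are illustrative, not the actual CSV layout used for this video:

```python
# Hypothetical take-tracking CSV (columns are illustrative): one row per
# render attempt, flagging which takes are usable.
import csv

FIELDS = ["shot", "take", "seed", "usable", "note"]
rows = [
    {"shot": "car_accel_01", "take": 7, "seed": 1234567, "usable": "no", "note": "wheels stayed static"},
    {"shot": "car_accel_01", "take": 8, "seed": 7654321, "usable": "yes", "note": "final pick"},
]

with open("takes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```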
## Final Thoughts
This AI model really feels like a leap into a new dawn of visual storytelling. Just as silent movies emerged in the 1920s, 2025 is the AI equivalent. Perhaps independent artists and activists will even stand a chance of challenging the Hollywood & Netflix narratives within a year or two.
Follow and support me on my social media accounts as I continue this journey into AI music video creation. I hope you enjoyed this one.
markdkberry.com
markdkberry.bandcamp.com
IG @ markdkberry
X @ markdkberry
#aimusicvideo #ai #music #musicvideo #markdkberry #wanx #comfyui
## Attached Files
The following file is attached for this workflow:
- Lora-training-workflow_adafactor_splitmode_dimalpha64_3000steps_low10GBVRAM.json