Localizing On-Screen Lyrics in “Tokyo Winter Session” Music Video

By Xingyue (Silver) Zhang

Several months ago, my friend recommended a song by HoneyWorks called “Tokyo Summer Session”, a lovely anime song about conversations between three Japanese high school couples going to a summer festival. Later, I found out that there are three other songs in this series to complete all four seasons, and I ended up choosing “Tokyo Winter Session” as my asset to work with. I localized the on-screen lyrics in the first minute from Japanese into English and Chinese with Adobe After Effects and Photoshop.

Please check out the video below to get a glimpse of the whole process!

1-minute walkthrough video

Here is a comparison of the original clip (left) and my localized version (right).

Original Japanese Version (Left) vs Localized English Version (Right)
Original Japanese Version (Left) vs Localized Chinese Version (Right)

When I first casually watched this music video and hummed along, I thought this would be an easy project to work on, as 95% of the text is in solid-color textboxes. Even though there are two transition scenes that I knew would require rotoscoping for sure, I didn’t think it would be too bad. However, I was naive for thinking this, because there ended up being many details I didn’t notice until I actually examined the video frame by frame! Here are the top 3 challenges I encountered:

  • Hiding the text (mask layer & light rotoscoping)
  • Text Animation (ramp up & bouncing effects)
  • Background Scene Transitions (heavy rotoscoping & content-aware fill)

Hiding the text (mask layer & light rotoscoping)

In this one-minute clip, there are two types of text boxes: black boxes (internal monologue) and solid-color text boxes (dialogue between characters).

When hiding the text in the dialogue, I used After Effects to create a shape layer, picked the solid background color with the eyedropper, drew the mask, set up keyframes for locations (the mask layer needed to move along with the text box), and adjusted the opacity value when the text box faded out. 

Tons of position keyframes to keep the overlay mask moving on the right track (in red)
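For anyone curious about what those position keyframes are doing, here is a minimal Python sketch of the idea, not anything exported from my project. It assumes plain linear interpolation between keyframes (After Effects also offers eased interpolation), and the function name `position_at` and the sample keyframe values are made up for illustration.

```python
from bisect import bisect_right


def position_at(keyframes, frame):
    """Return the (x, y) position of the mask at a given frame.

    keyframes: list of (frame_number, (x, y)) sorted by frame_number.
    Assumes linear interpolation between neighbouring keyframes.
    """
    frames = [f for f, _ in keyframes]
    # Before the first or after the last keyframe, hold the end value.
    if frame <= frames[0]:
        return keyframes[0][1]
    if frame >= frames[-1]:
        return keyframes[-1][1]
    # Find the pair of keyframes that surround this frame.
    i = bisect_right(frames, frame)
    (f0, (x0, y0)), (f1, (x1, y1)) = keyframes[i - 1], keyframes[i]
    t = (frame - f0) / (f1 - f0)
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))


# Hypothetical keyframes: the box slides right, then down.
keys = [(0, (100, 200)), (10, (200, 200)), (20, (200, 300))]
print(position_at(keys, 5))   # (150.0, 200.0)
print(position_at(keys, 15))  # (200.0, 250.0)
```

The more irregular the text box's movement is, the more keyframes it takes before the in-between positions line up with the original, which is why the timeline ended up so crowded.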

However, this mask layer method could not be applied to places where two dialogue boxes overlap each other, so I had to export this short section (around 4-5 frames) to Photoshop, manually remove the text that was underneath, and reimport the Photoshop sequence (series of .psd files) back in as footage. 

I used the same trick to hide the text in the black boxes: because the text appears character by character as the box unfolds, a single solid black layer covering the whole box wouldn’t help.

Box unfolds

There are 21 lines of lyrics in the clip, and as you can see from my working folder, I exported every transition section between the lines and cleaned them up in Photoshop – this is the biggest thing that I failed to notice before starting the project.

Text Animation (ramp up & bouncing effects)

Just like the mask layers, the lines of lyrics also needed to move along with the dialogue boxes, so it was just as time-consuming as tweaking the mask to the correct location at the correct time. Besides that, there were two other text animations that I spent more time playing around with:    

  1. Ramp Up 
  2. Bouncing 

Almost all the text in the black boxes shows up with the “Ramp Up” effect (or wavy effect, as I like to call it, because they look like waves), and there are many values to tweak in order to get similar “wavelengths” and “altitudes” to the original ones. 

The lines in dialogue boxes appear with a bouncing effect, which essentially involved scaling the text up and down in sync with the box animation. I had to drag the time indicator through those sections frame by frame, over and over, just to see how the size of the text box was changing, and then apply those subtle changes to the text accordingly. It was actually kind of fun when I did the first one or two, but it was another story after doing ten in a row.

Flipping+Bouncing Effect
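To give a feel for the kind of scale curve a bounce follows, here is a small Python sketch. It is only an illustration: I matched the real values by eye, frame by frame, and the damped-cosine formula, the function name `bounce_scale`, the `decay` and `freq` defaults, and the 24 fps frame rate are all assumptions rather than values taken from the video.

```python
import math


def bounce_scale(t, decay=6.0, freq=2.0):
    """Scale percentage at time t (seconds) for a pop-in bounce.

    Starts at 0%, overshoots past 100%, then settles at 100%.
    decay controls how fast the bounce dies out; freq is bounces per second.
    Both defaults are illustrative, not measured from the original clip.
    """
    wobble = math.exp(-decay * t) * math.cos(2.0 * math.pi * freq * t)
    return 100.0 * (1.0 - wobble)


# Sample the curve once per frame for half a second at an assumed 24 fps.
fps = 24
for frame in range(13):
    print(frame, round(bounce_scale(frame / fps), 1))
```

Each printed value corresponds to a scale keyframe on one frame, which is roughly what I was setting by hand for every bouncing line.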

Background Scene Transitions (heavy rotoscoping & content-aware fill)

Dealing with the background scene transitions was the biggest challenge in this whole project because in this case, I couldn’t do the solid-color mask trick anymore, so I had to spend a lot of time removing the text from a more complicated background in Photoshop. 

There are two major background scene transitions in this clip. The first one is at 00:29-00:30 when Natsuki (the female character) says “もー!(Argh!)”

Argh!

The second one is a line of lyrics “影想いを近づけた” in the final scene that lasts 4 seconds. 

As I have been using Photoshop for manga typesetting for almost a year, I was confident with my text-removing skills using content-aware and other functions in PS. Therefore, I chose to manually clean the 20 PSD files for the first major transition, and honestly, it wasn’t too bad, as I could easily mask most parts of the text by cutting, copying, and pasting the clean parts I needed from the previous frames and tweaking them. It was just the repetition that tired me out a bit.

20 PSD files is still doable

When I found out that there were almost 100 PSD files for the second one, I was pretty sure that there was no way I could do this manually in PS, and I had to use the content-aware fill (CAF) function in AE instead.

Oh well…

So, I followed a CAF tutorial video on YouTube and started my trial experiment. I first used the pen tool to draw masks for each character and then subtracted the layers. Before clicking “Generate Fill Layer” in the content-aware fill window, I added a couple of reference frames (I opened each frame in PS and manually removed the text, just as I did for the previous transition scenes) to help AE better analyze the background and produce a better content-aware fill result.

10 reference frames helped content-aware fill better analyze the background

The first generated result was not very satisfying, so I kept adding reference frames in the places where AE messed up and regenerating. Finally, with 10 reference frames, I got a satisfying result, and you can barely see any trace of edits after I put the translated line back in! Even though I spent more than two hours on this transition, I am really happy with the final outcome!

Looking back on my journey

I have to admit that I underestimated the workload from the very beginning, so I spent twice as much time as I expected. I should have looked at the video more closely and evaluated the project more comprehensively. That being said, I enjoyed the whole process, playing around with the settings I learned and exploring new functions on my own. I am also glad that I still love this song and am still willing to sing it even after 15 hours of work!
