Shot Composition for Short-Form Video

Composition principles from DPs who shoot real movies (Deakins, Yeoman, Fraser, Messerschmidt) re-anchored to the 9:16 phone frame, with safe-zone math, real creator examples, and decisions you can make on set.

Beginner · 18 min read · Updated May 2026
#framing #composition #angles #cinematography #visuals

Composition Starts With "Use Your Eyes"

By Bell Chen, founder. Updated May 18, 2026.

In a PetaPixel interview, the cinematographer Roger Deakins, who has been nominated for the Best Cinematography Oscar sixteen times and finally won for Blade Runner 2049 and 1917, told the publication: "I am familiar with the 'rule of thirds', but I have not considered it a conscious part of my approach since I attended art college in the 1960s." His advice for capturing a good image was four words: "just use your eyes." The quote is worth opening on because the rule of thirds is the first thing every short-form composition guide on the internet teaches you, and the working cinematographer most people would name as the best in the world has not consciously thought about it in sixty years.

This page is not a rebuttal of the rule of thirds. It is an attempt to recover composition as a set of decisions a real DP would defend, then apply each one to the 9:16 phone frame the way a working DP would if their next shoot were a TikTok. Most of the named cinematographers below have won the Best Cinematography Oscar. The named creators below produce the short-form work that most directly inherits those craft decisions, under the same kind of constraints Deakins, Yeoman, Fraser, and Kamiński face daily on the floor.

Why Composition Decides Whether the Viewer Keeps Watching

Erik Messerschmidt, who won the Best Cinematography Oscar in 2021 for Mank, told Leitz Cine in an interview with director David Fincher, "I think cinematographers are often over-credited for the way a movie looks, and under-credited for the way the story is told." The line is the cleanest formulation of the principle that matters most for the phone frame: composition is not aesthetics, composition is the story telling the viewer where to look and what to feel about what they see.

On a 6.1-inch phone screen at arm's length, the viewer makes the keep-watching decision in roughly the same window the TikTok algorithm uses to decide whether to widen distribution (about three seconds), so the composition has to do its work fast. If the viewer cannot identify the subject of the frame in the first second, no amount of writing or editing in the next twenty seconds recovers it. The composition does not have to be clever. It has to be legible.

Jenny Hoyos, profiled at length in Marketing Examined's short-form playbook, articulates the same constraint from the YouTube Shorts side: "the hook needs to be very visual," with the first frame "so compelling that it could stand alone as a thumbnail for a long-form video." Hoyos sketches her opening frames on paper the way Roger Deakins printed still pre-visualizations on a Sony point-and-shoot for 1917, per No Film School's reporting on Deakins's process. Different scale of production. Same discipline. The opening frame is the contract with the viewer about what they are looking at.

The mistake, in my read of short-form accounts I audit, is treating composition as a "look" rather than as a decision tree. The decision tree has three branches. What is the shot size. What is the angle. What goes in the negative space. Every other "rule" people teach (thirds, leading lines, headroom) is downstream of these three.

Shot Size, Decided for the Phone Frame

Greig Fraser, who won the Best Cinematography Oscar for Dune in 2022, told Cinematography World about his approach with Denis Villeneuve that he and Villeneuve composed Dune as "a combination of wide shots showing the sweeping landscapes of the desert planet Arrakis with extreme close-ups for intimate moments between characters." Fraser believes that "switching between wide shots and closeups in a single scene can further the drama and scope of a moment." That alternation is the structure that holds Dune together visually. It is also the structure that holds together the short-form clips that perform on the phone.

The reason is mechanical. The phone screen is small. A wide shot of three people in a room reads to a viewer holding a phone at arm's length as a blurry beige rectangle. A close-up of one face reads as a face. The platform-native composition is the close-up by default. Cuts to wider sizes earn their place by carrying specific information (a location reveal, a piece of body language, a product context) that the close-up cannot show.

The seven traditional shot sizes (extreme close-up, close-up, medium close-up, medium, medium wide, wide, extreme wide) all remain available on the phone. The decision is which one is the resting state. For 95% of short-form accounts shipping talking-head, reaction, demo, or commentary work, the resting state is the medium close-up (chest to forehead). The close-up (collar to forehead) carries emotional beats. The extreme close-up (one feature filling the frame) carries hooks. The wider sizes (medium, medium wide, wide) carry context inserts that cut back to the medium close-up within two or three seconds.

The accounts that get this wrong tend to default to the medium shot (waist up) because it is what they see in news interviews on television. On a 16:9 horizontal screen at a viewing distance of twelve feet, the medium shot reads. On a 9:16 vertical screen at arm's length, the medium shot wastes half the frame on chest and shoulder. The same person framed as a medium close-up doubles the perceived presence at zero additional cost.

The exception is dance, performance, and full-body movement, where wider sizes earn their place because the body is the subject. Charli D'Amelio's early TikToks worked at medium-wide because the choreography was the shot. A talking-head founder explaining a B2B product at medium-wide is just an undersized founder. The shot size has to match what the audience came to see.

Angle, Decided for the Phone Frame and the Viewer's Lap

Camera angle (where the lens sits relative to the subject) carries the second-largest amount of unconscious information after shot size. The eye-level angle reads as neutral, peer, equal. The high angle reads as smaller-than, weaker-than, observed. The low angle reads as larger-than, more powerful, watched-from-below. The Dutch (canted) angle reads as off-balance, wrong, unstable.

The unstated assumption in most short-form composition advice is that the camera and the viewer are at the same height. They are not. The viewer is holding a phone in their lap or in front of their chest while reclining in bed or sitting on a couch. The viewer's eyeline is below the phone. If the creator shoots at the creator's own eye level (camera at the creator's eye height, sitting upright), the viewer experiences the camera as slightly above their own eyeline, and the resulting angle reads as a slight high angle on the viewer. The viewer feels mildly looked-down-on. This is one of the reasons "founder at desk explaining product" content underperforms even when the script and hook are strong.

The fix is to shoot the camera slightly below the creator's eye line, so the viewer's natural eyeline aligns with the creator's mouth or chin rather than with the creator's forehead. The result reads, to the viewer holding the phone in their lap, as eye-to-eye. Casey Neistat, whose vlog framing was reverse-engineered at length in In Depth Cine's breakdown of how he changed vlogging, builds this in mechanically by shooting up close on a Sony 12-24mm f/2.8 wide-angle lens with the camera positioned slightly low and the lens distortion pulling the viewer into the frame. The technique reads as intimacy. It is angle plus distortion solving the phone-in-lap problem.

The low angle (camera meaningfully below the creator's chest, looking up) reads on the phone as authority, confidence, or imposition. Cluely's office series uses this consistently for their cold opens, and the viewer perception lifts even before the script delivers. The high angle (camera above the creator's head looking down) reads as confession, intimacy, or vulnerability. Both are tools. The mistake is using them by accident.

The Dutch angle, which Janusz Kamiński used to inject unease into the Omaha Beach landing in Saving Private Ryan per American Cinematographer's 1998 feature on the shoot, carries the same destabilizing read on the phone. It is a strong choice. Use it for two-second cuts inside a fast-paced edit, not for a thirty-second talking-head section, or the viewer's eye gets tired and they scroll.

What "Rule of Thirds" Means on the Phone

The rule of thirds (divide the frame into a 3x3 grid; place subjects on the lines or at the intersections) is the most over-taught composition rule on the internet. Roger Deakins's PetaPixel quote (above) suggests it is not how one of the most decorated DPs in cinema thinks about a frame. So why does it appear in every short-form video guide?

The honest answer is that the rule of thirds is useful as a default when the creator does not yet have an opinion about the shot. A face placed on the upper-third horizontal line will not be wrong. The eyes at the upper third intersection will not be wrong. The rule is correct as scaffolding. It is wrong only when treated as the goal rather than as the floor.

The opposite move, centered framing, is the signature of Wes Anderson's longtime cinematographer Robert Yeoman, who shot every Anderson live-action feature from Bottle Rocket through Asteroid City. Cooke Optics, in their feature on the Anderson-Yeoman partnership, describes the working method: Yeoman has his camera assistant measure from the lens to the corners of the room when setting up a wide shot, so the camera is placed dead-center on the architecture. The result is The Grand Budapest Hotel, Asteroid City (shot in 1.37:1 for the black-and-white sequences and 2.40:1 for the color, per Kodak's Asteroid City production blog), and the entire visual vocabulary that has been parodied into the ground by "Wes Anderson TikToks" since 2023.

The Anderson-Yeoman lesson for short-form is that centered composition on a 9:16 phone frame is not the failure mode TV training warns against. It is a legitimate compositional choice that solves several phone-specific problems at once. It survives the platform UI (TikTok's right-side action column, captions across the bottom band of the frame, the username overlay) because everything important stays in the protected center column. It survives uncertain crop on shared playback (Instagram Reels reposted to TikTok and back). It reads as deliberate rather than as a snapshot.

The rule that matters on the phone is not "thirds or center." It is "no accidental placement." Decide where the subject sits in the frame before you press record. If the placement is thirds, defend the thirds. If the placement is center, defend the center. The decision is the composition. The grid is a tool, not a verdict.

The 9:16 Safe Zone, With the Math

Per Kreatli's 2026 TikTok safe-zone documentation and Clickyapps's vertical-framing reference, the platform-specific safe zone on a 1080x1920 canvas is roughly: 150 pixels of margin from the top, 250 pixels of margin from the bottom (the bottom 13% is platform UI for action buttons and captions), 130 pixels of margin from each side. Anything outside that rectangle is partially or fully covered by the platform interface, the user's thumb during scroll, or the platform's caption track.

The compositional rule that follows is mechanical. Place the subject's eyes at roughly the upper-third line of the safe rectangle, not the upper-third line of the full frame. Subtract the top 150 pixels and the bottom 250 pixels from the 1920-pixel height. The protected zone is 1520 pixels tall. One-third of that is about 507 pixels, so the upper-third line inside the zone sits about 657 pixels (150 + 507) from the top of the full frame, or roughly the 660-pixel mark. That is where the eyes go. The unadjusted upper-third line of the full frame sits at pixel 640, only about 20 pixels higher, so the two placements are nearly the same; using the safe rectangle moves the eyes slightly down. Either is defensible. Floating heads at the literal top of the frame are not.
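The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a platform API: the `SafeZone` class and its field names are hypothetical, and the margins are the approximate figures quoted above, which may differ by platform and app version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeZone:
    """Hypothetical helper: safe rectangle on a vertical canvas, in pixels."""
    width: int = 1080
    height: int = 1920
    top: int = 150      # approximate top margin quoted in the text
    bottom: int = 250   # approximate bottom margin quoted in the text
    side: int = 130     # approximate margin on each side

    @property
    def safe_height(self) -> int:
        return self.height - self.top - self.bottom

    @property
    def safe_width(self) -> int:
        return self.width - 2 * self.side

    def eye_line(self) -> float:
        """Upper-third line of the safe rectangle, measured from the top of the full frame."""
        return self.top + self.safe_height / 3


zone = SafeZone()
print(zone.safe_height)         # 1520
print(round(zone.eye_line()))   # 657, i.e. roughly the 660-pixel mark
print(round(zone.height / 3))   # 640, the unadjusted upper-third line
```

Changing the margin defaults is enough to re-run the same placement for a different canvas or a different platform's overlay.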

The bottom safe-zone constraint is the more commonly violated one. Creators who put on-screen text in the bottom band of the frame (where TikTok's caption track lives by default) get their text covered the moment a viewer enables captions or the platform auto-generates them. The fix is to stack text above the bottom 250-pixel band, not in it. This is a constraint Anderson and Yeoman never had to solve for. It is the central craft problem of short-form composition that no traditional cinematography course teaches.
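A quick way to check a text overlay against that constraint is to test its bounding box before export. The sketch below assumes a 1080x1920 canvas and the approximate margins quoted above. The function names `overlay_fits` and `lift_above_bottom_band` are hypothetical and not part of any editing tool.

```python
def overlay_fits(x, y, w, h, *, canvas=(1080, 1920), top=150, bottom=250, side=130):
    """Return True if a box with top-left corner (x, y) and size (w, h)
    lies fully inside the safe rectangle. All values are in pixels."""
    canvas_w, canvas_h = canvas
    return (
        x >= side
        and y >= top
        and x + w <= canvas_w - side
        and y + h <= canvas_h - bottom
    )


def lift_above_bottom_band(y, h, *, canvas_h=1920, bottom=250):
    """Return a y position that keeps the box's lower edge out of the bottom band."""
    return min(y, canvas_h - bottom - h)


# An 800x120 caption box whose lower edge lands at pixel 1620: inside the safe area.
print(overlay_fits(140, 1500, 800, 120))   # True
# The same box 100 pixels lower reaches pixel 1720 and enters the bottom band.
print(overlay_fits(140, 1600, 800, 120))   # False
# Lifting it so the lower edge sits exactly at pixel 1670.
print(lift_above_bottom_band(1600, 120))   # 1550
```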

What Goes in the Empty Space

The traditional composition vocabulary calls this "negative space," which is correct but misleading because the space is not empty. It is the space where the audience reads context. Erik Messerschmidt, on Mank, told Leitz Cine that the choice of a slightly wider screen format was deliberate: "We wanted to make sure we have that extra space on the edges." The extra space carried the period, the room, the Old Hollywood production design. Without it, the close-ups would have collapsed onto the actors and lost the world the script depended on.

On the phone, the negative space is a problem because there is much less of it horizontally. The full 9:16 frame is 1080 pixels wide. The safe zone is roughly 820 pixels wide once you subtract the side margins. The horizontal negative space available for composition is roughly one-third the width of a 16:9 frame at the same height (1080 pixels against about 3413). The implication is that horizontal placement variation (left-third versus right-third for the subject) carries less compositional weight than it would in cinema. Vertical placement variation (eyes higher in frame, lower in frame) carries more.
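The width comparison can be reproduced with the short calculation below. It is a sketch under the same assumed margins; `width_share` is a hypothetical helper name, and the 16:9 frame is taken at the same 1920-pixel height as the vertical canvas.

```python
def width_share(frame_w=1080, frame_h=1920, side_margin=130):
    """Compare vertical-frame width with a 16:9 frame of the same height."""
    safe_w = frame_w - 2 * side_margin   # usable width inside the side margins
    landscape_w = frame_h * 16 / 9       # width of a 16:9 frame 1920 pixels tall
    return safe_w, frame_w / landscape_w, safe_w / landscape_w


safe_w, full_share, safe_share = width_share()
print(safe_w)                 # 820
print(round(full_share, 2))   # 0.32, the full frame is about one-third as wide
print(round(safe_share, 2))   # 0.24, the safe zone is closer to one-quarter
```

Under these assumptions, the safe zone leaves even less horizontal room than the full-frame figure suggests, which supports leaning on vertical placement.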

The decision tree for what goes in the negative space, ranked by how often it earns its place on short-form:

The single most useful element is a clean, recognizable, content-related background that does not compete with the subject. A bookshelf for a knowledge-work creator (Ali Abdaal's signature). A workshop for a maker. A kitchen for a food creator. The background does not have to be elaborate. It has to read in under a second and not contradict the script. The Casey Neistat fix (per the In Depth Cine breakdown) is to shoot wide enough that the New York street behind him is part of the frame, then let the wide-angle distortion pull the viewer in. The street is the negative space. The street is also the entire context for why a vlog from inside an apartment would not work as well.

The second most useful element is on-screen text that is part of the composition rather than bolted on top of it. Hoyos's fast-food recreations open with a fixed "$1" text overlay against the front of the restaurant location. The text is in the negative space. It is also doing hook work the visual could not do alone. The mistake is to use text as an alternative to composition rather than as part of it.

The third is product, prop, or artifact placed deliberately in the frame. Ramp's "Brian's Office" stunt placed Andy Buckley sitting in a glass box on Flatiron Plaza with paper expense receipts piled around him; the glass box and the receipts were the composition. The mistake on most B2B short-form is to default to "founder talking" with no prop or artifact at all, which collapses the entire shot onto the face and forfeits the negative space for context.

Vertical-Specific Composition Decisions

Vertical video (9:16) is not horizontal video rotated 90 degrees. It is a different frame with different physics. Three rules survive from horizontal cinematography. Three new rules emerge from the phone.

The three rules that survive: shot size discipline (medium close-up as resting state), angle discipline (eye level minus a touch, not eye level), and headroom (eyes on or near the upper third). These are the floor that Deakins's "just use your eyes" advice still defends. A face with no headroom on a phone frame still reads as cramped. A face floating at the top of the frame still reads as off.

The three new rules that emerge from the phone:

Centered subjects work better than off-center ones, per the Anderson-Yeoman case above and per the safe-zone pixel math. The protected center column is the only column that survives every platform's UI overlay.

Vertical movement (tilts up, drops down, vertical reveals) reads better than horizontal movement (pans left to right) in 9:16. The frame is taller than it is wide. Movement that exploits the long axis lands. Movement that fights the long axis feels truncated. Robert Yeoman's signature whip pans, per his Hunger Magazine breakdown, are a horizontal-frame device. Their vertical-frame equivalent is the whip-tilt, which has not yet been popularized at the same level on TikTok but appears in the strongest accounts (Sam Kolder's travel-vlog "camera-through-the-floor" transitions, per the Sam Kolder Effect breakdown at Rugged Road Trips).

Closer is better. On a 6.1-inch phone screen at arm's length, the close-up reads as a face. The medium reads as a torso. The wide reads as a beige rectangle. The single most reliable composition note I give short-form creators is "default one shot size tighter than your gut tells you," with the gut calibrated by people who learned cinematography on a 27-inch monitor.

Background as Set Design

Hoyte van Hoytema, who shot Oppenheimer for Christopher Nolan and won the Best Cinematography Oscar in 2024, told The Credits in a Motion Picture Association feature that he approaches cinematography "musically," meaning the rhythm of where the camera sits and what enters frame is composed in time, not just in space. The implication for short-form is that the background is a temporal decision, not a static one. What the viewer sees behind the creator in second three should not be identical to what they see in second twenty, or the eye gets bored and the brain starts looking for the scroll.

The fix is one of three moves. Either the camera moves (a slow push in or pull out, per the camera-movement guide). Or the creator moves through the frame (sitting forward, standing up, picking up a prop). Or the background contains a piece of incidental motion that changes (a window with traffic, a plant catching breeze, a screen behind the creator playing a related clip).

The background composition mistakes that drag down even the strongest scripts are the boring ones: a beige-shirted founder sitting three feet from a featureless beige wall, with no separation, no depth, and no motion over time. The fix is mechanical. Move four to six feet off the wall. Put a lamp or a plant in the foreground at the edge of frame. Add one piece of negative-space motion (a screen, a window, a moving element) somewhere in the frame. The composition reads as designed instead of as accidental.

What This Means for Your Next Shoot

Before you record, make four decisions out loud, in this order. First, shot size: medium close-up unless you have a reason to deviate. Second, angle: a touch below your own eye level so the viewer's eyeline reads as peer-to-peer. Third, where the eyes sit in the frame: upper-third of the safe rectangle (around pixel 660 from the top of a 1080x1920 canvas). Fourth, what goes in the negative space: a content-related background, a piece of on-screen text that is part of the composition, or a prop that reads as a context cue. (If you are auditing a competitor's frame composition section by section, the brand analysis that Superdirector runs surfaces the named hooks and openers a competitor uses repeatedly; it is one option among several.)

If any of those four decisions feels arbitrary, the composition will read as arbitrary. The named DPs above all defend their decisions in print. Yours should survive the same standard.

FAQ

Should I use the rule of thirds or center my subject?

Either, as long as you defend the choice. Roger Deakins told PetaPixel he has not consciously applied the rule of thirds since the 1960s. Robert Yeoman and Wes Anderson, per Cooke Optics's feature on their partnership, center their subjects on the architecture using measured camera placement. Both produce Oscar-tier work. The principle is intentional placement, not the specific grid.

What shot size should I default to on TikTok or Reels?

Medium close-up (chest to forehead) for talking-head, reaction, demo, and commentary content. Close-up (collar to forehead) for emotional beats. Extreme close-up for hooks. Wider sizes only when the body or environment is the subject. The phone screen rewards proximity.

Where exactly should I place text overlays on a vertical video?

Inside the safe rectangle, which on a 1080x1920 canvas runs roughly from pixel 150 to pixel 1670 measured from the top (the bottom 250 pixels are margin) and sits 130 pixels in from each side. Per Kreatli's 2026 safe-zone documentation, text in the bottom 13% gets covered by platform-generated captions and action buttons. Stack text above the bottom 250-pixel band, never in it.

Is the Dutch angle a useful tool on phone-frame video?

Sparingly. Per American Cinematographer's 1998 feature on Saving Private Ryan, Janusz Kamiński and Steven Spielberg used canted angles to inject unease into the Omaha Beach landing. The phone-frame version of this read is identical (off-balance, urgent, wrong-feeling). Two-second cuts inside fast-paced edits land. Thirty-second talking-head sections shot Dutch look like a recording mistake.

Does the background matter if my script is strong?

Yes, because the viewer reads the background in the first half-second and the script does not start paying off for several seconds. Greig Fraser told Cinematography World that wide-shot composition on Dune was deliberately built around "sweeping landscapes of the desert planet Arrakis" alternating with close-ups so the viewer always knew where they were. The phone-frame version is humbler: a recognizable, content-related background that reads in under a second and does not contradict the script.

How wide should I shoot if my space is small?

Casey Neistat's signature, per In Depth Cine's breakdown of his vlog framing, is a Sony 12-24mm f/2.8 wide-angle lens at the wide end, positioned close to the subject. The combination pulls the viewer into the frame and exaggerates the sense of space. On a phone, the wide-angle native lens (the 0.5x or 13mm equivalent on recent iPhones) does similar work. Use it when the room is small enough that a normal focal length would feel cramped. Expect the edge distortion to feel intentional.
