

# Modular AWG-based Interconnection for Large-Scale Data Center Networks

Tong Ye, Tony T. Lee, Mao Ge, and Weisheng Hu yetong@sjtu.edu.cn

State Key Lab of Advanced Optical Communications and Networks Shanghai Jiao Tong University



### Outline

# Background

- AWG-based Interconnection
- Modular AWG-based Interconnection
- Application to Data Center Networks
- Conclusion



World-wide information service infrastructure



#### Google World-wide Data Center Map

#### Amazon Web Service's Global Infrastructure



[1] http://datacenterfrontier.com/regional-data-center-clusters-power-amazons-cloud/[2] Sushant Jain et. al., "B4: Experience with a Globally-deployed Software Defined Wan", ACM SIGCOMM, Oct. 2013, pp. 3-14.

# Footprint of Data Centers (DCs)





Number of Server Racks

## A mega DC requires a large number of long cables with very high capacity

# Cabling Problem



#### • Cable maintenance is extremely difficult, when

- network connections change
- line failures occur



#### Eventually, cables become a terrible monster...

[1] N. Farrington, E. Rubow, and A. Vahdat, "Data center switch architecture in the age of merchant silicon," in Proc. IEEE HOTI, Aug. 2009.

[2] www.hpl.hp.com/techreports/2015/HPL-2015-8.html

[3] J. Mudigonda, P. Yalagandula, and J. C. Mogul, "Taming the flying cable monster: A topology design and optimization framework for data- center networks," in Proc. ATC, Jun. 2011.

# Solution: Wireless Links



- Pros: reduce number of cables
- Cons:
  - Low bandwidth (~Gb/s)
  - Serious radio interference



K. Ramachandran, R. Kokku, R. Mahindra, and S. Rangarajan, "60 GHz data-center networking: Wireless => worry less?" Technical Report, NEC, 2008.
 Y. Cui, H. Wang, X. Cheng, and B. Chen, "Wireless data center networking," IEEE Wireless Commun. Mag., vol. 18, no. 6, pp. 46–53, Dec. 2011.
 N. Hamedazimi et al., "Firefly: A reconfigurable wireless data center fabric using free-space optics," in Proc. ACM SIGCOMM, Oct. 2014.

# Solution: Optimal Device Allocation



- Idea: combine several switches to form a high radix switch, but
  - specific for Butterfly networks (not universal)
  - reduce the number of cables only by half (not scalable)



J. Kim, W. J. Dally, and D. Abts, "Flattened butterfly: a cost- effective topology for high-radix networks," in Proc. ACM ISCA, Jun. 2007.



 Replace links of each full mesh by an arrayed waveguide grating (AWG)

- Pros: reduce cabling complexity + bandwidth guaranteed  $\sqrt{}$
- Cons: AWG is not scalable if network is very large





• Achieve modular AWG-based interconnection:

- Substantially reduce cabling complexity, while preserving function of original DC networks
- Scalable even when size of DC networks is very large
- Can be applied to different DC networks



# Outline

## Background

- AWG-based Interconnection
- Modular AWG-based Interconnection
- Application to Data Center Networks
- Conclusion

# **Topology of Existing Networks**





#### Different networks have the similar subnetwork

M. F. Bari et al., "Data center network virtualization: a survey," IEEE Commun. Surveys Tuts., vol. 15, no. 2, pp. 909–928, May 2013.
 M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in Proc. ACM SIGCOMM, Aug. 2008.
 Z. Zhu, S. Zhong, L. Chen, and K. Chen, "Fully programmable and scalable optical switching fabric for petabyte data center", Opt. Express, vol. 32, no. 3, pp. 3563-3580, Feb. 2015.



- Two disjoint node sets
- Exact one fiber link from a node in one set to that in another set









Wavelength ( $\lambda$ ) set  $\Lambda = \{\lambda_0, \lambda_1, \lambda_2, \lambda_3\}$ 

- Passive => consume no power
- Provide  $N_1N_2$  links between inputs and outputs



# Output# (*j*) is determined by input# (*i*) & $\lambda$ # (*k*): $j = [k - i]_{|\Lambda|} \stackrel{\text{def}}{=} (k - i) \mod |\Lambda|$



|      | OUT 0       | 0UT 1          | OUT 2          | 0UT 3          |
|------|-------------|----------------|----------------|----------------|
| IN 0 | λ           | $\lambda_1$    | $\lambda_2$    | λ <sub>3</sub> |
| IN 1 | $\lambda_1$ | $\lambda_2$    | λ <sub>3</sub> | λ              |
| IN 2 | $\lambda_2$ | λ <sub>3</sub> | λ              | $\lambda_1$    |

Cyclic Latin Square if  $N_1 = N_2$ 





# **AWG-based Interconnection**



• Replacing fiber links in  $\mathcal{N}_A$  by an AWG yields a network  $\mathcal{N}_B$ 



# Limitations of $N \times N$ AWG



## • If *N* is very large:

- In-band crosstalk is prominent (bad physical-layer performance)
- Synthesis is very difficult
- A large number of  $\lambda$ s are required





signals at the same  $\lambda$  interferes with each other

# Modular AWG-based Interconnection



## Phase 1: AWG decomposition

- Suppress in-band crosstalk
- Cut down synthesis difficulty

- Phase 2: Wavelength reuse
  - Reduce number of required wavelengths



## Outline

## Background

- AWG-based Interconnection
- Modular AWG-based Interconnection
  - Phase 1: AWG Decomposition
  - Phase 2: Wavelength Reuse
- Application to Data Center Networks
- Conclusion



#### • $N \times N \text{ AWG} \Rightarrow N \times N \text{ network of AWGs}$ :

same routing property

<=> output# is uniquely determined by input# and  $\lambda$ #



# Example of Decomposition





 $6 \times 6$  cyclic Latin square



|                           | $Q_0$       | $Q_1$          | $Q_2$       | $Q_3$          | $Q_4$       | $Q_5$       |   |  |  |
|---------------------------|-------------|----------------|-------------|----------------|-------------|-------------|---|--|--|
| $P_0$                     | $\lambda_0$ | $\lambda_1$    | λ2          | λ <sub>3</sub> | $\lambda_4$ | $\lambda_5$ |   |  |  |
| $P_1$                     | $\lambda_1$ | λ2             | λο          | $\lambda_4$    | $\lambda_5$ | $\lambda_3$ |   |  |  |
| $P_2$                     | λ2          | λο             | $\lambda_1$ | $\lambda_5$    | $\lambda_3$ | $\lambda_4$ | Λ |  |  |
| $P_3$                     | $\lambda_3$ | $\lambda_4$    | $\lambda_5$ | λ <sub>0</sub> | $\lambda_1$ | λ2          | A |  |  |
| $P_4$                     | $\lambda_4$ | $\lambda_5$    | $\lambda_3$ | λ1             | λ2          | λ           |   |  |  |
| $P_5$                     | $\lambda_5$ | λ <sub>3</sub> | $\lambda_4$ | λ2             | λ           | λ           |   |  |  |
| $6 \times 6$ Latin square |             |                |             |                |             |             |   |  |  |

# Example: Observation 1





|       | $Q_0$       | $Q_1$       | $Q_2$       | $Q_3$       | $Q_4$       | $Q_5$          |
|-------|-------------|-------------|-------------|-------------|-------------|----------------|
| $P_0$ | λο          | $\lambda_1$ | $\lambda_2$ | λ3          | $\lambda_4$ | λ <sub>5</sub> |
| $P_1$ | λ           | λ2          | $\lambda_0$ | λ4          | λ5          | $\lambda_3$    |
| $P_2$ | λ2          | $\lambda_0$ | $\lambda_1$ | $\lambda_5$ | λ3          | $\lambda_4$    |
| $P_3$ | $\lambda_3$ | $\lambda_4$ | $\lambda_5$ | λο          | $\lambda_1$ | λ <sub>2</sub> |
| $P_4$ | $\lambda_4$ | $\lambda_5$ | $\lambda_3$ | $\lambda_1$ | λ2          | $\lambda_0$    |
| $P_5$ | $\lambda_5$ | $\lambda_3$ | $\lambda_4$ | $\lambda_2$ | $\lambda_0$ | $\lambda_1$    |

A consists of 2<sup>2</sup> 3 × 3 cyclic Latin squares
Each square is associated with a 3 × 3 AWG

# Example: Observation 2





- $\mathbf{M}_0$  is defined on  $\lambda$ -set { $\lambda_0, \lambda_1, \lambda_2$ }
- $\mathbf{M}_1$  is defined on  $\lambda$ -set { $\lambda_3$ ,  $\lambda_4$ ,  $\lambda_5$ }



## Initialization:

• Define  $n \ r \times r$  cyclic Latin squares (nr = N)

$$\mathbf{M}_0, \mathbf{M}_1, \cdots, \mathbf{M}_{n-1}$$

• Specify an  $N \times N$  Latin square **A** with  $n^2 r \times r$  blocks  $\mathbf{A}_{ab} = \mathbf{M}_{[a+b]_n}$ 

#### • Construct an AWG network according to A:

- S1. Central stage construction
- S2. Upper-layer stage construction
- S3. Lower-layer stage construction





- Specify n r × r cyclic Latin squares, where
   N = 6, n = 2, r = 3
  - $\mathbf{M}_0$  is defined on  $\lambda$ -set  $\Lambda_0 = \{\lambda_0, \lambda_1, \lambda_2\}$
  - $\mathbf{M}_1$  is defined on  $\lambda$ -set  $\Lambda_1 = \{\lambda_3, \lambda_4, \lambda_5\}$

|   | λ <sub>0</sub>   | λ1             | λ <sub>2</sub>   |                |                | λ <sub>3</sub>   | $\lambda_4$ | $\lambda_5$     |    |
|---|------------------|----------------|------------------|----------------|----------------|------------------|-------------|-----------------|----|
|   | $\lambda_1$      | λ2             | λ <sub>0</sub>   | M <sub>0</sub> | $\mathbf{M}_1$ | $\lambda_4$      | $\lambda_5$ | $\lambda_3$     |    |
|   | λ <sub>2</sub>   | λ <sub>0</sub> | $\lambda_1$      |                |                | $\lambda_5$      | $\lambda_3$ | $\lambda_4$     |    |
| { | λ <sub>0</sub> , | $\lambda_1$    | , λ <sub>2</sub> | }              | {,             | λ <sub>3</sub> , | $\lambda_4$ | ,λ <sub>5</sub> | ;} |

Initialization: Specify A



 $\lambda_2$ 

 $\lambda_0$ 

 $\lambda_1$ 

 $\lambda_5$ 

 $\lambda_{A}$ 

 $\lambda_3$ 

 $\lambda_5$ 

 $\lambda_0$ 

 $\lambda_{1}$ 

 $\lambda_2$ 

 $\lambda_4$ 

ι5

 $\lambda_3$ 

 $\lambda_1$ 

 $\lambda_{21}$ 

 $\lambda_0$ 

 $\lambda_5$ 

 $\lambda_3$ 

 $\lambda_4$ 

 $\lambda_2$ 

 $\lambda_0$ 

 $\overline{\lambda}_1$ 





# S1. Central Stage Construction



• Layout  $n^2 r \times r$  AWGs from left to right

• Label *k*th AWG by A(a, b) and associate it with  $\mathbf{A}_{ab}$ , where  $a = \lfloor k/n \rfloor$ ,  $b = \lfloor k \rfloor_n$ 



# S1. Central Stage Construction



#### • Layout $n^2 r \times r$ AWGs from left to right

• Label *k*th AWG by A(a, b) and associate it with  $\mathbf{A}_{ab}$ , where  $a = \lfloor k/n \rfloor, b = \lfloor k \rfloor_n$ 





- Layout *N* DeMuxs at upper layer
- If *i*th row of **A** is  $\alpha$ th row of  $A_{ab}$ output *b* of DeMux *i*  $\leftrightarrow$  upper port  $\alpha$  of A(a, b)





- Layout N DeMuxs at upper layer
- If *i*th row of **A** is  $\alpha$ th row of  $A_{ab}$  ( $b = 0 \sim n 1$ ) output **b** of DeMux  $i \leftrightarrow$  upper port  $\alpha$  of A(a, b)





- Layout N DeMuxs at upper layer
- If *i*th row of **A** is  $\alpha$ th row of  $A_{ab}$  ( $b = 0 \sim n 1$ ) output **b** of DeMux  $i \leftrightarrow$  upper port  $\alpha$  of A(a, b)





• Layout *N* DeMuxs at upper layer

 $P_0$ 

 $P_1$ 

 $P_2$ 

 $P_3$ 

 $P_4$ 

 $P_5$ 

• If *i*th row of **A** is  $\alpha$ th row of  $\mathbf{A}_{ab}$  ( $b = 0 \sim n - 1$ )

output *b* of DeMux *i*  $\leftrightarrow$  upper port  $\alpha$  of A(a, b)





- Layout *N* Muxs at lower layer
- If *j*th col of **A** is  $\beta$ th col of **A**<sub>*ab*</sub>
  - input *a* of Mux  $j \leftrightarrow$  lower port  $\beta$  of A(a, b)







- Layout N Muxs at lower layer
- If *j*th col of **A** is  $\beta$ th col of  $A_{ab}$  ( $a = 0 \sim n 1$ ) input **a** of Mux  $j \leftrightarrow$  lower port  $\beta$  of A(a, b)





- Layout N Muxs at lower layer
- If *j*th col of **A** is  $\beta$ th col of  $A_{ab}$  ( $a = 0 \sim n 1$ ) input **a** of Mux  $j \leftrightarrow$  lower port  $\beta$  of A(a, b)



- Layout *N* Muxs at lower layer
- If *j*th col of **A** is  $\beta$ th col of **A**<sub>*ab*</sub> ( $a = 0 \sim n 1$ )
  - input *a* of Mux *j*  $\leftrightarrow$  lower port  $\beta$  of *A*(*a*, *b*)







## • $N \times N \mathcal{N}_B => N \times N \mathcal{N}_C(n, r)$ , where nr = N



 $\mathcal{N}_B$ 

 $\mathcal{N}_{C}(3,2)$ 



# Outline

- Background
- Preliminaries
- Modular AWG-based Interconnection
  - Phase 1: AWG Decomposition
  - Phase 2: Wavelength Reuse
- Application to Data Center Networks
- Conclusion

# Mux/DeMux Replacement





|       | $Q_0$       | $Q_1$       | $Q_2$       | $Q_3$          | $Q_4$       | $Q_5$                  |
|-------|-------------|-------------|-------------|----------------|-------------|------------------------|
| $P_0$ | λο          | $\lambda_1$ | λ2          | λ3             | $\lambda_4$ | $\lambda_5$            |
| $P_1$ | λ           | λ2          | $\lambda_0$ | λ4             | $\lambda_5$ | $\lambda_3$            |
| $P_2$ | λ2          | $\lambda_0$ | $\lambda_1$ | $\lambda_5$    | $\lambda_3$ | $\lambda_4$            |
| $P_3$ | $\lambda_3$ | $\lambda_4$ | $\lambda_5$ | λο             | $\lambda_1$ | $\lambda_2$            |
| $P_4$ | λ4          | $\lambda_5$ | $\lambda_3$ | λ              | λ2          | λ <sub>0</sub>         |
| $P_5$ | $\lambda_5$ | $\lambda_3$ | $\lambda_4$ | λ <sub>2</sub> | λ           | $\overline{\lambda}_1$ |



# Network After Mux Replacement



Connections passing through different AWGs are link-disjoint => reuse the same λ-set





## • $\mathcal{N}_D(n,r)$ where n = 2 and r = 3



 $\mathcal{N}_D(n,r)$  in General





# Comparison





#### Cabling complexity: $O(N^2)$ Number of required wavelengths: O(1)





# Outline

- Background
- Preliminaries
- Modular AWG-based Interconnection
  - AWG Decomposition
  - Wavelength Reuse
- Application to Data Center Networks
- Conclusion





#### • A 2-D 16,384-node DC network



Z. Zhu, S. Zhong, L. Chen, and K. Chen, "Fully programmable and scalable optical switching fabric for petabyte data center", Opt. Express, vol. 32, no. 3, pp. 3563-3580, Feb. 2015.

# AWG-based Interconnection Scheme



# • $\mathcal{N}_D(4,32)$ : 32 × 32 AWGs in the central stage





# **Physical Layer Performance**



### Power penalty (~0.7dB) is very small



## If Network is very Large ...



### Each central AWG is replaced by an integrated AWG-network module



# Conclusions



 AWG-based interconnection networks is proposed for DC networks

- Substantially reduce cabling complexity
- Only employ small-size AWG modules to avoid
  - serious in-band crosstalk
  - difficult synthesis technology
- Reuse same wavelength set, such that
  - number of required wavelengths is small
- Feasibility is confirmed by Physical-layer performance evaluations





# Thank you for your attention!