Friday, August 18, 2023

Meta announces Voicebox, a generative model for multiple voice synthesis tasks

‘Flow Matching’

Voicebox is a generative model that can synthesize speech across six languages: English, French, Spanish, German, Polish and Portuguese. Like large language models (LLMs), it has been trained on a very general task that can be used for many applications. But while LLMs try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.
 
Replicating voices across languages, editing out mistakes in speech, and more

Unlike generative models that are trained for a specific application, Voicebox can perform many tasks that it has not been trained for. For example, the model can use a two-second voice sample to generate speech for new text. Meta says this capability can be used to bring speech to people who are unable to speak, or customize the voices of non-playable game characters and virtual assistants.

Voicebox also performs style transfer in different ways. For example, you can provide the model with two audio and text samples. It will use the first audio sample as style reference and modify the second one to match the voice and tone of the reference. Interestingly, the model can do the same thing across different languages, which could be used to “help people communicate in a natural, authentic way — even if they don’t speak the same languages.”

The model can also do a variety of editing tasks. For example, if a dog barks in the background while you’re recording your voice, you can provide the audio and transcript to Voicebox and mask out the segment with the background noise. The model will use the transcript to generate the missing portion of the audio without the background noise.

The same technique can be used to edit speech. For example, if you have misspoken a word, you can mask that portion of the audio sample and pass it to Voicebox along with a transcript of the edited text. The model will generate the missing part with the new text in a way that matches the surrounding voice and tone.

Model not released


There is growing concern about the threats of AI-generated content. For example, cybercriminals recently tried to scam a woman by calling her and using an AI-generated voice to impersonate her grandson. Advanced speech synthesis systems such as Voicebox could be used for similar purposes or other nefarious deeds, such as creating fake evidence or manipulating real audio.

“As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm,” Meta wrote on its AI blog. Due to these concerns, Meta did not release the model but provided technical details on the architecture and training process in an accompanying technical paper. The paper also describes a classifier model that can detect speech and audio generated by Voicebox, to mitigate the risks of misuse.

Thursday, August 10, 2023

The Year of NFC Wireless Charging


 

Why NFC wireless charging?

NFC technology is one of the most ubiquitous technologies on the planet: One-quarter of the world’s population already has access to it. According to a 2022 report by ABI Research, 85% of consumers use NFC technology, and nearly half prefer using their mobile phone or smartwatch to a contactless card for payment.
This pervasiveness is fueling the rise of NFC wireless charging in 2023 across mobile devices, including smartphones, smartwatches, fitness trackers and other wearables, as well as tablets and other handheld devices. Most of these already use NFC technology for data transfer and communication.
The drive toward NFC wireless charging started three years ago, when the NFC Forum released a wireless charging specification that transforms smartphones and other NFC-enabled devices into portable charging stations. This means that NFC-enabled devices can wirelessly charge small battery-powered IoT devices without cables, at 1 W initially and eventually at up to 3 W of power. These devices include fitness watches, wireless earbuds, digital stylus pens, smart glasses and others.
Aside from conventional charger-receiver setups, NFC wireless charging can occur when earbuds are placed on top of an NFC-enabled smartphone or when an active digital stylus pen is placed on its tablet device. In the latter example, the Universal Stylus Initiative (USI) in collaboration with the NFC Forum piloted NFC wireless charging and rolled it into the specification for its digital stylus. USI is the non-proprietary active digital stylus technology leader and the only open digital stylus standard available. It is standard on Google Chromebooks.
The benefits of NFC wireless charging stem from the ability of the small NFC antenna in an NFC-enabled device to both communicate with and wirelessly charge the enabled devices. The benefits include:
• Convenience of charging
• Better accessibility, as the approach eliminates the need for cords
• Reliability, as there are no connectors to corrode
• Better sustainability through reduced e-waste with the removal of cables and disposable batteries
• Easier use for those with skin sensitivity to metal (up to 15% of the general population has a hypersensitivity to metal)
• Enabling the use of mobile phones for device programming and configuration

De facto standard for less than 3 W

The NFC Forum and industry analysts predict that this year, we’ll see the beginning of a parade of products with NFC wireless charging technology. Many of the major industry players are already rolling out chipsets for NFC wireless charging, and the opportunity is significant.
Michael Stark, principal standards architect at NXP Semiconductors, was an author of the NFC wireless charging specification and is vice chair of the NFC Forum Board of Directors. “We created the NFC wireless charging specification to be a simple, certified technology to wirelessly charge small IoT devices using NFC when cables are not an option,” he said. “Our goal is for NFC wireless charging to become the wireless charging de facto standard for powering small IoT devices up to 3 W.”

Complements Qi

NFC wireless charging complements the Wireless Power Consortium’s Qi wireless charging standard and strengthens the overall wireless charging ecosystem. Qi is focused on wireless charging above 5 W, while NFC wireless charging currently targets 250- to 1,000-mW charging, with the potential to increase support to at least 3 W.
Qi and NFC already complement each other as an in-console solution. Qi chargers typically also implement NFC card protection. Moreover, this NFC interface could wirelessly charge items like automobile key fobs.
The enterprise device market is also a significant opportunity for NFC wireless charging technology, bringing advantages and design savings.
NFC wireless charging could serve as a complementary power-charging source and could help to address many enterprise-market use cases once it matures above 3 W, according to Marek Trusinski, engineering fellow, solutions architect, technology lead and NFC subject matter expert at Zebra Technologies. This is one of the reasons that plans call for expanding NFC wireless charging beyond 3 W to perhaps 5 W in the future, as enterprise devices typically have higher charging requirements.
In his latest report on the global wireless power market, Dinesh Kithany, chief analyst at Wired and Wireless Technology, predicts robust growth for the high-frequency–based wireless charging market, including NFC charging, with rapid adoption and a forecast of 100 million receiver units by 2026. Given the pervasiveness of wireless power transmitter units, such as smartphones and/or tablets enabled with NFC, the opportunity for NFC charging could double the shipment of receiver units.

Chip vendors lead the way

STMicroelectronics and NXP are among the NFC wireless charging pioneers. Products in ST’s ST25 family of NFC and RFID tag and reader ICs feature NFC wireless charging for a variety of NFC-enabled solutions.

“NFC wireless charging provides easy integration into small applications through tiny antenna footprints, with targeted antenna dimensions ranging down to below 0.15 in.² [1 cm²],” said Peter Neger-Loibner, NFC product marketing manager at STMicroelectronics. “Using PCB or flex-PCB antennas, these products can be an ideal fit for complex form factors. It is this convenience and versatility that make NFC wireless charging so useful for consumers and easy to integrate for developers.”

Panthronics AG also added wireless charging functionality to its product line. Its next-generation system solutions include PTX130W, an NFC wireless charging poller IC, and PTX30W, a fully integrated listener IC.

“Our goal is to simplify NFC wireless charging technology, making things easier for our customers and accelerating its adoption into new applications,” said Alessandro Goitre, director of product management at Panthronics. “This does not mean we compromise on performance; we continue to push the boundaries on what is achievable. The appetite for leveraging the benefit of NFC wireless charging continues to grow. I think we’re seeing the beginning of what will prove to be an amazing era for NFC-based technology in digital styluses, wearables, smart glasses and [smart] rings, hearables and medical sensors. All these devices will improve the customer experience through NFC wireless charging.”

Infineon Technologies is developing a new NFC wireless charging turnkey solution with an emphasis on the power density (received power/PCB size) and the integration of system functions, such as thermal control and battery charging.

“Our goal is to help customers achieve a very cost-effective implementation of NFC functions,” said Qi Zhu, director of product marketing and business development at Infineon. “Reducing the total ownership cost is one aspect. The PCB size limitation of compact devices is another reason customers are asking for a higher level of integration.”

Cooperation makes a better future possible

It takes an enormous amount of work to set a firm foundation for an entire industry. The NFC Forum brings together more than 300 organizations, with forum members collaborating with dozens of technical organizations to support the NFC ecosystem.

“To ensure a use case can adopt a certain technology where many different players are involved, you need a reliable base that enables interoperability among the different providers, and it must be future-proof,” NFC Forum’s Stark said. “Worldwide standardization is the only way to do that, and that’s exactly what the NFC Forum brings to NFC wireless charging.”


#MediaIndustry
#DigitalMedia
#MassCommunication
#MediaTrends
#MediaStudies
#MediaTech
#Broadcasting
#Journalism
#SocialMedia
#ContentCreation


Wednesday, August 9, 2023

Neural vocoder and its application in speech recognition

What is a Neural Vocoder?

Speech synthesis is the artificial recreation of human speech with a computer or other technology. The opposite of speech recognition, it is typically used to convert text into audio in applications such as voice-enabled services and mobile apps. It is also used in assistive technology to help vision-impaired people read written material.
A vocoder is a signal processing device that synthesizes a voice waveform from a feature representation. Classic vocoder parameters are motivated by an underlying speech production model and include suitable encodings of the fundamental frequency, spectral envelope, and other factors. Feature sequences are typically produced on a considerably coarser temporal scale than the target audio. For speech synthesis, the analysis step can be replaced with a generator that predicts the vocoder parameters directly.
Neural vocoders are a common component in speech synthesis pipelines that transform spectral representations of an audio stream into waveforms. A neural vocoder is primarily concerned with generating waveforms from low-dimensional representations such as Mel-spectrograms.
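To make that input representation concrete, here is a minimal NumPy sketch of the kind of magnitude spectrogram (before Mel binning) that a vocoder learns to invert. The frame length, hop size, and 440 Hz test tone are arbitrary choices for illustration, not values from any particular system:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Frame the signal, apply a Hann window, and take the magnitude of
    the real FFT of each frame: a basic time-frequency representation."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=-1))  # (n_frames, frame_len//2 + 1)

# A 440 Hz tone sampled at 8 kHz: its energy concentrates in the FFT bin
# nearest 440 Hz (bin spacing is 8000/256 = 31.25 Hz, so bin 14).
sr = 8000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = int(spec.mean(axis=0).argmax())
```

A real pipeline would additionally map the 129 linear-frequency bins onto a small number of Mel bands; the inversion of that lossy mapping is exactly what the neural vocoder learns.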
There are three main categories of neural vocoders: autoregressive models, GAN-based models, and diffusion models.

Autoregressive Models

Autoregressive models are built as probabilistic models that predict the distribution of each waveform sample conditioned on the previous samples. This enables the generation of a high-quality, natural-sounding voice signal. However, overall synthesis speed is slow compared with other approaches because of the sample-by-sample generation procedure. Two commonly used autoregressive models are WaveNet and WaveRNN.
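The sample-by-sample procedure can be illustrated with a toy sketch. The conditional distribution below is a hypothetical stand-in for a trained network (not real WaveNet weights): each quantized sample is drawn from a categorical distribution conditioned on the previous sample, so generation is inherently sequential:

```python
import numpy as np

def sample_autoregressive(steps, n_levels=8, seed=0):
    """Toy autoregressive sampler: each quantized sample is drawn from a
    categorical distribution conditioned only on the previous sample."""
    rng = np.random.default_rng(seed)
    samples = [n_levels // 2]          # start from the middle level
    for _ in range(steps - 1):
        prev = samples[-1]
        # Hypothetical conditional model: prefer staying near the
        # previous quantization level (softmax over negative distances).
        logits = -np.abs(np.arange(n_levels) - prev).astype(float)
        probs = np.exp(logits) / np.exp(logits).sum()
        samples.append(int(rng.choice(n_levels, p=probs)))
    return samples

wave = sample_autoregressive(100)
```

Because each draw waits for the previous one, a real model must run its network once per audio sample, which is why autoregressive vocoders are slow at 16,000+ samples per second.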

GAN Models

GAN-based vocoders have consistently outperformed autoregressive models in terms of speed and quality of synthesised speech. They use the basic principle of GANs, employing a generator to represent the waveform signal in the time domain and a discriminator to improve the quality of the generated speech. Various GAN-based vocoder variations have been introduced. MelGAN and Parallel WaveGAN are two examples of representative models.
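The adversarial objective can be illustrated with a least-squares variant, shown here only for simplicity (actual GAN vocoders use various losses, and the discriminator scores below are made-up numbers): the discriminator pushes real scores toward 1 and generated scores toward 0, while the generator pushes its scores toward 1:

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives: the discriminator targets 1 for
    real waveforms and 0 for generated ones; the generator targets 1."""
    d_loss = np.mean((d_real - 1) ** 2) + np.mean(d_fake ** 2)
    g_loss = np.mean((d_fake - 1) ** 2)
    return d_loss, g_loss

# Made-up discriminator scores for two real and two generated clips.
d_loss, g_loss = lsgan_losses(np.array([0.9, 1.1]), np.array([0.1, -0.1]))
```

Here the discriminator is already doing well (low d_loss), so the generator's loss is large, driving it to produce waveforms the discriminator scores closer to 1.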
Diffusion Models

Diffusion probabilistic models are another class of generative model, comprising two primary processes: diffusion and reversal. The diffusion process is a Markov chain in which Gaussian noise is gradually added to the original signal until it is destroyed. The reverse process is a denoising process that gradually removes the added Gaussian noise and recovers the original signal. Two commonly used diffusion-based vocoders are WaveGrad and DiffWave.
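The forward (diffusion) process has a convenient closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the cumulative product of (1 - beta_i). A NumPy sketch with a hypothetical linear noise schedule (the schedule and toy signal are illustrative choices, not taken from WaveGrad or DiffWave):

```python
import numpy as np

def diffuse(x0, t, betas, rng):
    """Sample x_t from the closed-form forward diffusion process.

    As t grows, alpha_bar shrinks and the clean signal is progressively
    replaced by Gaussian noise."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise, alpha_bar

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.05, 50)          # hypothetical linear schedule
x0 = np.sin(np.linspace(0, 4 * np.pi, 64))   # clean toy "waveform"
x_early, ab_early = diffuse(x0, 0, betas, rng)    # barely perturbed
x_late, ab_late = diffuse(x0, 49, betas, rng)     # mostly destroyed
```

The vocoder is trained to run this process in reverse: given x_t and the step t (conditioned on a Mel-spectrogram), predict the noise to remove.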

Where is it used?

The majority of speech-generation algorithms, such as text-to-speech and voice conversion, do not create waveforms directly. Instead, the models output acoustic features such as Mel-spectrograms or F0 contours. Traditionally, waveforms were reconstructed from acoustic or linguistic features using heuristic algorithms or hand-designed vocoders. However, the quality of the produced speech is severely constrained and degraded by the assumptions behind the heuristic approaches.


Since Google’s Tacotron 2 used WaveNet as a vocoder to create waveforms from Mel-spectrograms, neural vocoders have steadily become the most used vocoding approach for voice synthesis. Nowadays, neural vocoders have mostly superseded previous heuristic approaches, significantly improving the quality of produced speech. WaveNet creates high-quality waveforms but requires a considerable inference time owing to its autoregressive design.

How does it work?

A neural vocoder has two primary units: a signal generation unit and a signal processing unit. The signal processing unit analyzes the audio signal received as input, and this analysis makes the whole model work as an autoencoder: the model is fed control parameters (such as the fundamental frequency) derived from a particular signal, which it must recreate by adjusting the signal model parameters. Notably, the signal parameters derived from an input signal may be modified at inference time before being delivered to the front end (signal generation). The signal generation unit contains a DNN that controls the parameters of signal generation. If the DNN controller is trained with enough examples, and the input parameter combinations remain within reasonable ranges, it should be able to map the input parameters to the most appropriate signal model parameters.
This combination of signal processing and signal generation is widely used for two reasons. First, it addresses a long-standing issue: high-level control of complex signal processing models. Second, forcing some structure into the DNN allows the model to be trained with less data. The fundamental challenge is to design a structure that is well fitted to the target domain without limiting the model's expressivity.
The source-filter architecture is a strong candidate for dividing a vocoder into signal processing operators and DNN modules. Adding a filter component to a neural vocoder does not restrict the signals the vocoder can represent. Linear predictive coding (LPC), which can be applied to arbitrary input signals, is one of the most fundamental source-filter models.
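As a concrete sketch of the filter half, here is the textbook autocorrelation method with the Levinson-Durbin recursion for estimating LPC coefficients (a classic algorithm, not code from any particular vocoder). Fitting an order-2 model to data generated by a known AR(2) process should approximately recover its coefficients:

```python
import numpy as np

def lpc(signal, order):
    """Estimate LPC coefficients a[0..order] (with a[0] = 1) by the
    autocorrelation method and Levinson-Durbin recursion."""
    n = len(signal)
    r = np.array([signal[: n - k] @ signal[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1 : 0 : -1]   # sum_j a[j] * r[i-j]
        k = -acc / err                            # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1 : 0 : -1]   # update previous coeffs
        a[i] = k
        err *= 1.0 - k * k                        # residual prediction error
    return a, err

# Synthetic AR(2) source: x[t] = 0.5 x[t-1] - 0.3 x[t-2] + e[t].
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for t in range(2, len(x)):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + e[t]
a, err = lpc(x, order=2)   # expect a close to [1, -0.5, 0.3]
```

In an LPC-based neural vocoder, a filter of this kind shapes the spectral envelope while the DNN supplies the source (excitation) signal.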





Application of Neural vocoder in speech recognition

In general, speech recognition aims to enable spoken communication between humans and computers. The goal is to combine information from several fields of expertise to arrive at an appropriate interpretation of an acoustic message despite potential ambiguities and inaccuracies.
The best-known neural vocoder used in acoustic speech processing is WaveNet, an autoregressive convolutional neural network model. WaveNet processes raw audio represented as a sequence of 16-bit samples. The problem with 16-bit raw audio is its quantization range: each sample can take one of 65,536 values, which makes modeling with a softmax distribution expensive. So the first step in the WaveNet pipeline is to compress the audio to 8 bits, one of 256 values per sample, using the μ-law companding transformation.
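The μ-law transform itself is standard (G.711-style companding with μ = 255). A NumPy sketch of the encode/decode round trip:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a waveform in [-1, 1] to mu+1 discrete levels using
    f(x) = sign(x) * ln(1 + mu|x|) / ln(1 + mu), then quantize."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)  # 0..255

def mu_law_decode(levels, mu=255):
    """Invert the quantization back to an approximate waveform."""
    y = 2 * (levels.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 11)
levels = mu_law_encode(x)      # 8-bit levels fed to the softmax
recovered = mu_law_decode(levels)
```

The companding is logarithmic, so quantization error is small for the quiet samples that dominate speech; a 256-way softmax over these levels is far cheaper than one over 65,536 raw values.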
WaveNet is built from stacks of causal convolution layers: the raw audio passes through a causal convolution layer and then through gated activation units, which act as filters controlling how much information is passed to the next layer. The output of these layers is fed through ReLU layers, and finally the signal is processed through a softmax distribution over the quantized sample values.
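A minimal NumPy sketch of those two building blocks, with illustrative weights rather than a trained network: causal convolution implemented by left-padding, and the gated tanh/sigmoid activation:

```python
import numpy as np

def causal_conv1d(x, weights, dilation=1):
    """Causal dilated 1-D convolution: the output at time t depends only
    on x[t], x[t - dilation], x[t - 2*dilation], ... (left-padded input)."""
    taps = len(weights)
    pad = (taps - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return sum(w * xp[pad - i * dilation : pad - i * dilation + len(x)]
               for i, w in enumerate(weights))

def gated_activation(x, w_filter, w_gate, dilation=1):
    """WaveNet-style gated unit: a tanh 'filter' path multiplied by a
    sigmoid 'gate' path, which scales how much information flows on."""
    f = np.tanh(causal_conv1d(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-causal_conv1d(x, w_gate, dilation)))
    return f * g

x = np.array([1.0, 0.0, 0.0, 0.0])              # an impulse
y = causal_conv1d(x, [0.5, 0.25], dilation=2)   # taps at lags 0 and 2
```

The impulse response shows causality directly: the output is nonzero only at the current sample and two steps later, never before the input arrives.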

Conclusion

Neural vocoders based on deep neural networks can produce human-like voices, unlike standard approaches, which suffer from audible artefacts. A neural vocoder is also distinct from the autotuner used for pitch-correcting voices. This article has introduced the neural vocoder and its application in speech recognition.




Lightweight aerial image object detection algorithm based on improved YOLOv5s


 


Abstract

YOLOv5 is one of the most popular object detection algorithms; it is divided into multiple series according to network depth and width. To enable deployment on mobile or embedded devices, this paper proposes a lightweight aerial image object detection algorithm (LAI-YOLOv5s) based on improvements to YOLOv5s, with a relatively small amount of computation and parameters and relatively fast inference speed. Firstly, to better detect small objects, the paper replaces the minimum detection head with the maximum detection head and proposes a new feature fusion method, DFM-CPFN (Deep Feature Map Cross Path Fusion Network), to enrich the semantic information of deep features. Secondly, the paper designs a new module based on VoVNet to improve the feature extraction ability of the backbone network. Finally, based on the idea of ShuffleNetV2, the paper makes the network more lightweight without affecting detection accuracy. On the VisDrone2019 dataset, the detection accuracy of LAI-YOLOv5s on the mAP@0.5 metric is 8.3% higher than that of the original algorithm. Compared with other YOLOv5 series and YOLOv3 algorithms, LAI-YOLOv5s offers lower computational cost and higher detection accuracy.

Introduction

With the continuous application of UAVs in modern life, aerial photography has been widely used in civil and military fields. Object detection in aerial images is an important part of intelligent transportation systems: locating and tracking ground vehicle targets through aerial photography conveys ground traffic information more clearly and helps in building mature intelligent transportation systems. Because aerial images are large while the objects in them, such as vehicles, are small and dense, detection accuracy for this task is low [1]. Traditional vehicle detection methods for aerial images usually adopt a sliding-window approach, and during feature extraction, fixed-size windows and hand-crafted features often limit their detection accuracy [2]. In addition, compared with common object detection, the highly complex background and variable object appearance further increase the difficulty of object detection in aerial images [3].

Deep learning with nonlinear models has been widely used in object detection; it can transform input data features into more abstract features. Such algorithms automatically discover the features needed for classification or detection tasks and have powerful representation and learning capabilities. In 2015, He et al. [4] proposed the ResNet residual network, which uses cross-layer connections to improve network performance and increase depth without degrading error. In 2014, the R-CNN algorithm proposed by Girshick et al. [5] used a proposal-box extraction method to segment the input image into multiple modules and merge them according to similarity information, obtaining about 2,000 candidate regions of different sizes. This is a two-stage object detection method, with slower detection speed and poorer real-time performance. Single-stage object detection methods were therefore proposed, which directly produce the final output from the original image. The YOLOv1 object detection algorithm proposed by Joseph Redmon et al. [6] in 2015 treated object detection as a regression problem and removed the candidate-box extraction branch; its detection speed was far faster than that of two-stage algorithms. YOLOv3 [7], proposed by Joseph Redmon et al. in 2018, used the more effective Darknet-53 as the backbone network, adopted multi-scale fusion prediction based on FPN [8], and used feature maps at three scales to detect objects. Bochkovskiy et al. [9] proposed YOLOv4 in 2020, which adopted image data augmentation at the input, carried out multi-channel feature fusion based on PANet [10], and adopted CIoU as the position loss function for the regression box, greatly improving both detection speed and accuracy.

The work in this paper includes three parts:

1. To better detect the small, dense objects in such detection tasks, the paper proposes a new feature fusion network, DFM-CPFN (Deep Feature Map Cross Path Fusion Network). The medium-size detection head in the original algorithm is replaced by the largest-size detection head after two upsampling operations, and the result is fused with features from the backbone network, which enriches the location information of the deep features.

2. To solve the problem of gradient disappearance caused by network deepening, the paper designs a VB module based on VoVNet [20] to improve the backbone network. While retaining the residual structure, the outputs of multiple convolutional layers are concatenated at the end, which better preserves the transmission of features and gradients and avoids feature redundancy.
3. Due to their high computational cost, object detection algorithms are difficult to deploy on mobile devices with limited performance. To solve this problem, the paper designs the C3SFN module based on ShuffleNetV2 [21], which makes the improved algorithm more lightweight and effectively reduces its computational cost.
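The key ShuffleNetV2 primitive behind such lightweight designs can be illustrated in a few lines of NumPy (a generic sketch of the channel-shuffle operation, not code from the paper): channels are regrouped so that information mixes across grouped-convolution branches at almost no computational cost:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Channel shuffle on a (C, H, W) feature map: reshape channels to
    (groups, C // groups), transpose, and flatten back, interleaving
    channels from different groups."""
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

# Tag each of 8 channels with its index to see the interleaving.
x = np.arange(8, dtype=float).reshape(8, 1, 1)
shuffled = channel_shuffle(x, groups=2)
order = shuffled[:, 0, 0].astype(int).tolist()
```

With two groups, channels [0..3] and [4..7] are interleaved to [0, 4, 1, 5, 2, 6, 3, 7], so the next grouped convolution sees features from both branches.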

YOLOv5s methods

Figure 1 shows the network structure of YOLOv5s. YOLOv5 is divided into multiple series such as s, m, and l by controlling the depth and width of the network; the series differ only in their scaling multiples. Series with deeper or wider networks achieve relatively good detection results, but their computational cost is also relatively high. Series with shallower networks have significantly reduced computational cost and faster detection speed, but relatively poor detection results. Based on the idea of CSPNet [22], the C3 module in the YOLOv5s backbone divides the feature map into two paths and merges them with a cross-stage hierarchical structure; this architecture realizes richer gradient combinations while reducing the amount of computation. The SPPF module, with its spatial pyramid pooling structure, borrows the idea of SPPNet [23]. The neck network of YOLOv5 performs multi-scale feature fusion based on PANet. Compared with FPN, PANet adds a bottom-up feature fusion path, and its output head adds a fully connected branch to improve the quality of the prediction mask.
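The SPPF idea can be sketched in NumPy (an illustrative re-implementation, not the YOLOv5 source): three chained stride-1 5x5 max pools whose outputs are concatenated with the input along the channel axis, approximating SPP's parallel 5/9/13 pooling at lower cost, since two chained 5x5 pools cover a 9x9 receptive field and three cover 13x13:

```python
import numpy as np

def maxpool_same(x, k=5):
    """Stride-1 max pooling with 'same' padding over a (C, H, W) map."""
    pad = k // 2
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)),
                constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i : i + k, j : j + k].max(axis=(1, 2))
    return out

def sppf(x, k=5):
    """SPPF: chain three k x k max pools and concatenate the input with
    every stage's output along the channel axis (C -> 4C channels)."""
    p1 = maxpool_same(x, k)
    p2 = maxpool_same(p1, k)
    p3 = maxpool_same(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=0)

x = np.random.default_rng(0).random((4, 8, 8))
out = sppf(x)   # shape (16, 8, 8)
```

In YOLOv5 a 1x1 convolution then projects the concatenated channels back down; the chained pooling is cheaper than SPP because each stage reuses the previous stage's maxima.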

LAI-YOLOv5s

Firstly, a new feature fusion network, DFM-CPFN (Deep Feature Map Cross Path Fusion Network), is proposed, which mitigates the loss of information about small targets in deep features. In addition, based on VoVNet and ShuffleNetV2, the paper designs two new modules, VB and C3SFN, respectively. These modules improve the feature extraction performance of the backbone network while keeping it lightweight. Compared with other object detection algorithms and other YOLOv5 series, ablation experiments show that the proposed algorithm not only has a more lightweight network model but also achieves better detection accuracy. Figure 2 shows the network structure of LAI-YOLOv5s.


