
Synthesis on the ChatGPT wave

What is ChatGPT?

ChatGPT is a conversational agent: you type a question or a language-related task with your keyboard, and the agent uses artificial intelligence to interpret your text and respond in natural language. This kind of AI is called a large language model (LLM).

An AI consists of several layers. The first layer of ChatGPT is GPT-3 (in its GPT-3.5 version), a model with 175 billion parameters, which is why its inner workings are so hard to understand. We can still probe this sprawling, opaque machine by testing it (for example, by checking its biases, such as the under-representation of certain social groups).

This language model is a statistical model: it does not reason. It composes its sentences by choosing the words most likely to follow the preceding ones, based on what it learned during training. Hence the importance of training data.

Who created it?

OpenAI, a leading American artificial intelligence company founded in 2015 by, among others, Reid Hoffman (co-founder of LinkedIn), Peter Thiel (co-founder of PayPal and venture capitalist, openly libertarian and transhumanist) and Elon Musk, who left the company's board of directors in 2018.

Microsoft has invested billions of dollars over several rounds and finally licensed the model exclusively in late 2022. Microsoft is most certainly counting on a return on investment, and OpenAI will probably not stay so open for long 😉 ChatGPT was launched in open access in November 2022. It is free with registration (see the business model below).

As you will discover in the following chapters, AIs need a large amount of data to be trained. It is thanks to the quantitative explosion of digital data with the multiplication of connected devices including mobile, the massive use of social networks, the rise of the Internet of Things, and the progress of storage capacities that AIs could be fed.

How does ChatGPT work?

The agent needs a large amount of data, mostly textual data written by humans. These data have been collected to train the AI of the agent.

The training is done in several phases:

  • The first phase is the construction of the language model, which required about 300 billion words (note that ChatGPT is available in several languages). The language model predicts the word that comes next according to regularities it has observed in large quantities of text. For example, when you write a WhatsApp message, suggestions appear as you type. These words are suggested according to their known probability of appearing after the previous word. It is therefore a matter of statistics: the AI has no understanding of what it writes.
  • The second phase is an education phase (fine-tuning) in which humans intervene. Many annotators guide the model during its training or testing phase. It should be noted that these annotators are often people located in developing countries who are paid $2 per hour…
  • In the early stages of use, annotators rewrite and correct answers to perfect the model. This is the reinforcement learning phase of the AI (reinforcement learning from human feedback). If there is a notion of ethics, this is the phase where it can come into play. But it is the ethics of the developers, the trainers or those who give them their instructions. For example, when ChatGPT is asked “tell me a joke about blondes”, it refuses and answers that this is considered discriminatory, as it would be for any specific group. It is not the learning phase that allowed ChatGPT to generate this answer: it was programmed by the AI developers.

In other words, the AI learned to imitate the texts from the web that it was trained on. I insist that this language model is a statistical model: it does not reason, it composes its response sentences according to the highest probability of one word appearing after another. Hence the importance of training data.
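To make the idea of "the most probable word after another" concrete, here is a deliberately tiny sketch in Python. It is not how ChatGPT is implemented: real language models use neural networks and take much more than the previous word into account. The toy corpus and the function names are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers


def most_probable_next(followers, word):
    """Return the most frequent follower of `word` and its estimated probability."""
    counts = followers.get(word.lower())
    if not counts:
        return None, 0.0  # word never seen during "training"
    next_word, count = counts.most_common(1)[0]
    return next_word, count / sum(counts.values())


# Hypothetical training data: three short sentences.
toy_corpus = [
    "the cat sleeps on the sofa",
    "the cat eats the fish",
    "the dog sleeps on the rug",
]

model = train_bigram_model(toy_corpus)
suggestion, probability = most_probable_next(model, "the")
print(suggestion, round(probability, 2))  # cat 0.33
```

The model suggests "cat" after "the" only because that pair is the most frequent one in its training sentences. Change the corpus and the suggestion changes, which is exactly why the training data matters so much.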

Where do the training data come from?

  • Data downloaded from the web, e.g. from Wikipedia, digitized books, social networks, etc., up to 2021. Was there any fake news in the training data? Impossible to say for sure, but since social networks are part of the sources, it is very likely… Was the reinforcement learning phase sufficient to exclude this fake news? Impossible to say…
  • ChatGPT is not directly connected to the Internet. However, Microsoft has indicated that the version that will be integrated into its search engine Bing will be.
  • These training sources are not explicitly named by the creators, either in ChatGPT's answers or on its presentation site. This lack of transparency is problematic in relation to the following point.
  • Training data determines the quality of an AI’s responses. When we look at the spelling, grammar and syntax of ChatGPT’s answers, we can conclude that its data sources were quite good in this respect (without being extraordinary either). The answers are very convincing from a writing point of view, but their content may be wrong.

What is the business model of ChatGPT?

Design, training, annotations, servers (high power consumption) and cloud services (Google Cloud reportedly costs OpenAI 120 million per year) represent hundreds of millions of dollars of investment… Such heavy spending is presumably justified by the lead it gives over the competition.

But what about profitability? Historically, the AI of large digital platforms has been used for ad placement. Will ChatGPT eventually follow the same path? At its launch in November 2022, ChatGPT was free (with registration). But since February 10, 2023, a ChatGPT Plus version has been available for $20/month. This version offers access even at peak times, faster response times and priority access to new features.
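To get an order of magnitude, here is a back-of-the-envelope calculation in Python using only the two figures quoted in this section. It is a rough sketch, not a financial model: it assumes the 120 million is in US dollars, counts only that one cost item, and ignores taxes, payment fees and every other expense.

```python
ANNUAL_CLOUD_COST = 120_000_000  # cloud figure quoted above; currency assumed to be USD
PLUS_PRICE_PER_MONTH = 20        # ChatGPT Plus subscription price in USD


def subscribers_to_cover(annual_cost, monthly_price):
    """Number of year-round subscribers needed to cover an annual cost."""
    revenue_per_subscriber = monthly_price * 12
    # Ceiling division: a fraction of a subscriber still means one more subscriber.
    return -(-annual_cost // revenue_per_subscriber)


print(subscribers_to_cover(ANNUAL_CLOUD_COST, PLUS_PRICE_PER_MONTH))  # 500000
```

Under these assumptions, half a million paying subscribers would be needed just to cover that single cloud bill.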

David Chavalarias, research director at CNRS and author of Toxic Data, suggests that only the creators of ChatGPT will be able to tell whether a text was generated by ChatGPT or not. They will therefore be able to sell a detection service, one that will be very useful to social networks, for example, which risk being flooded with texts generated by ChatGPT.

What are the uses of ChatGPT?

For the moment, ChatGPT is available for testing via https://chat.openai.com by creating an account. These free tests, which are quite impressive, make it possible to train the AI further and certainly also to encourage the general public to accept the concept, ahead of future, more controversial applications 😉.

ChatGPT can be a great help whenever writing is involved. Just look at the number of LinkedIn posts offering tips on content creation, mailings or CRM optimization… The enthusiasm is there… and so are the fears (see the stakes below). Articles about ChatGPT are coming out every day in the media (and it was actually hard for me to stop exploring and write this article, as the pace of revelations is so relentless 😅).

ChatGPT can be useful for brainstorming, getting inspired, helping to formulate your thoughts, summarizing a long text (that you wrote yourself!), or drafting emails, articles or even lines of code. These drafts must, however, be checked and adapted to avoid spreading false information (see below the example of the Google Bard announcement). ChatGPT is not designed to search for the truth or to generate content ready to publish without review. Many claim that it saves time in writing… However, you also need time to check the validity of ChatGPT's answers, to rework the generated text and to personalize it. Otherwise you risk spreading false information or multiplying uniform, soulless texts with no SEO value.

We can say that ChatGPT makes “knowledge” available, which is a good thing, just as Wikipedia did when it was launched. But the “knowledge” generated by ChatGPT is not reliable, although it is coherent, plausible and sometimes very convincing. This “knowledge” needs to be fact-checked. Microsoft promises integration with its software suite: text generation in Word, Excel and PowerPoint, or code generation in Visual Studio… To be tested in due time.

The GPT-3 layer from OpenAI can be used under license by other companies for their own purposes. For example, Notion.so (a widely used digital platform that offers powerful tools for collaboration, organization and project management) integrates GPT-3 into its Notion AI.

Note that to use ChatGPT well, you need to know how to go about it, from both a technical and an ethical point of view. Formulating questions well is the key to getting better answers. A little advice: a question should include a precise task, a context, precise instructions and the expected style.
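As an illustration of this advice, here is a small Python sketch that assembles a question from those four elements. The function name, the labels and the sample wording are my own choices for the example. They are not an official template, and the resulting text is simply what you would paste into the chat window.

```python
def build_prompt(task, context, instructions, style):
    """Assemble a question from the four elements: task, context, instructions, style."""
    parts = [
        f"Task: {task}",
        f"Context: {context}",
        f"Instructions: {instructions}",
        f"Expected style: {style}",
    ]
    return "\n".join(parts)


# Hypothetical example: asking for a summary of a text you wrote yourself.
prompt = build_prompt(
    task="Summarize the text below in five bullet points.",
    context="The text is an internal report I wrote for my team.",
    instructions="Keep every figure exactly as written; do not add new facts.",
    style="Neutral, concise, no jargon.",
)
print(prompt)
```

Writing the four elements down separately is mostly a way of checking that none of them has been forgotten before you send the question.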

Who are the competitors?

Google will soon launch its own conversational agent, which will be called Google Bard. The differences:

  • This conversational agent will be connected to the Internet.
  • The training data will probably come from the whole Google ecosystem: Google Search, Google Books, YouTube, Google Drive and its documents, Google Car, etc. Imagine the extent of the data available to train Google Bard 😬.
  • In early February 2023, Google wanted to reassure its investors in the face of the ChatGPT surge, and announced the upcoming launch of Google Bard during its Live from Paris event. Sundar Pichai asked a question about the latest discoveries of the James Webb telescope, and one of the three answers was wrong, as astronomy enthusiasts noticed. This error did not reassure investors at all and led to a 7% drop in the value of Alphabet shares. Is this investor distrust justified? TechCrunch believes that Google is losing control of some of its most important products and services, such as YouTube and the advertising platform, because of a lack of leadership and clear direction from the company, as well as increasing pressure from regulators and competitors. Google faces significant challenges. However, Google is a very large and powerful company with a huge database, a rich source of data to feed one (or more) AIs.

Baidu also plans to launch its conversational agent, Ernie Bot, in March this year. Baidu is a search engine and also a leader in data mining, so it has more than enough data to train an artificial intelligence. Knowing that China is also an expert in collecting personal and sensitive data from its citizens, let’s hope that this data is not used to train this agent 😬.

Meta and Amazon are also in the running with their own AI research. To be continued.

What are the stakes?

All professions related to writing are challenged: writers, editors, translators, proofreaders, novelists, journalists, screenwriters, bloggers, and communication, advertising and marketing professionals (even troll farms 😜)… So are professions that produce many reports, articles or press releases, such as lawyers, doctors, scientists and teachers. Everyone is wondering about this new situation.

Some examples:

  • Regarding teaching, Swiss teachers got together at a seminar to find the Achilles heel of ChatGPT. Some of them felt helpless in the face of this new type of cheating. They looked for ways to catch cheaters and to re-evaluate their teaching; oral exams might become more frequent. Despite the concerns, some also see positive potential in ChatGPT, for example using the generated texts as a basis for discussion.
  • The journal Nature is concerned about the influence of such a tool on scientific publications. A study revealed that reviewers detected the use of ChatGPT in only 68% of publication abstracts (detection software did no better, with 66% of cases). Integrity of researchers, validity of publications, risk of hyper-publication (more publications = more notoriety): these fundamental principles are likely to be undermined…
  • ChatGPT is about to be integrated into MS Bing, and Bard will be integrated into Google Search. These search engines will become search and questioning engines, so the results pages will be turned upside down. On the web, the set of techniques used to improve the visibility and ranking of websites in search engine results is called SEO. Professionals in the field optimize the content, code and design of a site because the higher it appears in the list of results, the more visitors it gets. Adding a conversational agent to search engines risks focusing users on the generated answers, so that they no longer click on the links to the websites relevant to their search. Media sites have already faced this problem with Google News. The results page also includes sponsored results (advertisers pay to have their site appear above the natural results), and this sponsorship is an important source of revenue for search engines. How will they manage these conflicts between users and advertisers? So many challenges from a business and UX design point of view 🤯 This makes it easier to understand why platforms have not yet implemented this conversational agent functionality in their search engines.

The Swiss Confederation is considering these conversational agents with attention. The State Secretariat for Education, Research and Innovation has stated that the ability to use artificial intelligence in an appropriate way is crucial to participate in society and the world of work.

These tools exist and it is essential to understand how they work, to question their ethics (and ours when we use them), to know how to use them when needed, and to regulate them where necessary. So much food for thought for us at the Swiss Institute for Digital Responsibility.

My conclusion at this point


ChatGPT is a tool and should remain so: it is not a collaborator or a colleague. If you use ChatGPT (or any other conversational agent to come), never forget that its answers are formulated according to their probability of occurrence, not their truth. A verification and validation step on your part will be essential. And ask yourself about the ethics of your use of this technology before you use it 😉.
