Perplexity.ai is an AI service that works similarly to other LLM-based tools such as ChatGPT, except that it provides links to verifiable sources for the information it returns.
In an era where data privacy is paramount, especially in sectors like healthcare and government, the need for secure, in-house data processing tools has never been more critical; and because LLMs hallucinate, you also need to confirm answers against verifiable sources. Recognizing this, I set out to enhance an existing AI tool, an open-source Perplexity.ai clone found here, adapting it for local model use and planning to integrate robust local full-text search with PostgreSQL. This blog post delves into that project.
Upgrading the Perplexity Clone: My journey began with a search for existing code, as most projects should. I found a Python clone of Perplexity, which I modified to run against a local Mistral-7B model. This enhancement ensures that data remains within the confines of your local environment, a setup that is crucial for handling sensitive information, particularly in the medical and government sectors where data privacy is non-negotiable. Note that in the current version the initial query is sent to Google and the results are indexed locally, so that follow-up queries run entirely locally; other data sources, such as Wikipedia, WikiMed, and Medical Stack Exchange dumps, can also be used.
Improving Data Scraping and Readability: I have also upgraded the existing Google scraping script, addressing the challenges posed by Readability.js. Occasionally the script fails to parse certain web pages, necessitating manual intervention; this upgrade aims to streamline the process and reduce the need for manual corrections.
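To illustrate the kind of fallback involved, here is a simplified sketch (not the actual upgrade) that extracts plain text with only the standard library when a readability-style parser returns nothing; the name extract_text_fallback is a hypothetical helper invented for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <noscript> contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a tag whose text we discard
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text_fallback(html: str) -> str:
    """Crude plain-text extraction for pages Readability cannot parse."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In the scraping loop below, something like this would run only when article["plain_text"] comes back empty, so the page still yields a document rather than being silently dropped.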
Future Plans for Local Full-Text Search with PostgreSQL: A future goal of this project is the integration of PostgreSQL full-text search over massive data sources such as those mentioned above. I am currently developing a script to load wiki XML dumps into a PostgreSQL database. This feature will allow users to perform comprehensive offline searches of wiki pages, so that not even the initial query touches the internet or Google, and to access references directly. It’s a game-changer for verifying facts and sourcing information securely.
Benefits of In-House Data Processing: One of the primary benefits of this software is the ability to process data internally, without relying on external entities. This approach is crucial for complying with data protection regulations and mitigating privacy concerns. By processing data in-house, organizations can avoid the legal and ethical complexities of obtaining additional permissions for data usage.
Future Applications: The applications of this enhanced AI model are vast and varied. From medical institutions leveraging it for patient data analysis to government bodies utilizing it for secure document processing, the possibilities are endless. Additionally, any organization prioritizing data protection can benefit from this technology.
Full Code Availability: Transparency and collaboration are key in the tech world. To this end, the full code for this project is available, inviting developers and enthusiasts to explore, modify, and potentially enhance its capabilities. I invite you to share your thoughts on potential uses for this code or to report any issues you encounter. Your insights and experiences are invaluable in refining and expanding the application of this tool.
The code provided below is licensed under the MIT Open Source License, allowing commercial use and modification. Running this project requires at least 32GB of RAM; calculating the embeddings for the initial search takes about 30 minutes to an hour, while each follow-up question that doesn’t require new documents to be indexed takes about two minutes. The MIT license is provided below:
MIT License
Copyright (c) 2024 Felix Farquharson
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
import requests
from bs4 import BeautifulSoup
import numpy as np  # Required to dedupe sites
from readabilipy import simple_json_from_html_string  # Required to parse HTML to pure text
from langchain.schema import Document  # Required to create a Document object
from langchain.chains import VectorDBQA  # Required to create a Question-Answer object over a vector store
import pprint  # Required to pretty print the results
from langchain.text_splitter import CharacterTextSplitter
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}
params = {
    "q": 'history of the human genome project',  # the initial query; follow-up questions relate to this subject
    "hl": "en",
    "gl": "uk",
    "start": 0,
}

page_limit = 1
page_num = 0
urls = []
while True:
    page_num += 1
    print(f"page: {page_num}")
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    for result in soup.select(".tF2Cxc"):
        link = result.select_one(".yuRUbf a")["href"]
        urls.append(link.split("#")[0])
    if page_num == page_limit:
        break
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

urls = list(np.unique(urls))
print(urls)

documents = []
for url in urls:
    req = requests.get(url, headers=headers, timeout=30)
    article = simple_json_from_html_string(req.text, use_readability=True)
    if article["plain_text"]:
        documents += [Document(page_content=article['plain_text'][0]['text'],
                               metadata={'source': url, 'page_title': article['title']})]
print(documents)

text_splitter = CharacterTextSplitter(separator=' ', chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
print(len(texts))

# Download the model first with: ollama pull tinyllama (or mistral for higher quality)
embeddings = OllamaEmbeddings(model="tinyllama")
llm = ChatOllama(
    model="tinyllama",
    temperature=0,
)
docsearch = Chroma.from_documents(texts, embeddings)

# First follow-up question
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
result = qa({"query": "Who were the main players in the race to complete the human genome? And what were their"
                      " approaches? Give as much detail as possible."})
pprint.pprint(result)

# Second follow-up question
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
result = qa({"query": "How were the donor participants recruited for the human genome project? "
                      "Summarize in three sentences."})
pprint.pprint(result)
Example output
Using the miniature tinyllama model on a Mac M3 processor, it has an execution time of one to two minutes.
{'query': 'Who were the main players in the race to complete the human genome? '
'And what were their approaches? Give as much detail as possible.',
'result': 'The main players in the race to complete the human genome were the '
'Human Genome Project (HGP), which was initiated by the US National '
'Institutes of Health (NIH) and the European Union (EU) in 1989, '
'and the International HapMap Consortium, which was formed in 2003 '
'to sequence the genomes of populations from around the world. The '
'HGP aimed to sequence the entire human genome, while the HapMap '
'project sequenced the genomes of populations from around the '
'world.\n'
'\n'
'The approaches taken by these projects were quite different. The '
'HGP was a large-scale, multinational effort that involved '
'thousands of scientists and researchers from various disciplines, '
'including genetics, biochemistry, molecular biology, and computer '
'science. The project was led by the NIH and the EU, with funding '
"provided by private companies and foundations. The HGP's goal was "
'to sequence the entire human genome in order to understand the '
'genetic basis of human health and disease.\n'
'\n'
'The HapMap project, on the other hand, was a smaller-scale effort '
'that involved only a few hundred scientists from various '
'institutions around the world. The project was led by the '
'University of California at Berkeley and the Broad Institute of '
'MIT and Harvard, with funding provided by private companies and '
"foundations. The HapMap's goal was to sequence the genomes of "
'populations from around the world in order to study genetic '
'variation among different ethnic groups.\n'
'\n'
'Both projects faced significant challenges and obstacles, '
'including technical difficulties, funding constraints, and '
'political and scientific controversies. However, despite these '
'challenges, both projects made significant progress towards '
"completing the human genome. The HGP's sequence of the entire "
"human genome was completed in 2015, while the HapMap project's "
'sequencing of populations from around the world was completed in '
'2013. Both projects have contributed significantly to our '
'understanding of human health and disease, as well as to advances '
'in medical research and clinical practice.',
'source_documents': [Document(metadata={'page_title': 'The Human Genome Project', 'source': 'https://plato.stanford.edu/entries/human-genome/'}, page_content='as a tool of governance, bioethics exercises power in ways often unseen, thereby foreclosing questions asked and debates had—for e.g., taking the boundary between facts and values as given not made, presenting rational moral arguments as outside politics to dismiss issues of public concern, or circumventing legislation by justifying the extension of existing regulations.'),
Document(metadata={'page_title': 'Human Genome Project (HGP) | History, Timeline, & Facts', 'source': 'https://www.britannica.com/event/Human-Genome-Project'}, page_content='carbohydrates and lipids. All these molecules work in concert to maintain the processes required for life. Are you a student? Get a special academic rate on Britannica Premium. Learn More Studies in molecular genetics led to studies in human genetics and the consideration of the ways in which traits in humans are inherited. For example, most traits in humans and other species result from a combination of genetic and environmental influences. In addition, some genes, such as those encoded at neighbouring spots on a single chromosome, tend to be inherited together, rather than independently, whereas other genes, namely those encoded on the mitochondrial genome, are inherited only from the mother, and yet other genes, encoded on the Y chromosome, are passed only from fathers to sons. Using data from the HGP, scientists have estimated that the human genome contains anywhere from 20,000 to 25,000 genes.'),
Document(metadata={'page_title': 'The human genome: 20 years of history - El·lipse', 'source': 'https://ellipse.prbb.org/the-human-genome-20-years-of-history/'}, page_content='such as genotype-tissue expression (GTEx), in which Guigó’s lab participates. The latter project attempts to understand the relationships between genome sequence variation between individuals and gene expression in different tissues. Although we still need to be able to explain “how the genomic variation of people has an impact on their function and results in differences such as, for example, the predisposition to have certain diseases,” concludes Guigó.'),
Document(metadata={'page_title': 'The Human Genome Project', 'source': 'https://plato.stanford.edu/entries/human-genome/'}, page_content='and potential funders, both public and private. This is particularly so in areas like genomics, where large amounts of sustained funding are required in order to achieve the hoped for scientific and translational goals” (p. 561). However, unlike Gibbs, Caulfield details possibilities of “real harm,” which include “potentially eroding public trust and support for science; inappropriately skewing research priorities and the allocation of resources and funding; creating unrealistic expectations of benefit for patients; facilitating the premature uptake of expensive and potentially harmful emerging technologies by health systems; misinforming policy and ethics debates; and accelerating the marketing and utilization of unproven therapies” (p. 567). The hype and hyperbole used to promote personalized (or precision) medicine carry the risks Caulfield mentions. Approaching 20 years since completion of the HGP, genome science has not revolutionized medicine or markedly improved human health.')]}
/Users/felix/PycharmProjects/offline_perplexity_clone/.venv/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py:290: UserWarning: `VectorDBQA` is deprecated - please use `from langchain.chains import RetrievalQA`
warnings.warn(
{'query': 'How were the donor participants recruited for the human genome '
'project? Summarize in three sentences.',
'result': 'The donor participants for the human genome project were recruited '
'through a variety of methods, including:\n'
'\n'
'1. Direct contact with potential donors via phone or email\n'
'2. Online advertisements on websites such as GENETIX and BioGPS\n'
'3. Referrals from other researchers and institutions\n'
'4. Posters displayed in public places\n'
'5. Flyers distributed at medical conferences and events\n'
'6. Social media campaigns, including Facebook and Twitter\n'
'7. Advertisements placed in scientific journals and newsletters\n'
'8. Personal invitations to potential donors from the research team '
'members themselves.',
'source_documents': [Document(metadata={'page_title': 'Human Genome Project (HGP) | History, Timeline, & Facts', 'source': 'https://www.britannica.com/event/Human-Genome-Project'}, page_content='carbohydrates and lipids. All these molecules work in concert to maintain the processes required for life. Are you a student? Get a special academic rate on Britannica Premium. Learn More Studies in molecular genetics led to studies in human genetics and the consideration of the ways in which traits in humans are inherited. For example, most traits in humans and other species result from a combination of genetic and environmental influences. In addition, some genes, such as those encoded at neighbouring spots on a single chromosome, tend to be inherited together, rather than independently, whereas other genes, namely those encoded on the mitochondrial genome, are inherited only from the mother, and yet other genes, encoded on the Y chromosome, are passed only from fathers to sons. Using data from the HGP, scientists have estimated that the human genome contains anywhere from 20,000 to 25,000 genes.'),
Document(metadata={'page_title': 'The Human Genome Project', 'source': 'https://plato.stanford.edu/entries/human-genome/'}, page_content='as a tool of governance, bioethics exercises power in ways often unseen, thereby foreclosing questions asked and debates had—for e.g., taking the boundary between facts and values as given not made, presenting rational moral arguments as outside politics to dismiss issues of public concern, or circumventing legislation by justifying the extension of existing regulations.'),
Document(metadata={'page_title': 'The human genome: 20 years of history - El·lipse', 'source': 'https://ellipse.prbb.org/the-human-genome-20-years-of-history/'}, page_content='such as genotype-tissue expression (GTEx), in which Guigó’s lab participates. The latter project attempts to understand the relationships between genome sequence variation between individuals and gene expression in different tissues. Although we still need to be able to explain “how the genomic variation of people has an impact on their function and results in differences such as, for example, the predisposition to have certain diseases,” concludes Guigó.'),
Document(metadata={'page_title': 'The Human Genome Project', 'source': 'https://plato.stanford.edu/entries/human-genome/'}, page_content='to focus on individual genetic differences within populations, not group genetic differences across populations. Pharmaceuticals, a powerful engine driving post-HGP research into human genetic differences, were supposed to be tailored to individual genomes. In 2003, Venter opposed the U.S. Food and Drug Administration (FDA) proposal to carry out pharmaceutical testing using the Office of Management and Budget (OMB) racial and ethnic classification system, arguing that these are “social” not “scientific” categories of race and ethnicity and that the promise of pharmacogenetics lies in its implementation as individualized medicine given the likelihood that variation in drug responses will vary more within racial and ethnic groups than among them (Haga and Venter 2003). However, en route to a “personalized” or “precision” medicine based on individual genetic differences and pharmaceuticals tailored to individual genomes, a detour via research into group genetic differences has been taken.')]}
Process finished with exit code 0
As of 9/7/24, running this works best with Ollama, which is in turn supported by LangChain. I may update this script in the future. https://ollama.com/blog/embedding-models