glossagen.pipelines package¶

Submodules¶

glossagen.pipelines.generate_glossary module¶

Module for generating a glossary based on a research document.

class glossagen.pipelines.generate_glossary.GlossaryGenerator(research_doc: ResearchDoc, chunk_size: int = 20000)[source]¶

Bases: object

A class that generates a glossary based on a research document.

Attributes¶

research_doc (ResearchDoc): The research document to generate the glossary from.

glossary_predictor (dspy.Predict): The predictor used to generate the glossary.

Methods¶

generate_glossary_from_doc: Generates the glossary based on the research document.

deduplicate_entries(glossary: list[TerminusTechnicus]) → list[TerminusTechnicus][source]¶: Deduplicate the glossary entries by considering plurals and similar-sounding terms.

format_nicely(glossary: list[TerminusTechnicus]) → str[source]¶

Format the glossary nicely.

Args: glossary (list[TerminusTechnicus]): The glossary to format.

Returns¶

str: The nicely formatted glossary.
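The exact output layout of format_nicely is not documented here. As a rough illustration only, the sketch below renders one "term: definition" line per entry, sorted alphabetically. The Term dataclass and the format_glossary function are hypothetical stand-ins, not part of glossagen.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """Hypothetical stand-in for TerminusTechnicus (a term plus its definition)."""

    term: str
    definition: str


def format_glossary(glossary: list[Term]) -> str:
    """Render each entry as 'term: definition', one per line, sorted by term."""
    entries = sorted(glossary, key=lambda t: t.term.lower())
    return "\n".join(f"{t.term}: {t.definition}" for t in entries)


glossary = [
    Term("MOF", "Metal-organic framework."),
    Term("BET", "Brunauer-Emmett-Teller surface area analysis."),
]
print(format_glossary(glossary))
```

The layout chosen here (sorting, separator) is an assumption; the real method may format entries differently.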

generate_glossary_from_doc() → Any[source]¶: Generate the glossary based on the research document.

Returns¶

Any: The generated glossary.

normalize_term(term: str) → str[source]¶: Normalize a term by converting it to lowercase and removing common plural endings.
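To make the interplay of normalize_term and deduplicate_entries concrete, here is a minimal sketch of plural-aware deduplication. The plural rules and the deduplicate helper are assumptions for illustration; the library's actual rules may differ, and this sketch does not attempt to match similar-sounding terms.

```python
def normalize_term(term: str) -> str:
    """Lowercase a term and strip a simple plural ending (assumed, naive rules)."""
    t = term.strip().lower()
    if t.endswith("ies") and len(t) > 4:
        return t[:-3] + "y"  # "properties" -> "property"
    if t.endswith("s") and not t.endswith("ss") and len(t) > 3:
        return t[:-1]  # "catalysts" -> "catalyst"
    return t


def deduplicate(entries: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first (term, definition) pair seen for each normalized term."""
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for term, definition in entries:
        key = normalize_term(term)
        if key not in seen:
            seen.add(key)
            unique.append((term, definition))
    return unique


entries = [
    ("Catalyst", "A substance that speeds up a reaction."),
    ("catalysts", "Substances that speed up reactions."),
    ("MOF", "Metal-organic framework."),
]
print(deduplicate(entries))
```

Keeping the first occurrence is a design choice of this sketch; rules this simple will also mangle singular words that end in "s", so a real implementation needs more care.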

class glossagen.pipelines.generate_glossary.KeepImportantTerms(*, termini_technici: list[TerminusTechnicus], important_terms: list[TerminusTechnicus])[source]¶

Bases: Signature

Keep only the important terms from a list of termini technici.

important_terms: list[TerminusTechnicus]¶

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'important_terms': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of important terms extracted from the termini technici.\n NEEDS to be abbreviations or very important terms.', '__dspy_field_type': 'output', 'prefix': 'Important Terms:'}), 'termini_technici': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of termini technici extracted from the text.', '__dspy_field_type': 'input', 'prefix': 'Termini Technici:'})}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

termini_technici: list[TerminusTechnicus]¶

class glossagen.pipelines.generate_glossary.TerminusTechnicus(*, term: str, definition: str)[source]¶

Bases: BaseModel

A terminus technicus, i.e. a technical term in materials science and chemistry.

definition: str¶

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'definition': FieldInfo(annotation=str, required=True, title='The definition of the technical term.'), 'term': FieldInfo(annotation=str, required=True, title='The technical term. Can also be an abbreviation.')}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

term: str¶

class glossagen.pipelines.generate_glossary.Text2GlossarySignature(*, text: str, glossary: list[TerminusTechnicus])[source]¶

Bases: Signature

Generating a list of termini technici from a text in materials science and chemistry.

glossary: list[TerminusTechnicus]¶

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'glossary': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of termini technici extracted from the text.\n ONLY TAKE VERY INPORTANT TERMS, no general terms like Chemistry.', '__dspy_field_type': 'output', 'prefix': 'Glossary:'}), 'text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'The text to extract the termini technici from.', '__dspy_field_type': 'input', 'prefix': 'Text:'})}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

text: str¶

glossagen.pipelines.generate_glossary.generate_glossary(document_directory: str, log_to_wandb_flag: bool = True) → Any[source]¶

Generate a glossary based on a research document.

Args: document_directory (str): The directory where the research document is stored.

log_to_wandb_flag (bool): Whether to log the generated glossary to wandb. Defaults to True.

Returns¶

Any: The generated glossary.

glossagen.pipelines.generate_glossary.log_to_wandb(glossary: list[TerminusTechnicus], chunk_size: int, project_name: str = 'GlossaGen', config: Dict[Any, Any] | None = None) → None[source]¶

Initialize wandb and log the generated glossary as a wandb.Table.

Args: glossary (list[TerminusTechnicus]): The list of glossary terms to log.

chunk_size (int): The size of the chunks the research document was split into.

project_name (str): The name of the wandb project. Defaults to 'GlossaGen'.

config (dict, optional): Configuration parameters for the wandb run.

glossagen.pipelines.generate_glossary.main() → None[source]¶: Demonstrate the generation of a glossary from a research document.

glossagen.pipelines.glossary_to_ontology module¶

Module to generate an ontology from a glossary.

class glossagen.pipelines.glossary_to_ontology.Glossary2Labels(*, input_text: str, labels: list[OntologyEntityLabels])[source]¶

Bases: Signature

Generating a list of entity labels (general overarching classes, e.g. Material, Property, …, Other) from glossary term and description pairs.

input_text: str¶

labels: list[OntologyEntityLabels]¶

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'input_text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'Glossary term and description pairs, one by line.', '__dspy_field_type': 'input', 'prefix': 'Input Text:'}), 'labels': FieldInfo(annotation=list[OntologyEntityLabels], required=True, json_schema_extra={'desc': 'List of general labels categorizing the terms.', '__dspy_field_type': 'output', 'prefix': 'Labels:'})}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class glossagen.pipelines.glossary_to_ontology.Glossary2Relations(*, input_text: str, relations: list[OntologyRelation])[source]¶

Bases: Signature

Generating a list of relations (single verb) between entities from glossary term and description pairs.

input_text: str¶

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'input_text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'Glossary term and description pairs, one by line.', '__dspy_field_type': 'input', 'prefix': 'Input Text:'}), 'relations': FieldInfo(annotation=list[OntologyRelation], required=True, json_schema_extra={'desc': '\n List of general relations between glossary terms for an ontology.\n Short terms or verbs describing the relation.', '__dspy_field_type': 'output', 'prefix': 'Relations:'})}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

relations: list[OntologyRelation]¶

class glossagen.pipelines.glossary_to_ontology.Ontology(*, labels: Dict[str, str], relationships: List[str])[source]¶

Bases: BaseModel

Represent an ontology with labels and relationships.

labels: Dict[str, str]¶

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'labels': FieldInfo(annotation=Dict[str, str], required=True), 'relationships': FieldInfo(annotation=List[str], required=True)}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

print_labels()[source]¶: Print all the labels in the ontology.

print_relationships()[source]¶: Print all the relationships in the ontology.

relationships: List[str]¶

class glossagen.pipelines.glossary_to_ontology.OntologyEntityLabels(*, label: str)[source]¶

Bases: BaseModel

An ontology label, i.e. an entity label in materials science and chemistry.

label: str¶

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'label': FieldInfo(annotation=str, required=True, title="A string 'label: description' describing a entity class.")}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class glossagen.pipelines.glossary_to_ontology.OntologyGenerator(glossary: Dict[str, str])[source]¶

Bases: object

A class that generates an ontology based on a glossary.

Attributes¶

glossary (Dict[str, str]): The glossary, mapping terms to definitions, to generate the ontology from.

Methods¶

generate_ontology_from_glossary: Generates the ontology based on the glossary.

generate_ontology_from_glossary(verbose: bool = False) → Any[source]¶: Generate the ontology based on the glossary.

Returns¶

Any: The generated ontology.

class glossagen.pipelines.glossary_to_ontology.OntologyRelation(*, relation: str)[source]¶

Bases: BaseModel

Ontology relation, i.e. relation between two entities.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'relation': FieldInfo(annotation=str, required=True, title='A relation type of the ontology.')}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

relation: str¶

glossagen.pipelines.glossary_to_ontology.generate_ontology_from_glossary(document_directory: str) → Any[source]¶: Generate ontology from a glossary.

glossagen.pipelines.knowledge_graph module¶

Pipeline to generate a knowledge graph from research documents.

glossagen.pipelines.knowledge_graph.create_documents_from_text_chunks(text: str, max_length: int = 2000) → List[Document][source]¶: Create documents from text by dividing it into chunks of max_length characters.
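A minimal sketch of fixed-size character chunking, as described above. The chunk_text helper is hypothetical and returns plain strings; wrapping each chunk in a Document object, as the real function does, is left out because the Document type is not specified on this page.

```python
def chunk_text(text: str, max_length: int = 2000) -> list[str]:
    """Split text into consecutive chunks of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Step through the text in strides of max_length; the last chunk may be shorter.
    return [text[i : i + max_length] for i in range(0, len(text), max_length)]


chunks = chunk_text("abcdefghij", max_length=4)
print(chunks)
```

Plain slicing can cut a word or sentence in half at a chunk boundary; whether the library guards against that is not stated here.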

glossagen.pipelines.knowledge_graph.main() → None[source]¶: Orchestrate graph generation from research documents.

glossagen.pipelines.latex_glossary module¶

Extract glossary from LaTeX document.

glossagen.pipelines.latex_glossary.extract_text_from_latex(latex_file_path: str) → str[source]¶: Extract readable text from a .tex document, focusing on the content between \begin{document} and \end{document}.
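One way such an extraction could work is sketched below with regular expressions. This is an assumption, not the library's implementation: the hypothetical extract_document_body takes the LaTeX source as a string rather than a file path, and it only strips comments, command names, and braces.

```python
import re


def extract_document_body(latex_source: str) -> str:
    """Return rough plain text from the body of a LaTeX document."""
    # Keep only what lies between \begin{document} and \end{document}, if present.
    match = re.search(
        r"\\begin\{document\}(.*?)\\end\{document\}", latex_source, re.DOTALL
    )
    body = match.group(1) if match else latex_source
    # Drop comments: an unescaped % up to the end of the line.
    body = re.sub(r"(?<!\\)%.*", "", body)
    # Drop command names and optional [..] arguments, keeping braced content.
    body = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?", "", body)
    body = body.replace("{", "").replace("}", "")
    # Collapse all whitespace runs into single spaces.
    return " ".join(body.split())


source = (
    "\\documentclass{article}\n"
    "\\begin{document}\n"
    "\\section{Intro}\n"
    "MOFs are \\textbf{porous} materials. % note\n"
    "\\end{document}\n"
)
print(extract_document_body(source))
```

A regex pass like this leaves math, environments, and citation keys largely untouched, so a dedicated LaTeX parser would be more robust for real papers.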

glossagen.pipelines.latex_glossary.main(latex_file_path: str) → None[source]¶: Extract glossary from LaTeX document.
